Friday, November 7, 2014

Why Automated Tests Suck

...or how to build a test plan that doesn't kill fairies.  This post discusses a few core automated test plan guidelines.  Bad tests can actually defeat the purpose of automated tests entirely and really bad tests make manual testing a better option (wat?).

Noise vs. Signal

Let's take a moment to compare a test plan to something that would make Carl Sagan feel like a kid in a candy store: exoplanet research!  Scientists often search for exoplanets by seeking-out an oscillating signal amid a great deal of noise.  Essentially you're looking for evidence of a wobble in the star's movement because of a planet's (or a borg transwarp nexus') gravity is tugging on it as it revolves around its star.  But at such great distances away such a signal is encumbered by a myriad of problems.  The signal to noise ratio for this analysis is less than 1%.  Finding genuine planets is incredibly difficult and false positives are common (Zarmina).  What's my point?  The more noise, the harder it is to isolate a genuine signal.  Similarly, if you have tests that fail (and it's ok, those failures don't really count) then it becomes much harder to isolate genuine test failures from test failures that "don't matter."  Particularly for a release test plan, your release test plan should have zero failures and zero skipped tests.  The more you skip tests or ignore tests, the more you erode its utility and reliability as a device that produces a definitive and objective result.

I'm a New Hire Idiot

Another reason to ensure that your release test plan has absolutely no failed or skipped tests is because your test plan should be self-evident and be independent of idiomatic understanding.  For example you should never have to say that "test x is failing because we haven't updated the x87g filter data model for the cogsworth subsystem metacontroller yet, just ignore it because I don't think anybody uses that component anyway."  This is an example of idiomatic understanding and causes the naive person to lose trust in your test plan (and pull their hair out).  When new hires join your organization (or transfers from another team) often ramping-up on idiomatic understanding is so difficult because it is fragmented, subjective, inconsistent, and exemplary of the human vices that lead to conflicts.  Make tests a priori in their function and purpose.

Tests as Strong as Bismuth!

Tests should not be so brittle that you're spending more time updating your test data than you are writing code that brings value to your users.  In fact, when this is true often the tests are abandoned and left failing and then ignored and then forgotten and yet they still linger and co-ops ask about them and get an earful about the myth of the one test and the war of the first build, but it was all for naught because Isildur was struck down by a brittle test; volatile data pierced his heart.  And yet no co-op has been found worthy to take the one test deep into JMeter and destroy its design once and for all.

One might think that tests like these ought to be removed, but it isn't clear to the average person whether that would do more harm than good.  If a test fails too often then one of two things has to happen: either you update the test, or you completely redesign the test.  Removal of a test is quite simply a step backward.  Someone saw fit to provide that coverage initially, so upon what basis does one reason that such coverage is no longer necessary?  One side-effect of test removal is that can signal to the naive person that the associated features have also been removed (then you get a call from an angry customer threatening to sue your company for breach of contract when the feature broke and was never fixed, well, more often they will simply move-on and invest their dollars elsewhere silently.  The sad thing is this will happen too with brittle constantly failing tests that go ignored).

Sometimes a test that fails too often benefits from a complete overhaul because quite likely, and I'll put it bluntly, your test is just wrong.  A test that isn't self-contained is an integration test.  Some believe integration tests are as scam, I believe they are unavoidable.  That isn't necessarily a mutually exclusive statement either come to think of it.  However, if your unit test doesn't need to consume external resources (of any kind) then ensure it doesn't.  In fact, you may want to redesign the code the test is built around. "But wait Batman!  This is looking a lot like TDD!"  You're right, it is.

Tests Express Simple Statements

Each individual test should make a specific statement that can be used to ensure a straightforward objective expectation is aligned with actual state.  Here are some examples:
  1. When I call the delete function passing existing user x to it, it should yield the HTTP 200 success response.
  2. If I call the delete function passing a non-existent user x to it, it should yield the expected HTTP 404 response.
  3. When I call the get model foo function, the response body should contain a JSON object containing at least all of the following field names.
Avoid writing tests that involve too many test conditions and assertions.  Otherwise, it starts to take on the stench of the integrated test.  This may manifest as either multiple assertion statements or for example the exclusive comparison of a result against a template.  These are both examples of testing too many conditions in one test and does not resolve to a clear statement.  For example, compare the following two statements:
  1. The result should contain the following fields in this precise order and contains nothing else besides those fields and have values match these other values precisely.
  2. The result should contain at least the following fields.
Number two is a good definition for a test whereas number one can be broken-down into 4 separate statements (and consequently 4 separate tests).  Tests that test multiple conditions are often duplicating those conditions in other tests, and when they fail it simply adds more noise that developers and QA have to sift through like a child drowning in a playpen full of plastic balls.  Test plan noise erodes usability, refined targeted tests that reflect clear and concise statements increase clarity and usability.  If you cannot express your unit test in one coherent sentence, then it's not a unit test.

It's Not My Fault! The Server Was Down

This essentially embodies the difference between a unit test and an integration test.  However, I want to attack this issue from an angle independent of vernacular.  Tests should be consistent in terms of the environment and resources upon which they depend.  The other dimension to consider besides space is time.  A good way to rattle a developer's nerves is to introduce a test that downloads JSON from some bizarre and esoteric web service in China on Wednesdays unless it's a full-moon with the exception of when Mars is at apposition precisely 39.2 hours before daylight savings.  This is a hyperbole, I don't think anyone would actually write a test like this (knock on wood), but it illustrates a scoping problem concerning time and space (e.g. China doesn't even use DST).  A reliable, self-contained, test without side-effects should not be contingent upon transient state such as specific dates, times and system locale.  With respect to the environment, all of the above can break, and it's never clear when that occurs.  A machine's internal clock can get thrown out of whack, or the system locale may be subtly different (e.g. en_CA vs en_US).

If you must test a system that relies upon something non-deterministic such as one of the following:
  1. The current time
  2. A random number
Then design TimeProvider and RandomNumberProvider interfaces and build them into your code.  In your unit test, you can fulfil these interfaces with implementations that are deterministic.

The environment and resources for which a test plan requires should be reproducible upon one machine and one machine only.  If it is absolutely necessary to introduce external dependencies then categorize your tests (discussed later).  Furthermore, tests should never assume locale or timings, you are not the master of the universe, He-Man beat you to the punch decades ago; or a century ago if this blog is still around eighty some-odd years from when it was written (that's good for a laugh); or I suppose any x number of years ago from now (your now, not my now) since it's impossible for me to predict when this post will be read (...or how a crazy person made a point by stumbling into his own dog food, tune in at 6pm tonight for the full story).  Here are some examples of unhelpful failures of tests using such kinds of resources:
  1. An authentication token used in requests expires on a specific date
  2. A DNS failure occurs for a host
  3. A static IP was changed for a host
  4. A host was down
  5. Matching localized error messages against a template
  6. Testing values from a database maintained by another team / department
  7. Using sleep to wait for some resource to become available
And suggested resolutions for these conditions:
  1. Generate a token as part of the test setup phase, destroy it as part of teardown.
  2. Use multiple DNS otherwise see #4.
  3. Never ever rely on static IPs, they're not even descriptive and you don't have a monopoly on what IT does with the infrastructure.
  4. Rather than fail the test, flag it as skipped, categorize this test elsewhere.
  5. Compare error codes, not messages
  6. Have your own copy of the database, update the database in tandem with the tests.  Consider the database as part of the coverage.
  7. Either poll or block, but don't sleep because it's not reliable to predict how much time a resource will require.  Apply a timeout, and if it elapses then flag the test as skipped, categorize this test elsewhere.
In general a test contingent upon external resources that become unreachable from time to time, if you must write such tests, they should not fail when the external resources become unavailable, rather they should be flagged as skipped. i.e. employ trinary state: pass, fail, skipped.

One may argue that it is sufficient to run the tests again if they fail because they often succeed eventually.  This is essentially saying that it's okay that these features don't always work, so don't worry if it blows-up, just keep hitting refresh until something shows-up.  Side-effects are ok, but side-effects in unit tests are just plain evil.  Don't go there.  Depending on your organization a test suite can take anywhere from half an hour to more than a day (maybe even more?).  If you expect people to simply re-run the test suite to get tests to pass, you're asking them, in the worst case scenario, to invest another day particularly if your tests aren't entirely stateless.

Now for the elephant in the room.  Ideally your tests are isolated and self-contained and have no reliance on external resources.  You might then shoot back with "But how are we supposed to test our REST API then without relying on external resources?"  You're asking the wrong question.  Solid tests are the product of a project designed from the ground-up that ritually employs Test-Driven Development.  If you find yourself in a situation where you have to employ integration tests to provide any coverage to some sub-system, then this is an example of code that is simply not testable (as far as TDD is concerned).

Categorize Your Tests

This is more of a suggestion than a rule of thumb.  TDD is not an exact science, in fact software engineering as a whole is an inexact science.  If you must have tests contingent upon transient, temporal or volatile state, then divide your test plan into categories characterized by the type of test:
  • Self-contained tests
  • Tests that interact with external resources
  • Tests involving temporal state
  • Tests that cannot be run concurrently with others
Additional to categorizing tests, thoroughly document any unreliable characteristics of the test so that the naive person will understand better what's happening when it is mysteriously skipped.  All test plan categories should be stored within the same test plan.  In other words, when someone opens the test-plan, all tests should be visible or at least evident and no tests should be hidden away in some obscure folder of the file-system.  Each category should also have a description.

Build a House then Tear it Down

Good tests should be stateless, idempotent, without side-effects.  Running one test should not subsequently affect the result of another test.  It should be possible to run any number of tests, cherry-picked or otherwise, in any order with consistent results each time.  Each and ever test should have its own setup and teardown.  Each test should setup the state it requires and teardown that state afterward leaving nothing behind that wasn't there initially and restoring anything that was there initially.  This may dramatically increase the time to execute a test plan, however a high-performance set of tests is not conducive to the fundamental purpose for a test plan.  Troubleshooting tests whose results appear to vary with each other is notoriously difficult particularly if such variance is not reflected in the purpose of the test.

Everyone Loves Minions

...or many threads to do your bidding.  Design your tests in such a manner that they can be forked and run concurrently.  The reason for this is to improve the performance of your test plan and make it scale so that you can simply farm it out to as many machines as you need to reduce that day-long test suite execution to just under an hour.  Another notable benefit of such a design depends largely on your product, but particularly for service-oriented software, concurrency is a critical factor of realistic use cases.

Here are some good design principles for a test plan that scales:

Remember Test-Driven Development?

Before you sit down to write a feature consider that it should be testable.  Think about testable code with respect to tests that can scale.  Can your code reliably service multiple requests concurrently without race conditions?  If you find it is too difficult to write tests that can be executed concurrently without randomly failing then it's likely not the test rather the feature design that's implicated.  A cornerstone of TDD is that asking questions about what you expect a sub-system to do helps influence good interface design.  This is why it's important to think about the test before implementing the feature.  The feature and its tests (the tests and its feature?) have an inextricable relationship.

Unique IDs

When you establish state for your test, or more simply put, creating records during setup, then create your entities with unique names keyed by the test name and current thread executing in the test plan.  This should be little more complex than using a name and a numeric sequence.  A numeric sequence to represent thread is more helpful than some ubiquitous unique ID because for example in the case of a race condition it can hint at the point of failure during thread interleaving.  For example, if you're testing deleting users, and the user name is the primary key, come-up with a username like "bob.smith-deletetest-thread-3."  When you tear down your state, rather than track the entities created during setup, remove all entities that match the template used to generate them (e.g. remove all "bob.smith-deletetest-thread-< any number >" entities).  This way if there are side-effects from a previous execution of the test, this will account for them.  It is also good practice to perform this before setup as well for it is reasonable to assume that the test owns any state keyed by a specific pattern that the test uses during setup especially if the name of the test is part of that key pattern.  Here is some example code:

function cleanup()
    var objects = service.getUsers();
    var re = /^bobsmith-deletetest-thread-\d+$/;
    for (var c = 0; c < objects.length; ++c)
        if (re.test(objects[c].name))
function setup()
        name: 'bobsmith-deletetest-thread-' + threading.getCurrentThreadNum()
function teardown()

Some tests by their nature imply a has before relationship, for example the "test global lockout policy" test that locks out all users for security reasons would probably cause all other tests to fail too.  It may be prudent to separate these tests into yet another isolated category so that the distinction in test characteristics is clear.

Last Word

In this post I discussed how test noise can erode the usability and reliability of a test plan.  Eventually people will learn not to trust what the tests are telling them and bugs are permitted to fly beneath the radar.  I also discussed how this is related to idiomatic understanding within the scope of the company, team, or even individual responsible for a particular test, and how this leads to conflicts and misunderstandings and broken builds and weeping release managers.  Tests should be focused and targeted and not try to account for a lot of different conditions.  Tests that involve external dependencies or temporal, transient and volatile resources are also more brittle than tests with a more self-contained environment.  I only advocate such tests if TDD is impossible due to legacy code that cannot be tenably refactored.  Categorize tests based on their nature so that if you must have tests more brittle than others, at least it's clear which ones are more brittle than others and document the why and the how (to resolve).  Tests should be tasteless, I mean stateless (it's lunchtime and I skipped breakfast).  Where possible, build tests that can be executed concurrently to improve performance and ensure that your features can handle concurrency well.