Saturday, February 8, 2014

joel on software and the quixotic nature of complete testing

UPDATE: I (re-)found some links that make great, thoughtful points about the topic of unit testing. Contrarian Software Development's "Unit Testing Sucks" is one (that site is also the source of a great quote by Bob Walsh: "Test Drive Development is like grammar driven literature."). "Writing Great Unit Tests: Best and Worst Practices" takes a more optimistic view about doing it right, while acknowledging the giant gulf of doing it wrong that must be avoided...

For a while now I've been trying to reconcile my skepticism about most automated testing with the importance many smart people (including hiring-type folks!) place on it. Some of the problem is that I've never seen it done really well, so at this point I lack the experience to safely find a way to those safe, refreshing waters between
  • the Scylla of tests that work over trivial functionality and 
  • the Charybdis of tests that blow up even with the most well-meaning of refactorings. 
In practice I've seen tests as a way of multiplying the workload of active development, and then teams stubbing them out or otherwise ignoring lots of "red" results because while the software is actually doing its job, the tests aren't being kept meaningfully aligned with that.

I keep on pondering... though I live in fear that I'll never know if I'm just rationalizing my own laziness and prejudice for "making stuff!" over "engineering", or if I can be confident enough in my own experience to really justify a stance that doesn't love unit tests.

Thinking about my own coding process: I break what I want the software to do into almost ridiculously small pieces and so have a super tight code-run-inspect-results loop... (I really hate situations where a whole aggregate of code is not working but I have no idea where my assumptions are incorrect. (Pretty much all debugging is finding out which of the assumptions you made in code form is wrong, and when.)) So as I code up a complex bit, I write a lot of "runner" code that exercises what I'm doing, so that I can get to a point where I'm looking at the results as quickly as possible. This might be where I part ways from the Flock: I view this code as scaffolding to be thrown away once the core is working, but the Unit Testing faithful would have me change that into a test that can be run at a later date. There are two challenges to that: one is that most of my scaffolding relies on my human judgement to see if it's right or wrong, and the other is that my scaffolding is designed for the halfway points of my completed code. Parlaying it into a test for the final result that gives a yay or a nay seems tough; doing so in a way that survives refactorings and also does a good job of evaluating the UI aspect of it (often a big part of what I'm working on) seems almost impossible.
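
To make the scaffolding-versus-test distinction concrete, here is a minimal sketch in Python (picked only for illustration). The function `split_words` is made up for the example and is not from any real project. The `runner` prints results for a human to eyeball, while `test_split_words` freezes that same judgement into assertions that can be rerun later.

```python
# Hypothetical function under development; the name and behaviour are
# invented for illustration only.
def split_words(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def runner():
    """Throwaway scaffolding: print results and judge them by eye."""
    for sample in ["Hello World", "  padded   input  ", ""]:
        print(repr(sample), "->", split_words(sample))


def test_split_words():
    """The same samples, with the human judgement frozen into assertions."""
    assert split_words("Hello World") == ["hello", "world"]
    assert split_words("  padded   input  ") == ["padded", "input"]
    assert split_words("") == []


if __name__ == "__main__":
    runner()
    test_split_words()
```

The conversion is cheap here because the function is pure and the "right answer" is easy to write down. Scaffolding aimed at a halfway point, or at how a UI looks, has no such tidy expected value, and that is the harder case described above.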

Some of my skepticism, too, comes from the idea that... small bits of code tend to be trivial. It's only when they're working together that complexities arise, chaos gets raised, and assumptions get challenged. So I'm more of a fan of component testing. And it's great that you've made your code so modular that you can replace the database call with a mock object but... you know, when I think about the problems I've seen on products I've built, it's never stuff in these little subroutines or even the larger components. It's because the database on production has some old wonky data from 3 iterations ago that never got pushed to the DBs the programmers are using. It's because IE8 has this CSS quirk that IE9 doesn't. In other words, stuff that automation is just terrible at finding.
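
As a rough sketch of what "replace the database call with a mock object" looks like, here is a small Python example using the standard library's `unittest.mock`. The function `active_user_names`, the `fetch_users()` method, and the row shape are all assumptions invented for the illustration.

```python
from unittest.mock import Mock


def active_user_names(db):
    """Return the sorted names of active users.

    `db` is any object with a `fetch_users()` method that returns a list
    of dicts; both the method name and the row shape are assumptions.
    """
    return sorted(row["name"] for row in db.fetch_users() if row["active"])


def test_active_user_names_with_mock():
    # Stand-in for the real database layer: returns canned rows.
    db = Mock()
    db.fetch_users.return_value = [
        {"name": "zoe", "active": True},
        {"name": "bob", "active": False},
        {"name": "amy", "active": True},
    ]
    assert active_user_names(db) == ["amy", "zoe"]
    db.fetch_users.assert_called_once_with()


def test_wonky_production_row():
    # A row shaped like old data the mock above never imagined:
    # the "active" key is missing entirely.
    db = Mock()
    db.fetch_users.return_value = [{"name": "old-timer"}]
    try:
        active_user_names(db)
    except KeyError:
        return "KeyError"
    return "no error"
```

The first test passes happily against canned rows. The second shows the kind of failure that only appears if someone already knows the old data exists and writes a row for it, which is the gap between mocked tests and production described above.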

Two other things add to the challenge:
  1. A coder can only write tests for things he or she can imagine going wrong. But if they could imagine it going wrong, they would have coded around that anyway.
  2. It's very hard to get someone to find something they don't really want to find. (i.e. a bug in their code.) This is why I only put limited faith in coders doing their own testing, at least when it's time to get real.
So, this is where I am now. But I'm defensive about it, worried it makes me look like a hack... and to some extent I can be a hack, sometimes, but I'm also a hack who is good at writing robust, extensible, relatively easy to understand code. Anyway, to defend myself, I've sometimes paraphrased this idea that "any sufficiently powerful testing system is as complex (and so prone to failure!) as the system it's trying to test." But I may have just found the original source of this kind of thinking, or at least one of the two... it comes from part of a talk Joel Spolsky gave at Yale:

In fact what you'll see is that the hard-core geeks tend to give up on all kinds of useful measures of quality, and basically they get left with the only one they can prove mechanically, which is, does the program behave according to specification. And so we get a very narrow, geeky definition of quality: how closely does the program correspond to the spec. Does it produce the defined outputs given the defined inputs.
The problem, here, is very fundamental. In order to mechanically prove that a program corresponds to some spec, the spec itself needs to be extremely detailed. In fact the spec has to define everything about the program, otherwise, nothing can be proven automatically and mechanically. Now, if the spec does define everything about how the program is going to behave, then, lo and behold, it contains all the information necessary to generate the program! And now certain geeks go off to a very dark place where they start thinking about automatically compiling specs into programs, and they start to think that they've just invented a way to program computers without programming.
Now, this is the software engineering equivalent of a perpetual motion machine. It's one of those things that crackpots keep trying to do, no matter how much you tell them it could never work. If the spec defines precisely what a program will do, with enough detail that it can be used to generate the program itself, this just begs the question: how do you write the spec? Such a complete spec is just as hard to write as the underlying computer program, because just as many details have to be answered by spec writer as the programmer. To use terminology from information theory: the spec needs just as many bits of Shannon entropy as the computer program itself would have. Each bit of entropy is a decision taken by the spec-writer or the programmer.
So, the bottom line is that if there really were a mechanical way to prove things about the correctness of a program, all you'd be able to prove is whether that program is identical to some other program that must contain the same amount of entropy as the first program, otherwise some of the behaviors are going to be undefined, and thus unproven. So now the spec writing is just as hard as writing a program, and all you've done is moved one problem from over here to over there, and accomplished nothing whatsoever.
This seems like a kind of brutal example, but nonetheless, this search for the holy grail of program quality is leading a lot of people to a lot of dead ends.

I would really appreciate feedback from veterans of the testing wars here: either people who see where I'm coming from, or who vehemently disagree with me, or, best yet, who see where I'm coming from but can guide me to the Promised Land of testing feeling like a good way of knowing code is working and communicating with other developers, rather than an endless burden of writing everything twice and still having the thing flop in production.

4 comments:

  1. I used to believe #1 until I wrote a test and found I'd incorrectly handled a case I'd considered. I'd made an error in my logic to handle the case.

    Also, this ignores the idea that you write a test when you fix a bug to demonstrate that you understand the bug and have addressed it properly. In this case, you didn't foresee the problem.

    Third, starting with test cases means you think through the different scenarios your code needs to handle. When I interview a candidate with a coding problem, it's a really good sign if they start by coming up with cases.

    This also misses that tests aren't necessarily for you but for other developers: tests can act as sample code so people can see how to call and interpret the return values. They are also helpful when performing code reviews to demonstrate to the recipient that the code works. On a couple occasions, I've examined people's code and demanded unit tests for cases I know will fail because they hadn't considered the case in their logic.

  2. Another example because this literally came up this week for me:

    Engineer 1 wrote a method, and I wrote some code that relied on it.
    Engineer 2, 4 days later, rewrote the method and swapped the order of 2 of the 4 returned values (this was Scala) *and* didn't update my code or the comment block describing the return values.

    My code ran perfectly fine but gave incorrect results, causing me and engineer 1 about 4 hours of head scratching on Tuesday.

    If a test existed for the method, it would have been immediately obvious that the return values had been swapped.

  3. Sometimes, having thorough unit tests is extremely difficult, and UI code is one of those places. When you're writing Java, there are some pretty sophisticated tools these days for writing nicely isolated unit tests that deliver a good deal of value for the effort required to write them. Tools like DBUnit, Arquillian, and EasyMock, to name a few I'm using currently, allow me to test exactly the piece of code I'm trying to test, so I don't get spurious failures. When I'm working on things that ultimately execute in the browser, like CSS, HTML, and JavaScript, however, I don't have those tools. QUnit isn't bad, but it's only going to help with the JavaScript. It really is important not to spend more effort on the test than it's going to save you to have that test written. I think that's where some of the TDD devotees get it wrong. Something that's difficult to test doesn't always indicate a poor design. Sometimes it merely indicates a lack of robust testing tools for the technology I'm working in. And sometimes it indicates that the test itself isn't a unit test at all, but an integration test.

    With that said, though, I have to say that I do think good unit testing is important, and where the tools exist to allow me to do it, I not only take advantage of it, I expect the rest of my team to do so, as well. As a Tech Lead, when a developer checks in code that doesn't have a unit test, I ask them to write one. I don't expect a set of unit tests to cover every possible eventuality the code might have to consider. That's just plain unrealistic. I also don't bother writing unit tests for truly trivial code. The common form Getters/Setters don't need them, for example.

    Ultimately, we want to write software that works, and the only way to be certain it works is to see it perform its job. Tests allow us to do that. More interestingly, tests that are part of the build allow other members of the team to see my code working, and their tests allow me to see their code working. It speeds my debugging tremendously when I can look at a unit test, see what is being exercised, and rule out misbehavior in some code segments because of it.

  4. It sounds to me like one thing you're wrestling with is the definition of a "unit". When I write a JPA object, and the tests for it, what is it I'm trying to test? Certainly not the "code" of the entity object; it's a simple POJO bean, with almost no logic. But I mapped that entity with some relationships, and I want to know that when I retrieve my object, I can get to its children correctly. So I want the test to look up an object that has children, and verify that all the children I expect are there. Later, when I'm writing some code that lives in the middle tier, and uses that entity, I don't need to add code there that verifies it can read the children from the entity. But suppose that code's job is to take a list of N children, and sort them among M parent objects based on some subset of their properties. It's useful to be able to test that with the persistence tier "unhooked", because it tightens the code-test-verify loop dramatically, and because it doesn't rely on database state to run, it's repeatable. So when someone else inadvertently breaks my code with a refactoring elsewhere, they find out as soon as they run the test.

    It's important to understand that unit tests don't eliminate the need for higher level testing. What they do, though, is make it more likely that those tests will succeed in the majority of cases. And since those higher level test cycles are a much longer loop, anything you can do to make them more likely to succeed that's based on the tighter loop you have at your desk is going to be an overall win.

    The project I'm on right now, we've been doing really extensive testing, using Arquillian, DBUnit (using the Arquillian integration for it), and EasyMock. I use reflection to test private methods so I don't have to make them public just for testing purposes. I use EasyMock to replace layers, so that, for example, I can verify that code that's meant to send emails works correctly, without spamming someone with emails on every build. DBUnit allows me to control the state of the database immediately prior to the test, so I can verify my JPA mappings work correctly, and that my queries return the data I expect. Arquillian lets me run the test within the container, so I can leverage injection, JNDI values, and so forth when I need them, but still do so isolated to only exactly what I need. And here's the thing: development is going faster, because everyone on the team finds more bugs earlier, and can fix them while the relevant code is still fresh in their mind. We've reached a high level of quality, even though some members of the team are using some technologies for the very first time. And when I go sit with someone to help them through a tough bug, having a unit test that reproduces the bug makes that go faster, as well, so I spend less time troubleshooting other people's code with them, and more either writing my own code, or explaining to them how this new (to them) technology works.

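
The swapped-return-values story in the second comment can be sketched roughly as follows. This is Python rather than the Scala of the original incident, and `summarize` with its four returned fields is a hypothetical stand-in, since the real method is not shown. The point is only that a test which pins each returned position catches a silent reordering that the calling code would happily accept.

```python
def summarize(values):
    """Return (count, total, minimum, maximum) for a non-empty list.

    Hypothetical stand-in for the four-value method in the comment;
    the real Scala method's name and fields are not known.
    """
    return len(values), sum(values), min(values), max(values)


def summarize_swapped(values):
    """The 'rewritten' version: same values, two positions swapped."""
    return len(values), sum(values), max(values), min(values)


def check_return_order(fn):
    """Pin down the position of each returned value."""
    count, total, minimum, maximum = fn([4, 1, 2])
    assert count == 3
    assert total == 7
    assert minimum == 1
    assert maximum == 4


def order_is_intact(fn):
    """Return True if fn passes the order check, False otherwise."""
    try:
        check_return_order(fn)
    except AssertionError:
        return False
    return True
```

Here `order_is_intact(summarize)` holds and `order_is_intact(summarize_swapped)` does not. The input is chosen so that all four values differ; with an input where two of them coincide, a swap of those positions would slip through unnoticed.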