
Can we please put the science back into computer science?

A few months ago I watched an excellent talk by Greg Wilson called "What We Actually Know About Software Development, and Why We Believe It's True". It calls out the lack of actual scientific data in the software industry, and I haven't been able to shake it since. His main point is that our industry has very little *actual scientific evidence* for its best practices. I mean, seriously, we don't actually know that agile works better than waterfall. WHAT?!? HOW IS THAT POSSIBLE?!?

Sure, a lot of the things we do may be right. But the point is, we don't actually know. Industry needs to work with academia to systematize and prove our software best practices. Wouldn't you like to know which design patterns lead to fewer bugs? Or which language leads to the fastest development? Or which organizational structure leads to the fewest meetings? Or which coding style is most readable? Or which web framework works best for your size of organization?

Here are some interesting results he cites:

  • Boehm (1975): most errors are introduced during requirements analysis and design. Not during writing and debugging. The later a bug is removed, the more it costs to remove.
  • Woodfield (1979): For every 25% increase in problem complexity, there is a 100% increase in solution complexity. It's non-linear because of interaction effects (see the back-of-the-envelope sketch after this list).
  • van Genuchten (1991): The two biggest causes of project failure are poor estimation and unstable requirements, neither of which seems to be improving in the industry as a whole.
  • Thomas et al. (1997, but it hasn't been replicated yet): If more than 20-25% of a component has to be revised, it's better to rewrite it from scratch. (This study was done on software for flight avionics, so it may or may not generalize).
  • Fagan (1975 at IBM): Hour for hour, sitting down and reading the code is the most effective way to remove bugs. It's better than running the code and better than writing unit tests. 60-90% of all errors can be removed before the very first run of the program. BUT, all of that value comes from the first hour by the first reviewer. You can only read code for about an hour before your brain is full - only a couple hundred lines of code, depending on skill and practice. This means we should keep patches and changes small.
  • Herbsleb (1999): The architecture of the system reflects the organizational structure that built it.
  • Nagappan (2007) & Bird (2009): Physical distance between developers on a team doesn't affect the post-release fault rate (for Windows Vista). Distance in the org chart does. So it's fine to have people work remotely as long as they're on the same team; just don't put developers who report to different managers on the same project.
  • El Emam (2001): No code (or other) metrics that were published before 2001 have any correlation with post-release error rate beyond that predicted by source lines of code (SLOC).
  • Aranda & Easterbrook (2005): When estimating how long a project will take, the only thing in the spec that matters is how long the writer of the spec thinks it will take (i.e. the anchoring effect from psychology). "All work done to date on software estimation is pretty much pointless. All the engineers are going to give us back is what we want to hear."
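
To get a feel for how brutally that Woodfield number compounds, here's a rough back-of-the-envelope sketch in Python. The compounding interpretation (each +25% of problem complexity doubles solution complexity, applied repeatedly) is my reading, not something the study itself spells out, and the numbers are purely illustrative.

```python
import math

# Woodfield (1979), read as a compounding rule (my assumption, not the study's):
# every +25% of problem complexity doubles solution complexity.
# Compounded, that means solution ~ problem ** (log 2 / log 1.25) ~= problem ** 3.1.
EXPONENT = math.log(2) / math.log(1.25)  # about 3.1

for problem_growth in (1.25, 1.5, 2.0, 3.0):
    solution_growth = problem_growth ** EXPONENT
    print(f"problem x{problem_growth:>4} -> solution x{solution_growth:5.1f}")

# problem x1.25 -> solution x  2.0
# problem x 1.5 -> solution x  3.5
# problem x 2.0 -> solution x  8.6
# problem x 3.0 -> solution x 30.3
```

Whether or not the exact exponent holds up, the shape is the point: a modest-looking increase in scope can blow the solution up by an order of magnitude.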

BUT. And this is a big but (and I cannot lie). Science does not move forward all the time. It makes mistakes, but it acknowledges those mistakes publicly. The results listed above MAY NOT HOLD for you, or in general. It's important to know the details of how each result was obtained and what its limitations are. Especially because this field is so new, a lot of these results are not very generalizable and may not be right. But they are a starting point. This is how science works.

For example, there is a fairly widely quoted study that says the best developers are ten times better than the worst. This result is complete malarkey. The study was done in 1968 using punch cards, it doesn't describe how "productivity" was measured, it compares the best to the worst (which will always be a big number; the standard deviation would be the honest statistic), and the sample size was 12 programmers over a few hours of coding.
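
To see why "best versus worst" is a weak statistic with a sample of 12, here's a toy simulation. The productivity numbers are invented (a lognormal distribution with a made-up spread); the only point is that the max/min ratio of a small sample looks dramatic even when every "programmer" is drawn from exactly the same distribution.

```python
import random
import statistics

def best_to_worst_ratio(n=12, sigma=0.5):
    """Draw n made-up 'productivity' scores from one lognormal distribution
    and return the best/worst ratio. sigma is an invented spread parameter."""
    scores = [random.lognormvariate(0, sigma) for _ in range(n)]
    return max(scores) / min(scores)

random.seed(1968)  # arbitrary seed, just for repeatability
ratios = [best_to_worst_ratio() for _ in range(10_000)]
print(f"median best/worst ratio: {statistics.median(ratios):.1f}x")
# Lands around 5x, with a long tail far above that -- a headline-sized number
# produced by sampling alone, with zero real skill difference in the "data".
```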

tl;dr: "If we ask a question carefully, and if we're willing to be humble enough to admit when we're wrong, to look at the data, then we can find out how the universe actually works, and then we can do things based on that knowledge."

Amen.

 

Andrew Carman