
Where Do Software Bugs Come From?

At one point in my career I had the privilege of working on a software system whose output determined the route that trucks would take. Programming had always been fun and creative, but this was the first time that it also felt dangerous and scary. A bug in the truck routing system meant that objects in the real world would misbehave. Schedules would be missed, money would be spent, and customers would be angry. Also, unlike other software systems, a restart or re-deploy wouldn't fix the problem - the vehicles were already out in the world and restarting routes was not an option due to schedules promised to clients.

As a result, we took safety and quality very seriously. We had a strong battery of tests, code was meticulously reviewed by senior engineers, and we had a team of QAs - but none of that meant we didn't make bugs.

Bugs != programming errors

As software engineers we are trained to think that programming errors (e.g. null pointers, out-of-bounds array accesses, etc.) are the main kind of bugs to watch out for, since they will crash our program. We made these kinds of bugs too, but what I very quickly learned was that although they were important to prevent, the really big bugs rarely produced crashes; instead they had to do with our routing decisions. A null pointer could be caught by error tracking tools and a fix could be re-deployed in minutes. Routing errors could only be caught after careful analysis of the performance of the routes, often after many weeks of studying a route's profitability.

Bugs will happen

When software systems reach a certain complexity threshold, they become difficult to work with without accidentally introducing bugs. Part of the reason is that the system's behaviour becomes impossible to specify: there is no longer a "spec" of how the system should behave that fits in any single person's head. Instead, the multiple parties that depend on the system will each have their own definition of correctness.

Some of you might argue that there are techniques for preventing a system's complexity from ever reaching such a point, but sometimes we are handed systems that are already past the complexity threshold. This article deals with those systems.

How to safely develop a system whose spec you don't understand?

Imagine you're working on a large software system where no single person can determine what "correct behaviour" means. How can you safely modify the system? It turns out that there is a simple but effective technique for accomplishing this, which I call the 3 changes.

The 3 changes

Every change you make to a codebase can be split into 3 groups: deletions, additions and mutations of behaviour. Deletions remove behaviour from the system (e.g. deleting code), additions introduce new behaviour (e.g. new features), and mutations alter behaviour that already exists.

Testing strategies for behavior deletions

When you make a deletion, there are only 2 possibilities. Either you are deleting behaviour that is being used somewhere else, or you are deleting behaviour that is not being used anywhere.

The first possibility creates a bug; the second is essentially deleting dead code. So verifying the correctness of a deletion amounts to making sure that the code you are removing is not used by any other part of the system. Type checkers do most of the grunt work here, so in practice, pure deletions of behaviour generally have a low probability of producing bugs.
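
As a sketch of how the type checker does that grunt work, imagine a hypothetical TypeScript module pricing.ts whose exported function is still imported elsewhere (all names here are illustrative, not from a real system):

```typescript
// pricing.ts (hypothetical module)
export function legacyDiscount(total: number): number {
  return total * 0.9;
}

// checkout.ts (hypothetical caller, elsewhere in the same codebase)
import { legacyDiscount } from "./pricing";

export function payable(total: number): number {
  return legacyDiscount(total);
}

// Deleting legacyDiscount from pricing.ts makes the build fail:
//   error TS2305: Module '"./pricing"' has no exported member 'legacyDiscount'.
// If the build stays green after the deletion, the code was dead.
```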

Testing strategies for behavior additions

Adding behaviour to a system, often called "feature work", has one nice property: the new behaviour is usually introduced by an individual or by a team that works closely together, and in most cases there is a spec or some other way to define what it means for the new behaviour to be correct.

This is great, because it means that verifying the correctness of an addition amounts to verifying that the new behaviour conforms to the spec.
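
As a minimal sketch, suppose the addition is a hypothetical pricing rule and the spec is simply "parcels over 30kg get a flat 20 surcharge" (names and numbers are mine, purely for illustration). Verification is then a handful of assertions against that spec:

```typescript
import assert from "node:assert";

// Hypothetical new feature: parcels over 30kg get a flat surcharge of 20.
// The spec is the agreed-upon rule; correctness means conforming to it.
function parcelPrice(base: number, weightKg: number): number {
  return weightKg > 30 ? base + 20 : base;
}

// Verifying the addition amounts to checking it against the spec.
assert.strictEqual(parcelPrice(100, 10), 100); // below threshold: no surcharge
assert.strictEqual(parcelPrice(100, 45), 120); // above threshold: surcharge applies
assert.strictEqual(parcelPrice(100, 30), 100); // boundary: 30kg is not "over 30kg"
```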

New behaviour can also typically be released in a gradual fashion (e.g. using feature flags) to further reduce the impact of bugs.
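
A minimal sketch of such a gradual release, with illustrative names (planRouteV1/planRouteV2, and an in-memory rollout table standing in for a real flag service):

```typescript
type Route = string[];

// In a real system this would live in a flag service or config store.
const rolloutPercent: Record<string, number> = { "new-routing-engine": 10 };

function isEnabled(flag: string, userId: string): boolean {
  const percent = rolloutPercent[flag] ?? 0;
  // Hash the user id into a stable bucket 0..99 so each user
  // consistently sees the same behaviour between requests.
  let hash = 0;
  for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 100 < percent;
}

function planRouteV1(_userId: string): Route { return ["depot", "A", "B"]; } // existing behaviour
function planRouteV2(_userId: string): Route { return ["depot", "B", "A"]; } // new behaviour

function planRoute(userId: string): Route {
  // Only a small slice of traffic sees the addition, so a bug has a
  // limited blast radius and the flag doubles as a kill switch.
  return isEnabled("new-routing-engine", userId) ? planRouteV2(userId) : planRouteV1(userId);
}
```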

Testing strategies for behavior mutations

Changing the behaviour of a system is the riskiest of the 3 types of changes. The only way to safely change a system is to first understand really well how it behaves under many possible inputs, what states it can be in, and how it fails.
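
This post doesn't prescribe a technique for building that understanding, but one common fit is a characterization (golden master) test: record what the current system does across a broad range of inputs before you touch it, then replay the recording after the change. A sketch with hypothetical names:

```typescript
import assert from "node:assert";

// Hypothetical legacy rule whose exact behaviour nobody fully remembers.
function routeCost(distanceKm: number, stops: number): number {
  return distanceKm * 2 + stops * 5 + (stops > 10 ? 50 : 0);
}

// Step 1: pin down current behaviour across a grid of inputs.
const snapshot: Array<[number, number, number]> = [];
for (let d = 0; d <= 500; d += 50) {
  for (let s = 0; s <= 20; s += 2) {
    snapshot.push([d, s, routeCost(d, s)]);
  }
}

// Step 2: after mutating routeCost, replay the snapshot. Any divergence
// fails loudly with the exact input that changed, instead of surfacing
// weeks later in route-profitability analysis.
for (const [d, s, expected] of snapshot) {
  assert.strictEqual(routeCost(d, s), expected, `behaviour changed at d=${d}, s=${s}`);
}
```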

Takeaways

  • Don't mutate the behaviour of a system unless you really understand how the system works.
  • Design your system in a way that new behaviour can be easily added independently of existing behaviour.
  • Design your system in a way that code can be easily deleted if needed. @tef has a post about this called Write code that is easy to delete, not easy to extend.

