Debugging gives us the chance to live out that dream.
When software does not do what it is supposed to do, the search for the cause begins. How quickly we find this cause depends less on how we debug than on how the software is built. Some systems readily reveal what is going wrong. Others remain stubbornly silent.
What distinguishes these two types of software? What characteristics make a system debuggable, from the architecture down to the individual code section?
Architectural level
Modularity and clear boundaries are the foundation. Systems that are broken down into independent components with defined interfaces can be examined in isolation. If an error occurs, we can quickly narrow down the search space. A monolithic system without such boundaries, on the other hand, forces us to keep everything in mind at once.
What happens inside a system should also be understandable from the outside. That property is called observability. Structured logging, metrics, and distributed tracing are part of this. A system that makes its state transparent shows us where things go wrong instead of leaving us to guess.
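As an illustration of structured logging, here is a minimal sketch using Python's standard logging module. The logger name "orders", the field names, and the convention of passing fields under a "context" key are assumptions made for this example, not a prescribed format.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass structured fields
        # via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("orders")  # hypothetical component name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info(
    "payment rejected",
    extra={"context": {"order_id": "A-17", "reason": "card_expired"}},
)
print(stream.getvalue().strip())
```

Because every entry is a machine-readable object, we can later filter by order_id or reason instead of searching free text.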
The most important of these properties is reproducibility. Deterministic builds, versioned configurations, and the ability to restore the exact same system state in which a bug occurred are indispensable. A bug that we cannot reproduce is a bug that we can hardly fix systematically.
Debugging with events
In an earlier article, I described how the shift from “we only store the current state” to “we keep a record of all events” reshaped how I think about software. Event sourcing is more than an architectural pattern. It is also a powerful debugging tool.
In a traditional CRUD system, we only see the current state: “Account balance: €50”. We do not know how it got there. If this state is incorrect, a tedious hunt through logs, database audits, and guesswork begins.
An event-sourced system, on the other hand, has a perfect memory. Every event has been stored: “Account opened”, “€100 deposited”, “€30 withdrawn”, “€20 withdrawn”. The event stream is a natural audit trail, a complete history that shows us exactly how the system arrived at its current state.
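The account example can be sketched in a few lines. The Event class, the event kinds, and the function names below are made up for illustration; a real event store would look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str        # "opened", "deposited" or "withdrawn" (illustrative names)
    amount: int = 0  # amount in euros


def apply(balance: int, event: Event) -> int:
    """Return the balance that results from applying one event."""
    if event.kind == "opened":
        return 0
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind!r}")


def balance_history(events):
    """Return the balance after each event, in order."""
    balance = 0
    history = []
    for event in events:
        balance = apply(balance, event)
        history.append(balance)
    return history


events = [
    Event("opened"),
    Event("deposited", 100),
    Event("withdrawn", 30),
    Event("withdrawn", 20),
]
print(balance_history(events))  # [0, 100, 70, 50]
```

The final value is the same €50 a CRUD system would show, but here every intermediate balance can be inspected as well.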
That history is a debugging asset: we can trace every single step that led to a faulty state, identify the point in time when something unexpected happened, and “play back” the system in a test environment up to that exact point to reproduce the error.
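Replay testing utilises precisely this feature: we play back the entire history in a test environment and observe how the system behaves. It is at once a testing procedure and a debugging technique. If we know that the error must have occurred sometime between event 4711 and event 4800, we can perform a binary search: replay up to the midpoint of the range, check whether the state is still correct, and continue in the half that contains the fault. Provided the state stays faulty once it has gone wrong, about seven replays are enough to find the exact trigger among these roughly 90 events.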
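A binary search over the event stream might look like the following sketch. It rests on two assumptions: we have a check (is_valid, a hypothetical predicate) that tells a correct state from a faulty one, and a state that has become faulty stays faulty for the rest of the stream. All function names are invented for this example.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # state type
E = TypeVar("E")  # event type


def replay(events: Sequence[E], apply: Callable[[S, E], S],
           initial: S, count: int) -> S:
    """Rebuild the state from the first `count` events."""
    state = initial
    for event in events[:count]:
        state = apply(state, event)
    return state


def first_bad_event(events: Sequence[E], apply: Callable[[S, E], S],
                    initial: S, is_valid: Callable[[S], bool],
                    good: int, bad: int) -> int:
    """Return the zero-based index of the event that breaks the state.

    `good` is an event count after which the state is known to be valid,
    `bad` an event count after which it is known to be invalid.
    Assumes the state never becomes valid again once it is invalid.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_valid(replay(events, apply, initial, mid)):
            good = mid
        else:
            bad = mid
    return bad - 1


# Toy stream: each event is a balance change, a negative balance is a fault.
deltas = [10, 5, -3, -50, 4, 1]
index = first_bad_event(
    deltas,
    apply=lambda balance, delta: balance + delta,
    initial=0,
    is_valid=lambda balance: balance >= 0,
    good=0,
    bad=len(deltas),
)
print(index, deltas[index])  # 3 -50
```

This sketch replays from the beginning on every step, which keeps it simple. With snapshots, a replay could start from the last known good state instead.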
Design level
Explicit state management makes debugging easier. The less hidden, distributed or implicit state there is, the easier it is to understand how the system ended up in an erroneous state. Immutable data structures and unidirectional data flow help here.
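A small sketch of what immutable state can look like in practice. The Cart class and its fields are hypothetical; the point is that every change produces a new value, so an earlier state can never be altered behind our backs.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Cart:
    items: Tuple[str, ...] = ()
    coupon: Optional[str] = None


def add_item(cart: Cart, item: str) -> Cart:
    """Return a new cart with the item appended; the old cart is untouched."""
    return replace(cart, items=cart.items + (item,))


empty = Cart()
one_item = add_item(empty, "book")

print(empty.items)     # ()
print(one_item.items)  # ('book',)

try:
    one_item.coupon = "SPRING"  # in-place mutation is rejected
except FrozenInstanceError:
    print("carts cannot be modified in place")
```

When a faulty cart shows up, the chain of values that led to it is still intact and can be compared step by step.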
The fail-fast principle states that systems should fail immediately and loudly when encountering invalid inputs or states, rather than continuing silently and propagating the error. A system that throws an exception when encountering invalid data is easier to debug than one that silently accepts the data and later fails in an unexpected place.
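To make the contrast concrete, here is a sketch with two hypothetical parsers for an order quantity. The first one hides the problem, the second one reports it at the boundary where it enters the system.

```python
def parse_quantity_silently(raw: str) -> int:
    """Anti-pattern: invalid input quietly becomes 0 and travels on."""
    try:
        return int(raw)
    except ValueError:
        return 0


def parse_quantity(raw: str) -> int:
    """Fail fast: reject invalid input right where it arrives."""
    try:
        quantity = int(raw)
    except ValueError:
        raise ValueError(f"quantity must be an integer, got {raw!r}") from None
    if quantity <= 0:
        raise ValueError(f"quantity must be positive, got {quantity}")
    return quantity


print(parse_quantity_silently("three"))  # 0, the typo is now invisible
print(parse_quantity("3"))               # 3

try:
    parse_quantity("three")
except ValueError as error:
    print(error)  # quantity must be an integer, got 'three'
```

With the silent variant, the first visible symptom might be an invoice over €0 much later. With the strict variant, the stack trace points at the input.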
Code level
Meaningful names for variables, functions, and classes significantly reduce the cognitive load during debugging. If a variable is called $remainingAttempts instead of $ra, we immediately understand what it means.
Small, focused methods are easier to understand and test than those that do five different things at once. If a method has only one task, we know exactly where to look if that task is not performed correctly.
Defensive programming through assertions, preconditions and invariants makes assumptions explicit and checks them. If an assumption is violated, we find out immediately, not three call levels later.
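A sketch of preconditions and an invariant in one small class. LoginThrottle and its rules are invented for this example. Note that Python skips assert statements when run with the -O option, so checks that must hold in production are written as explicit exceptions here.

```python
class LoginThrottle:
    """Counts failed logins (hypothetical example class)."""

    def __init__(self, max_attempts: int) -> None:
        # Precondition: a throttle that allows no attempts makes no sense.
        if max_attempts < 1:
            raise ValueError(f"max_attempts must be >= 1, got {max_attempts}")
        self.max_attempts = max_attempts
        self.remaining_attempts = max_attempts
        self._check_invariant()

    def record_failure(self) -> None:
        # Precondition: a locked account must not receive further attempts.
        if self.remaining_attempts == 0:
            raise RuntimeError("record_failure() called on a locked account")
        self.remaining_attempts -= 1
        self._check_invariant()

    def _check_invariant(self) -> None:
        # Invariant: the counter always stays within its bounds.
        assert 0 <= self.remaining_attempts <= self.max_attempts, (
            f"remaining_attempts={self.remaining_attempts} "
            f"outside 0..{self.max_attempts}"
        )


throttle = LoginThrottle(max_attempts=2)
throttle.record_failure()
print(throttle.remaining_attempts)  # 1
```

If another part of the code ever pushes the counter out of range, the check fails at that call and not three levels further up.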
Meaningful error messages not only tell us what went wrong, but also provide context: which values were involved, which operation was attempted, what was expected. The message “Division by zero” is less helpful than “Division by zero: attempted to compute average of 0 items in averageOrderValue()”.
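In code, such a message costs one extra check. The sketch below is a Python rendering of the averageOrderValue() example; the customer_id parameter is an assumption added to show how context can be carried into the message.

```python
from typing import Sequence


def average_order_value(order_totals: Sequence[float], customer_id: str) -> float:
    """Return the mean order total, or explain why it cannot be computed."""
    if not order_totals:
        raise ValueError(
            "average_order_value(): cannot compute average of 0 orders "
            f"for customer {customer_id!r}; expected at least one order total"
        )
    return sum(order_totals) / len(order_totals)


print(average_order_value([20.0, 40.0], customer_id="C-42"))  # 30.0

try:
    average_order_value([], customer_id="C-42")
except ValueError as error:
    print(error)
```

The message names the operation, the value that caused the problem, and the expectation that was violated.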
Idempotence means that performing an operation multiple times leaves the system in the same state as performing it once. Such operations are easier to debug because we can repeat them safely.
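One common way to obtain idempotence is to give each operation an identifier and to remember which identifiers have already been applied. The ledger structure and the names below are assumptions for this sketch.

```python
def apply_payment(ledger: dict, payment_id: str, amount: int) -> dict:
    """Record a payment at most once, keyed by its payment_id."""
    if payment_id in ledger["applied"]:
        return ledger  # already recorded, nothing changes
    return {
        "balance": ledger["balance"] + amount,
        "applied": ledger["applied"] | {payment_id},
    }


start = {"balance": 0, "applied": frozenset()}
once = apply_payment(start, "p-1", 30)
twice = apply_payment(once, "p-1", 30)

print(once["balance"])  # 30
print(once == twice)    # True
```

While hunting a bug we can send the same payment again as often as we like without distorting the balance.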
Avoiding side effects makes methods particularly easy to debug. Pure functions, which depend only on their inputs and do not change any external state, return the same result for the same arguments every time. A failing call can therefore be reproduced from its arguments alone.
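A typical hidden input is the system clock. The following sketch shows an impure and a pure variant of the same hypothetical expiry check.

```python
from datetime import datetime, timedelta


def is_expired_impure(issued_at: datetime, ttl: timedelta) -> bool:
    """Impure: the result also depends on when the call happens."""
    return datetime.now() - issued_at >= ttl


def is_expired(issued_at: datetime, ttl: timedelta, now: datetime) -> bool:
    """Pure: the current time is an ordinary argument."""
    return now - issued_at >= ttl


issued = datetime(2024, 1, 1, 12, 0)
ttl = timedelta(minutes=30)

print(is_expired(issued, ttl, now=datetime(2024, 1, 1, 12, 29)))  # False
print(is_expired(issued, ttl, now=datetime(2024, 1, 1, 12, 30)))  # True
```

With the pure variant, a bug report that contains the three arguments is already a complete reproduction.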
A readable control flow lets us follow the path a program takes without holding a stack of conditions in our heads. Deep nesting and complex conditions make debugging difficult. Early returns and guard clauses keep the code flat and linear.
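The difference is easiest to see side by side. Both functions below implement the same made-up shipping rules; all field names and amounts are assumptions for this sketch.

```python
def shipping_cost_nested(order: dict) -> int:
    """Deeply nested variant: every case hides behind another condition."""
    if order["items"]:
        if not order["pickup"]:
            if order["total"] < 50:
                if order["express"]:
                    return 12
                else:
                    return 5
            else:
                return 0
        else:
            return 0
    else:
        raise ValueError("cannot ship an order without items")


def shipping_cost(order: dict) -> int:
    """Guard clauses: each special case is handled once and then left behind."""
    if not order["items"]:
        raise ValueError("cannot ship an order without items")
    if order["pickup"]:
        return 0
    if order["total"] >= 50:
        return 0
    if order["express"]:
        return 12
    return 5


order = {"items": ["book"], "pickup": False, "total": 20, "express": True}
print(shipping_cost_nested(order), shipping_cost(order))  # 12 12
```

In the flat variant, a breakpoint on any line tells us exactly which conditions have already been ruled out.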
The common thread
A common principle runs through all these levels: explicitness beats implicitness. The more a system openly communicates its intentions, its state and its error conditions, the less we have to guess when something goes wrong.
In another article, I described how tests are more than a verification tool: they are also specification, documentation, and communication. The same applies to debuggable software: it documents itself, communicates its states, and makes its history traceable.
Debuggability is not a feature that we add as an afterthought. It is the result of conscious decisions at every level of software development. Those who make these decisions are not investing in luxury, but in the ability to solve problems quickly when they arise. And they always do.