
In a previous article, we met the Test Oracle: an authority that helps us find the truth in unit, integration, and end-to-end tests.

But what happens when we want not only to predict the future, for example by testing whether new code works, but also to put the entire past to the test?

Welcome to the world of replay testing! This approach is almost exclusively suitable for systems with event sourcing. Here we encounter an oracle that does not look into a crystal ball, but into the rear-view mirror. And what it sees there is often more merciless than any unit test.

The time travel paradox

Imagine you could take all the transactions, orders, and user interactions from the last five years and run them through your new software version. In a classic CRUD database architecture, this is almost impossible, because only the current state is recorded ("account balance: €50"); how it got there is lost.

A software system that consistently and comprehensively relies on event sourcing, on the other hand, has a perfect memory. Every event has been recorded: "Account opened", "Money deposited", "Money withdrawn", and so on.

In replay testing, we use this memory for an experiment: We play back the entire history in a test environment, but use the new code. In other words, we simulate an alternative timeline. The exciting question for our test oracle is now: Will the new software survive its own history?
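This alternative timeline can be sketched in a few lines. A minimal illustration in Python, assuming hypothetical `Event` and `Account` types (real systems would load events from an event store and use the actual aggregate implementation):

```python
from dataclasses import dataclass

# Hypothetical event type for illustration.
@dataclass
class Event:
    kind: str
    amount: int = 0

class Account:
    """Stand-in for the new implementation of the aggregate under test."""
    def __init__(self):
        self.balance = 0

    def apply(self, event: Event) -> None:
        if event.kind == "deposited":
            self.balance += event.amount
        elif event.kind == "withdrawn":
            self.balance -= event.amount

def replay(events, aggregate):
    """Run the recorded history through the new code, event by event."""
    for event in events:
        aggregate.apply(event)
    return aggregate

# The "history": events exactly as they were recorded in production.
history = [Event("deposited", 100), Event("withdrawn", 50)]
account = replay(history, Account())
print(account.balance)  # 50
```

The history stays untouched; only the code that interprets it is new.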

Silence is golden

The simplest form of this test is: we run through all events and define success as "no error occurred".

We call this an implicit test oracle. It is like a bodyguard who does not talk much. As long as no one screams (exception), burns (fatal error), or dies (segfault), the bodyguard nods: "Everything is okay".
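The implicit oracle can be expressed as a single loop whose only success criterion is the absence of exceptions. A sketch, using a trivial stand-in aggregate (`Counter` is invented for illustration):

```python
class Counter:
    """Minimal stand-in aggregate; real systems replay domain events."""
    def __init__(self):
        self.value = 0

    def apply(self, event):
        if event == "increment":
            self.value += 1
        else:
            raise ValueError(f"unknown event: {event}")

def implicit_oracle(events, aggregate):
    """Success is defined purely as 'no error occurred during replay'."""
    for i, event in enumerate(events):
        try:
            aggregate.apply(event)
        except Exception as exc:
            return f"failed at event {i}: {exc}"
    return "ok"

print(implicit_oracle(["increment", "increment"], Counter()))  # ok
print(implicit_oracle(["increment", "explode"], Counter()))
# failed at event 1: unknown event: explode
```

Note that the oracle never inspects `aggregate.value`: it only cares whether anyone screamed.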

This test oracle answers a very specific but critical question:

Is my new code backwards compatible with every crazy edge case that has occurred in production over the last five years?

This is extremely valuable because no synthetic test data set can match the creativity and chaos of reality. If your replay test runs through ten million real events, you can deploy with a level of confidence that other teams can only dream of.

But beware: this oracle is blind to logic errors. If your new code calculates 1 + 1 = 3, it will not crash. It is "just" wrong. While the implicit oracle applauds you, your accounting department goes up in flames.

Reality check

To be truly sure, we need an explicit oracle that pays attention not only to crashes but also to the truth.

This is where the real magic comes in: since we have the events from production, we usually also have the results from production, such as snapshots of the aggregates or the state of the read models.

The test procedure now changes:

  1. We take the state of a selected customer from production (account balance: €50)
  2. We take all events for this customer and run them through the new implementation in the test
  3. At the end, the oracle compares: Does the customer also have €50 in the test?
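The three steps above can be sketched as follows, assuming a hypothetical `Account` aggregate and a production snapshot reduced to a single balance (a real comparison would cover the full aggregate state or read model):

```python
class Account:
    """Stand-in for the new implementation under test."""
    def __init__(self):
        self.balance = 0

    def apply(self, event):
        kind, amount = event
        self.balance += amount if kind == "deposited" else -amount

def explicit_oracle(events, production_balance):
    """Replay through the new implementation, then compare with the
    state recorded in production."""
    account = Account()
    for event in events:
        account.apply(event)
    if account.balance == production_balance:
        return "match"
    return f"discrepancy: test={account.balance}, production={production_balance}"

# Events and snapshot taken from production for one selected customer.
events = [("deposited", 100), ("withdrawn", 50)]
print(explicit_oracle(events, production_balance=50))  # match
```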

If there is a discrepancy here, we have either found a bug in the new version or, and this is the key point, a bug in the old version that the new version has corrected. In both cases, this test oracle provides us with in-depth insight into the behaviour of our system.

Replay testing can also be used for A/B testing based on historical data: You replay all real events from recent years with new business logic and observe how the system would have behaved. Instead of just checking that nothing crashes, you can answer questions such as "Should more credit applications have been rejected?". This turns the event log into a testing ground for new rules: without any risk to production, but with maximum proximity to reality.
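As a toy illustration of such an A/B replay, one might compare an old and a new decision rule over historical applications and count where the verdicts diverge. All names and thresholds here are invented:

```python
# Hypothetical historical credit applications from the event log.
applications = [
    {"id": 1, "amount": 5_000},
    {"id": 2, "amount": 50_000},
    {"id": 3, "amount": 12_000},
]

def old_rule(app):
    return app["amount"] <= 100_000  # what production actually did

def new_rule(app):
    return app["amount"] <= 10_000   # stricter rule under evaluation

# Replay the history through both rules and list diverging decisions.
would_change = [a["id"] for a in applications if old_rule(a) != new_rule(a)]
print(f"{len(would_change)} decisions would differ: {would_change}")
# 2 decisions would differ: [2, 3]
```

Production is never touched; only the verdicts are compared.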

Silencing the oracle

When we play through the order history of the last five years, we do not want the system to send 50,000 "Your order has been shipped" emails to real customers today. That would be a disaster.

In my previous article, I explained that end-to-end tests love side effects because they provide ultimate confirmation: "The email has really arrived!". However, in replay testing, we have to actively silence these test oracles.

All adapters that leave the system boundary, for example for sending emails or communicating with payment gateways, must be replaced with dummies.
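Such a dummy typically records what it would have done without ever doing it, which also lets the test assert on outgoing calls. A sketch with invented adapter names:

```python
class SmtpMailer:
    """Production adapter: actually leaves the system boundary."""
    def send(self, to, subject):
        raise RuntimeError("would send a real email!")

class DummyMailer:
    """Replay stand-in: records the call but performs no side effect."""
    def __init__(self):
        self.sent = []

    def send(self, to, subject):
        self.sent.append((to, subject))

def handle_order_shipped(event, mailer):
    # Business logic stays identical; only the adapter is swapped.
    mailer.send(event["customer"], "Your order has been shipped")

mailer = DummyMailer()
handle_order_shipped({"customer": "alice@example.com"}, mailer)
print(len(mailer.sent))  # 1 email recorded, 0 emails actually sent
```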

However, this poses a trap for event-based systems: When a component of our software responds to an event, it often issues a command that in turn generates a new event.

If we allow this in the replay test, we distort the story. We would generate events that never happened in reality. Our test oracle would be confused because the event stream in the test would suddenly look different from the one in production.

The rule for the replay oracle is therefore: it may observe, calculate, and judge, but never act.

Practical challenges

The "alternative timeline" described above only works if the code is deterministic. In reality, however, non-determinism is everywhere. When the software looks at the clock, it sees a different value on every replay than it did in production, which can distort critical business logic. External API calls are even more problematic: an exchange rate, inventory, or pricing service may respond differently today than it did five years ago.

The only solution is to replace all non-deterministic operations with stubs. Time must be controlled, either by a stub with hard-coded values or by the recorded historical timestamp. Random values and UUIDs must be reconstructed from the event log to achieve true reproducibility. For external calls, this can be achieved with adapters that log every API call together with the response received, so that the exact same response can be served during replay.
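Both stubs fit in a few lines. A sketch with invented names, where the clock is pinned to the event's original timestamp and the exchange-rate adapter serves responses from a recorded log:

```python
import datetime

class FixedClock:
    """Time stub: during replay, 'now' is the originally recorded time."""
    def __init__(self, fixed):
        self.fixed = fixed

    def now(self):
        return self.fixed

class RecordedExchangeRates:
    """Serves the API response that was logged with the original call."""
    def __init__(self, log):
        self.log = log  # maps (date, currency) -> recorded rate

    def rate(self, date, currency):
        return self.log[(date, currency)]

# Replay a decision exactly as it was made in 2020.
clock = FixedClock(datetime.date(2020, 3, 14))
rates = RecordedExchangeRates({(datetime.date(2020, 3, 14), "USD"): 1.11})
print(rates.rate(clock.now(), "USD"))  # 1.11
```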

Hardly any software system remains unchanged over time. Eventually, even the structure of an event-based system has to change. Adding or renaming a field of an event requires versioning and transformation. A common strategy is called upcasting: old events are transformed into their modern form when loaded. While this works, it can slow down replays significantly. An alternative option is to implement in-place transformations, whereby the event log itself is migrated. However, this means that the event log becomes mutable.
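An upcaster is essentially a pure function from the old event shape to the new one, applied on load. A minimal sketch, assuming an invented v1-to-v2 migration that splits a `name` field:

```python
def upcast(event):
    """Transform old event versions into the current schema on load.
    Assumed migration: v1 had a single 'name' field; v2 splits it."""
    if event.get("version", 1) == 1:
        first, _, last = event["name"].partition(" ")
        event = {"version": 2, "first_name": first, "last_name": last}
    return event

old = {"version": 1, "name": "Ada Lovelace"}
print(upcast(old))
# {'version': 2, 'first_name': 'Ada', 'last_name': 'Lovelace'}
```

The stored event stays immutable; only its in-memory representation is modernised, which is exactly the per-event cost that can slow down large replays.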

Replay testing reveals a problem that is often overlooked: the immutability of events is fundamentally at odds with the GDPR and the right to be forgotten. If personal data (such as names, email addresses, or account numbers) is stored directly within events, it cannot simply be deleted, because doing so would destroy the historical integrity of the event log.

One solution is to avoid storing personal data in events altogether and instead use references, such as user IDs, keeping the actual data in separate, traditional databases where deletion is possible. Alternatively, pseudonymisation techniques such as crypto shredding can be applied: sensitive fields are encrypted, and the encryption key is deleted when a deletion request arrives, so the data becomes unreadable without the event itself being modified. For replay testing itself, such data should be anonymised before execution to eliminate compliance risks.
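The mechanics of crypto shredding can be sketched as follows. The XOR "cipher" below is a deliberately toy placeholder for real authenticated encryption; the key store and names are invented for illustration:

```python
import secrets

class KeyStore:
    """Per-user encryption keys; deleting a key 'shreds' that user's data."""
    def __init__(self):
        self.keys = {}

    def key_for(self, user_id):
        return self.keys.setdefault(user_id, secrets.token_bytes(16))

    def shred(self, user_id):
        self.keys.pop(user_id, None)

def xor(data: bytes, key: bytes) -> bytes:
    # Toy cipher for illustration only; use real authenticated encryption.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

store = KeyStore()
# The event stores only the ciphertext of the personal field.
event = {"user": 42, "email": xor(b"alice@example.com", store.key_for(42))}

# Normal replay: key present, field decryptable.
print(xor(event["email"], store.key_for(42)).decode())  # alice@example.com

# GDPR deletion request: drop the key; the event itself stays immutable,
# but the ciphertext is now permanently unreadable.
store.shred(42)
print(42 in store.keys)  # False
```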

The ultimate regression test

Replay testing is not magic, but it feels like it. It is the only type of test that works with 100% realistic data without taking the risks of testing in production.

It does not replace unit tests, which tell us why something is broken, or end-to-end tests, which tell us whether the overall system is now working. However, it is unbeatable as a safety net for refactorings in complex domains.

The next time you make profound changes to your event-sourced system, do not just ask the oracle if your tests are green. Ask the history. Play it back. If it runs without errors and delivers the same result as reality at the end, then you know you have found the truth.