This article continues the line of thought I began in "Test-Driven Security" and "Turbo-Charging Your PHPUnit Suite". Read together, the three pieces argue that security and test suite performance are no longer separate quality concerns.
The case for fast test suites used to be a case about people. Developers lose flow when feedback is slow. Teams stop running suites they cannot tolerate. Engineers waste hours in aggregate that would otherwise be spent on work that matters. Every argument I made in "Turbo-Charging Your PHPUnit Suite" was, at its root, an argument about human attention.
That case is still correct. It has also stopped being the whole case.
Recently, the way serious security work is carried out has started to change. Vulnerability researchers, maintainers and security teams working on large open-source projects have started using LLM-based coding agents as a kind of tireless junior reviewer. The agent reads the code base and forms hypotheses about where weaknesses might be hiding. In more disciplined workflows, it then attempts to prove each hypothesis by writing and running a failing test. A hypothesis with a reproducer becomes a finding; one without is discarded.
That last sentence sounds familiar, and rightly so. It is precisely the approach I advocated in "Test-Driven Security". The pleasant surprise is that competent agentic security review enforces the same standard from the outside that test-driven security tries to instil from the inside: no claim without an executable test.
The unpleasant surprise is what this implies for your test suite.
The new shape of the feedback loop
When code is reviewed for vulnerabilities by a human, the feedback loop is limited by attention. A reviewer examines a function, forms a suspicion, writes a small reproducer if necessary, runs it once or twice, then moves on. While the wall-clock cost of a single verification step is important, the dominant cost is human: reading, understanding, and deciding what to look at next. Test suite duration is a tax, not a ceiling.
When an agent reviews code for vulnerabilities, the feedback loop is limited by wall-clock time: how long each verification run takes to complete. The agent does not get tired, and it does not lose focus halfway through the third extension. Instead, it runs the test suite, runs it again with a candidate reproducer added, runs it again under AddressSanitizer, and finally runs the affected subset under UndefinedBehaviorSanitizer, iterating on the reproducer until it triggers cleanly or gives up.
In this loop, the test suite is not a tax. It is the ceiling. A suite that takes ninety seconds to confirm or refute a candidate finding enables an agent to verify dozens of hypotheses per hour. A suite that takes twenty-five minutes allows only a handful per day. The same code base, reviewed by the same agent, will yield more security findings with the fast suite than with the slow one, not because the agent has become smarter, but because the suite got out of its own way.
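The ceiling can be made concrete with a few lines of arithmetic. The following is a minimal sketch in Python, not a measurement: the function name and the `runs_per_hypothesis` parameter are assumptions introduced for illustration, and a real agent's throughput also depends on the time it spends reading code and writing reproducers.

```python
def hypotheses_per_hour(suite_seconds: float, runs_per_hypothesis: int = 1) -> float:
    """Upper bound on hypotheses an agent can verify per hour of wall-clock time.

    Assumes verification time is dominated by running the suite and that
    every hypothesis needs the same number of runs (both simplifications).
    """
    return 3600 / (suite_seconds * runs_per_hypothesis)


fast = hypotheses_per_hour(90)        # ninety-second suite, one run each
slow = hypotheses_per_hour(25 * 60)   # twenty-five-minute suite, one run each

# Several runs per hypothesis (plain, with reproducer, under each sanitiser)
# lower both figures by the same factor; the gap between them remains.
fast_four_runs = hypotheses_per_hour(90, runs_per_hypothesis=4)
slow_four_runs = hypotheses_per_hour(25 * 60, runs_per_hypothesis=4)

print(f"fast: {fast:.1f}/h, slow: {slow:.1f}/h")
print(f"four runs each: fast {fast_four_runs:.1f}/h, slow {slow_four_runs:.1f}/h")
```

Whatever values you plug in, the ratio between the two suites stays the same: the ninety-second suite verifies roughly seventeen times as many hypotheses in the same wall-clock time.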
From tax to ceiling
This is a meaningful shift in what slow tests cost, and it is worth being precise about it.
The cost of a slow human-facing suite increases roughly linearly with the number of runs. If your developers run the suite twenty times a day, with each run taking ten minutes, that is two hundred minutes of waiting per developer per day, and you can do the maths from there to arrive at the unfortunate figure I mentioned in the previous article. It is painful, but manageable.
The cost of a slow agent-facing suite is not linear. It is selective. Slow suites do more than just slow the agent down; they also alter the findings produced. Hypotheses for which the cost of verification exceeds the agent's patience budget are silently dropped from the output. You never see them. They do not appear as "deferred" or "incomplete". The only indication that they ever existed comes when another team, working with a faster suite, finds the same kind of bug in their code base while it remains undiscovered in yours.
A slow test suite used by a human is a productivity problem you can measure. For an agent, it is a coverage problem you cannot.
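The selective effect described above can be sketched as a simple filter. This is an illustrative model only: `Hypothesis`, `triage`, the hypothesis names and the per-hypothesis budget are all hypothetical, and real agents do not expose their budget in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    runs_needed: int  # suite runs needed to confirm or refute it (assumed known)


def triage(hypotheses, suite_seconds, budget_seconds):
    """Split hypotheses into those verified and those silently dropped.

    A hypothesis is dropped when verifying it would cost more wall-clock
    time than the per-hypothesis budget allows.
    """
    verified, dropped = [], []
    for h in hypotheses:
        cost = h.runs_needed * suite_seconds
        (verified if cost <= budget_seconds else dropped).append(h.name)
    return verified, dropped


candidates = [
    Hypothesis("length check in parser", runs_needed=2),
    Hypothesis("use after free in iterator", runs_needed=5),
    Hypothesis("integer overflow in allocator", runs_needed=12),
]
two_hours = 2 * 60 * 60

print(triage(candidates, suite_seconds=90, budget_seconds=two_hours))
print(triage(candidates, suite_seconds=25 * 60, budget_seconds=two_hours))
```

With the ninety-second suite, all three candidates fit within the budget. With the twenty-five-minute suite, only the cheapest one does, and nothing in the agent's report says that the other two were ever considered.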
What the agent actually needs
"Fast" is not a useful goal on its own. Different users of a test suite require different things from it, and the agent's requirements are specific enough to be worth spelling out.
The agent needs to be able to quickly run a narrow, relevant subset of the suite. It rarely wants to run the entire suite. It wants to verify a candidate finding in a specific extension, module or component with the appropriate sanitiser instrumentation and then move on. A monolithic suite that can only be invoked as a single forty-minute job is useless to the agent, even if forty minutes is not that bad in absolute terms. The same suite, however, partitioned cleanly by directory or component so that the agent can run only what matters, would be dramatically more useful.
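When a suite is partitioned by directory, selecting the relevant subset becomes a mechanical step. The sketch below assumes a hypothetical layout in which each component lives under `ext/<component>/` with its tests in `ext/<component>/tests`; the function name and the file names are illustrative, and your project's conventions may differ.

```python
from pathlib import PurePosixPath


def select_test_dirs(changed_files):
    """Map touched source files to the test directories of their components.

    Assumes the hypothetical layout ext/<component>/... with tests in
    ext/<component>/tests. Files outside ext/ select nothing.
    """
    dirs = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if len(parts) >= 3 and parts[0] == "ext":
            dirs.add(f"ext/{parts[1]}/tests")
    return sorted(dirs)


touched = ["ext/json/json_parser.c", "ext/json/json.c", "README.md"]
print(select_test_dirs(touched))
```

The value of this is less the code than the precondition: the mapping only works if the tests for a component can be run without the rest of the suite.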
The agent requires deterministic tests. To an agent, a test that fails intermittently is indistinguishable from a real finding. It will waste its budget chasing a ghost. This is the same point I made in the previous article about flakiness, but the stakes are now higher: the cost of a flaky suite is no longer measured only in developer frustration, but also in wasted security reviews.
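One inexpensive way to find intermittent tests before an agent does is to run each test several times and compare the outcomes. This is a minimal sketch with hypothetical names. Repeated runs can reveal flakiness, but they cannot prove that a test is deterministic.

```python
import itertools


def classify(test, runs=10):
    """Run a zero-argument test callable several times and classify it.

    Returns "pass" or "fail" when every run agrees, "flaky" otherwise.
    Only AssertionError counts as a test failure in this sketch.
    """
    outcomes = set()
    for _ in range(runs):
        try:
            test()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
    if len(outcomes) == 1:
        return outcomes.pop()
    return "flaky"


def stable_test():
    assert sorted([3, 1, 2]) == [1, 2, 3]


_calls = itertools.count()


def order_dependent_test():
    # Simulates hidden shared state: fails on every third invocation.
    assert next(_calls) % 3 != 2


print(classify(stable_test))           # pass
print(classify(order_dependent_test))  # flaky
```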
The agent requires sufficient per-test isolation that adding a new reproducer does not necessitate an understanding of the entire suite's setup conventions. Test designs that depend on order, shared state or undocumented fixture interactions are difficult to review automatically. Lazy fixtures, in-memory substitutes for heavyweight dependencies, and tests that build only what they need are not just good practice for humans. They are what differentiate an agent that can add a reproducer in one attempt from an agent that gives up after three.
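To illustrate what lazy fixtures and in-memory substitutes look like in practice, here is a small sketch in Python. The `Fixtures` class, the table and the reproducer are hypothetical; the same shape can be built in PHPUnit or any other framework.

```python
import sqlite3
from functools import cached_property


class Fixtures:
    """Per-test fixtures that are built only when first accessed."""

    @cached_property
    def db(self):
        # In-memory substitute for a heavyweight database server.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        return conn


def check_name_is_stored_verbatim():
    # A reproducer needs nothing beyond its own fresh Fixtures instance:
    # no ordering, no shared state, no suite-wide setup conventions.
    fx = Fixtures()
    payload = "Robert'); DROP TABLE users;--"
    fx.db.execute("INSERT INTO users (name) VALUES (?)", (payload,))
    row = fx.db.execute("SELECT name FROM users").fetchone()
    assert row == (payload,)


check_name_is_stored_verbatim()
```

A test that never touches `fx.db` never pays for the database, and a new reproducer can be added without reading any other test.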
Finally, the agent needs the suite to run cleanly under sanitisers. For C extension authors, the suite should be expected to pass under AddressSanitizer with detect_leaks=1 and UndefinedBehaviorSanitizer in the same way as a normal build. A suite that produces a steady background hum of known sanitiser warnings forces the agent to distinguish signal from noise with every run, and it will make mistakes.
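A wrapper that treats any sanitiser report as a failure helps to keep that background hum at zero. The sketch below makes several assumptions: the function names are hypothetical, the patterns only match the usual report prefixes of AddressSanitizer, LeakSanitizer and UndefinedBehaviorSanitizer, and whether `halt_on_error=1` takes effect can depend on how the binary was built.

```python
import os
import re

# Typical report markers: "==123==ERROR: AddressSanitizer: ...",
# "==123==ERROR: LeakSanitizer: ..." and UBSan's "file.c:1:2: runtime error: ...".
_REPORT = re.compile(r"ERROR: (AddressSanitizer|LeakSanitizer)|runtime error:")


def sanitizer_env(base=None):
    """Return a copy of the environment with sanitiser options set."""
    env = dict(os.environ if base is None else base)
    env["ASAN_OPTIONS"] = "detect_leaks=1"
    env["UBSAN_OPTIONS"] = "halt_on_error=1:print_stacktrace=1"
    return env


def sanitizer_reports(stderr_text):
    """Return every line of captured stderr that looks like a sanitiser report."""
    return [line for line in stderr_text.splitlines() if _REPORT.search(line)]


captured = (
    "==1234==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000000010\n"
    "parser.c:12:5: runtime error: signed integer overflow\n"
    "all tests passed\n"
)
print(len(sanitizer_reports(captured)))  # 2
```

The environment would be passed to whatever command runs the suite, for example via `subprocess.run(cmd, env=sanitizer_env(), capture_output=True, text=True)`, and a run would count as clean only if `sanitizer_reports` returns an empty list.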
None of these requirements are new. They are all restatements of advice that I have previously given in articles aimed at human readers. The difference is that they have been elevated from "nice to have" to "essential".
The same property, from another angle
Readers of "Security through chaos" will recognise the underlying pattern. Fuzzing and property-based testing exhibit the same property, with wall-clock time translating into coverage, only more starkly: a fuzzer left to run for an hour will explore more of the input space than one left to run for a minute, in a way that is approximately monotonic and measurable. The speed of the test suite sets the budget within which any iterative discovery process operates, whether the operator is human, a fuzzer, or an LLM.
What is new in the LLM case is not the principle, but who the operator is. A fuzzer is a tool that you choose to run. An LLM-based reviewer is increasingly becoming the standard initial review of your code, carried out by you, your contributors, your downstream users and anyone else who is interested enough to use one on your repository. The wall-clock ceiling that used to constrain a niche activity now constrains the default activity.
If you maintain a PHP extension, a library, or any substantial code base, the people reviewing your code for security are no longer just the ones you know about. They also include automated reviewers operated by people you will never meet, who evaluate your code against budgets that you do not set. Whether these reviewers identify the genuine bugs in your code or abandon the process before reaching them depends partly on the decisions you make regarding your test suite.
What this changes, and what it does not
The practical advice remains the same. Measure first because intuition can be misleading. Fix the test design before reaching for infrastructure. Partition your suite by component so that narrow subsets can be run quickly. Eliminate hidden integration tests masquerading as unit tests. Build lazy fixtures. Remove shared state. Run under sanitisers in continuous integration, not just during the frantic week before a release. Everything I wrote in "Turbo-Charging Your PHPUnit Suite" still applies, in the same order, for the same reasons.
The justification is what changes. Previously, the case for doing this work relied on appeals to developer experience, morale and the long-term health of the team. While these arguments are valid, they are subjective and have always been overruled in budget meetings by arguments about feature velocity. The new argument is harder to dismiss. A slow, flaky, non-isolated test suite is now a security liability in a way that it was not five years ago because it affects what automated code reviews can find.
I am wary of overclaiming here. Fast tests do not make software secure, any more than fast tests make software correct. The point I made in "Test-Driven Security" stands: security is a property, maintained by the same disciplines that maintain every other property we care about. Speed is one of those disciplines, alongside the others. It has simply moved up the list.
The maintainer's obligation
There is a slightly uncomfortable consequence of all this that I want to name directly.
For most of Open Source's history, the implicit contract between maintainers and the wider world has been as follows: we publish the code; you can read it if you wish. The quality of the reviews your code receives is limited by how much attention people are willing to give it, and this attention has always been scarce. Maintainers could reasonably assume that the security of their code was an internal concern, audited by themselves and a few interested parties.
That assumption is now incorrect. The amount of time available to review your code has, in practice, increased significantly. The constraint is no longer whether anyone is willing to look, but what the reviewer can actually run. The reviewer has some control over the answer to that second question. You have rather more.
Whether they think of it that way or not, a maintainer who keeps the test suite fast, isolated, and runnable in narrow subsets is doing security work. They are widening the window in which both human and automated reviewers can verify hypotheses against the code. Conversely, a maintainer who allows the suite to become a forty-minute monolith that can only be run as a single unit is inadvertently narrowing that window. They are not introducing vulnerabilities. However, they are reducing the probability that existing vulnerabilities will be found before they are exploited.
I do not think this means that maintainers owe the world a perfectly engineered test suite. Most of us are doing this in our spare time, and we already hold ourselves to higher standards than anyone has the right to demand. However, I do think the balance has shifted enough to make it worth stating explicitly. The next time you put off splitting the suite, removing the shared fixture, or fixing the unreliable test that everyone has learned to retry, it is important to recognise that the cost of that delay is no longer only in terms of developer hours.
Three things to take away
The arithmetic of test suite duration has not changed. What has changed is who is paying.
The work of making a suite fast is the same work it has always been, and the advice in "Turbo-Charging Your PHPUnit Suite" remains the right place to start. What is new is the second reason to do it. Fast tests have always been a feature of healthy code bases. They are now also a feature of secure ones.
If you take away only three things from this article, make sure they are these:
- Test suite speed is now part of your security strategy, not just your developer experience.
- Partitionability is as important as raw speed because automated reviewers want narrow subsets, not single heroic runs.
- The discipline that makes a suite fast is the same discipline that makes a suite trustworthy for automated review.
These are not three separate obligations. They are one obligation viewed from three different angles.