DEV Community

jghiringhelli

I Had 93% Test Coverage. Then I Ran Mutation Testing.

Line coverage measures execution: what percentage of your code runs when the tests run. 93.1% means 93.1% of lines execute. That number is correct. It says nothing about whether any test catches a failure.


Execution vs. Detection

A test that calls calculateTax(100) and asserts result !== null gives you full line coverage of that function. It will not catch a wrong tax rate, a sign error, or a rounding failure. Change the return value to anything non-null and the test still passes.

That test covers the line. It does not test the behavior.


What the Experiment Found

In the multi-agent adversarial experiment from the Generative Specification white paper (https://doi.org/10.5281/zenodo.19073543), the treatment project produced a test suite with 93.1% line coverage.

Then Stryker was applied to the services layer: 116 mutants, each a realistic code change (flipped boolean, swapped operator, removed return value). Stryker checks whether any test fails when the code is mutated.

Baseline mutation score: 58.62% MSI (mutation score indicator: the percentage of mutants killed by at least one failing test).

A 34-point gap. That is the fraction of the codebase where a bug can be introduced while every test still passes and CI goes green.

Three rounds of targeted assertion improvements followed, guided by the surviving mutants. Each round replaced presence checks with correctness checks and added boundary conditions.

After three rounds: 93.10% MSI, matching line coverage exactly.

When both numbers converge, every covered line is verified. The tests no longer just execute the code; they detect when it breaks. The line coverage number turned out to be exactly the quality level the tests needed to reach.

The same pattern appeared in the Shattered Stars case study in the paper: line coverage 80%, mutation score 58%, a 22-point gap. High line coverage alongside a low mutation score is the signature of tests written to satisfy a metric.


Why This Happens

This is not unique to AI-generated tests. Human-written suites built after the implementation show the same pattern. Tests optimized for a coverage gate optimize for the gate.

The structural fix is TDD: write the test first, make it fail, write the code that makes it pass. A test written before the implementation cannot pass until the implementation is correct. The model can follow TDD instructions. It does not do so spontaneously.


The Action

Find the mutation tool for your stack. Stryker proper targets JavaScript and TypeScript; pick the equivalent for your language.

| Stack | Tool | Command |
| --- | --- | --- |
| JavaScript / TypeScript | Stryker | `npx stryker run` |
| Python | mutmut | `mutmut run` |
| Java / Kotlin | PIT | `mvn org.pitest:pitest-maven:mutationCoverage` |
| C# / .NET | Stryker.NET | `dotnet stryker` |
| Go | go-mutesting | `go-mutesting ./...` |
| Ruby | mutant | `mutant run` |
| Rust | cargo-mutants | `cargo mutants` |
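To scope a StrykerJS run to the services layer, a minimal config sketch might look like this (the glob, test runner, and thresholds are assumptions to adjust for your project):

```javascript
// stryker.config.mjs -- minimal sketch, not a complete config.
export default {
  // Only mutate the services layer, as in the experiment.
  mutate: ["src/services/**/*.js"],
  testRunner: "jest",
  reporters: ["clear-text", "html"],
  // Fail the run if the mutation score drops below `break`.
  thresholds: { high: 90, low: 70, break: 60 },
};
```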

Run it on your services layer. Look at the gap between your line coverage and your mutation score.

If both numbers are close, be proud. Most codebases do not start there. Every covered line breaks a test when it changes. That is what a test suite is for.

If there is a gap, give your AI assistant the list of surviving mutants and ask it to strengthen the assertions. The same model that wrote weak tests can write better ones; it needs the mutation report to know what "better" means.
