The eval was technically watching the data. It was watching the wrong thing.
TL;DR
One of my baseball RAG checks warned when the dataset version changed. The idea was right. The contract was wrong.
I was hashing raw source files. That made the eval sensitive to private implementation details, not the thing the API promised to users. The better baseline was the compact provenance payload returned with answers.
Hash the contract. Not the accident of how the contract is built.
The original check
The project uses structured baseball data and returns answers with provenance. I wanted CI to notice when the factual base changed, because a changed data source can change answers.
The first version reached for the obvious thing: hash the underlying files.
That seems reasonable. If a CSV changes, the hash changes. If the hash changes, CI can warn that the baseline needs review.
But the raw file was not the public promise of the system. It was an input.
The user never sees the raw file bytes. The API consumer does not care whether a newline changed, a source export was normalized differently, or a loader got refactored while producing the same public evidence.
The thing users rely on is the provenance the system returns.
The problem with private truth
Tests can accidentally lock you to the wrong surface.
A raw CSV hash is easy to compute, but it answers a narrow question: did these bytes change?
The question I needed was different: did the factual contract exposed by the system change?
Those are not the same.
If I refactor loaders, reorder intermediate data, or normalize source files without changing the answer evidence, a raw-byte baseline can scream even though the system’s contract is stable.
Worse, a private hash can make the test feel rigorous while avoiding the real contract. It tells you something changed somewhere. It does not tell you whether the user-visible evidence changed in a meaningful way.
That kind of test creates noise. Noise teaches people to ignore the test.
The better baseline
The fix was to build the baseline from the compact provenance payload.
That payload is closer to the public contract:
answer
source authority
dataset identity
query path
evidence rows or summaries
warnings
verification status
That is what downstream code and human readers can reason about.
If this payload changes, I want CI to force a review. Maybe the data changed. Maybe a source was upgraded. Maybe the provenance contract drifted. Any of those are worth seeing.
If an internal file changes but the payload stays the same, the eval should stay quiet.
Why this matters for AI systems
AI systems already have enough moving parts. A test suite should reduce ambiguity, not add another layer of mystery.
For generated answers, the public contract is not only the final paragraph. It is also the evidence around the paragraph: citations, provenance, verification state, warnings, and failure modes.
That is the part users need when the answer matters.
A model can make a wrong answer sound smooth. A good contract makes the system show its work anyway.
The accessibility angle
Accessible systems reduce unnecessary user burden.
When an eval watches the wrong thing, failures become hard to interpret. The maintainer has to ask, “Did anything important change, or did a private file just move around?”
That burden eventually leaks to users because noisy tests lead to weaker review habits.
A contract-level baseline is easier to explain:
- This changed because the answer evidence changed.
- This changed because a source identity changed.
- This changed because a warning or verification state changed.
That is a healthier failure.
Takeaway
A baseline is a promise about what deserves attention.
If you hash private files, you promise to care about private files. If you hash the public provenance contract, you promise to care about the behavior users actually depend on.
For this project, that was the right promise.