by Stewart

The model should explain the evidence, not become the evidence

A short decision story about source authority, provenance, and keeping generated prose out of the truth path.

The model is useful when it explains evidence. It becomes dangerous when it quietly replaces it.

TL;DR

In my baseball RAG project, I had to make a boring but important decision: the model was not allowed to become the source of truth.

Lahman stayed the primary authority for structured baseball facts. DuckDB stayed the place those facts could be queried. Retrosheet could support a claim when it had relevant evidence, but it did not become a second universal truth source. The model could write and explain. It could not decide what was true.

That sounds obvious until you build the system and see how easy it is to blur the line.

The tempting version

The tempting version of a RAG system is simple:

retrieve some text
ask the model
return the answer

That works for a demo. It is not enough for a system that needs to be audited.

The hard part is not getting a plausible paragraph. Models are good at plausible paragraphs. The hard part is knowing why the paragraph should be trusted.

When the user asks, “Who had the most RBIs in 1962?”, the answer should not come from a model’s memory, a generated biography, or a loose semantic match. It should come from a query over the source data. The system should be able to show the SQL, the returned rows, and the source manifest behind those rows.

If the system cannot do that, the user has to trust the tone of the answer. That is a bad interface.

The boundary I wanted

The boundary became:

facts -> database and source manifests
claims -> checked against facts
prose -> written by the model
answer -> returned with provenance

That split gave each part of the system a job.

The database handles structured truth. The provenance layer says where that truth came from. The model turns supported facts into readable language. The API response exposes enough evidence that a user, evaluator, or future version of me can inspect what happened.

That is less impressive than a magic chatbot. It is also more honest.

Why Retrosheet stayed secondary

Retrosheet is useful. It has event-level baseball data that Lahman does not try to represent in the same way.

But useful is not the same as primary.

I did not want every source to compete for the same authority. For the core stat path, Lahman had already become the stable source. Retrosheet could provide secondary evidence for specific biography stat claims, especially when a claim needed extra confidence or a conflict needed to be surfaced.

That meant the system could say something more precise than “verified” or “not verified.” It could distinguish between a claim supported by the primary source, a claim supported by multiple sources, a conflict, and a contradiction.

Those states matter because they keep uncertainty visible.

The accessibility angle

This is also accessibility work.

A system that hides its evidence asks the user to do extra labor. The user has to wonder whether the answer came from a table, a search result, a stale index, or the model’s own memory. That is cognitive load the system should carry.

A more accessible AI system should answer these questions without making the user dig:

  • What source did this use?
  • Was the claim checked?
  • Did sources disagree?
  • Is this prose or evidence?
  • Can I trace the answer back to something stable?

That is why provenance is not garnish. It is part of the user experience.

Takeaway

The model can be a writer. It can be an explainer. It can even be a useful parser when the output is checked.

But it should not silently become the evidence.

If an AI system needs to be reliable, the truth path should be boring enough to inspect. The model can make that truth easier to read. It should not be the place truth lives.