The best reliability fix I made in my baseball RAG project was deleting a database.
TL;DR
I removed ChromaDB from my baseball RAG system because it was doing the wrong job. It duplicated generated facts, depended on rebuildable local vector state, and made biography answers rely on an index that was not the source of truth.
DuckDB stayed. Lahman stayed as the primary factual authority. Retrosheet stayed as optional secondary evidence for biography stat claims. The LLM still writes prose, but it does not get to decide what is true.
The system
The project answers baseball questions like:
- “who had the most RBIs in 1962”
- “who played for the Braves in 1936”
- “what is OPS”
- “who was Babe Ruth”
I did not want a chatbot that simply sounded confident. I wanted a system that could show its work.
That meant answers needed evidence: SQL, rows, source metadata, checksums, warnings, and verification results. The model could help with language, but the facts needed to come from somewhere inspectable.
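As a rough illustration of that evidence requirement, here is a minimal sketch of an answer envelope. It is not the project's actual schema: the names `Evidence`, `Answer`, `is_auditable`, and `sha256_of` are assumptions made up for this example.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class Evidence:
    """One inspectable piece of support for an answer (hypothetical shape)."""
    sql: str           # the exact query that was run
    rows: list[tuple]  # the rows that query returned
    source: str        # which dataset the rows came from, e.g. "lahman"
    checksum: str      # hash of the source data the rows were read from


@dataclass
class Answer:
    """The prose plus everything needed to audit it (hypothetical shape)."""
    text: str
    evidence: list[Evidence] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    verified: bool = False

    def is_auditable(self) -> bool:
        # Auditable here means: at least one evidence item carries both SQL and rows.
        return any(e.sql and e.rows for e in self.evidence)


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest, usable as a checksum for a source file's bytes."""
    return hashlib.sha256(data).hexdigest()
```

The point of a shape like this is that the prose never travels alone. If `is_auditable()` is false, the caller knows the text has nothing behind it.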
Where ChromaDB became a problem
ChromaDB seemed useful at first for player biographies. Biographies are fuzzy, so a vector store felt like a natural fit.
But that created a bad boundary.
The structured facts already lived in DuckDB. Player identity resolved through Lahman-backed data. Stats came from tables. Provenance came from the manifest.
ChromaDB added another place where facts could appear, drift, or go stale.
That meant biography behavior could depend on local vector index state instead of the factual path I could audit. If the index was missing, stale, or rebuilt differently, the answer could change while still looking polished.
That was the part I did not like.
The decision
I removed ChromaDB from the core runtime path.
The architecture became simpler:
Question
   |
   v
Router
   |-- stat query ---------> DuckDB
   |-- database question --> typed query spec -> SQL -> DuckDB
   |-- player bio ---------> DuckDB identity -> LLM bio -> claim checks
   |-- explanation --------> local definitions first, then LLM explanation
   |
   v
Structured answer with sources, warnings, and metadata
The important part is the boundary.
DuckDB answers structured baseball questions. Lahman is the primary factual authority. Retrosheet can add secondary consensus evidence for biography stat claims. The LLM can write a readable biography, but extractable stat claims get checked before the answer is returned.
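To make the routing boundary concrete, here is a toy sketch of a four-way router with a "local definitions first" explanation path. The keyword rules, the `Route` enum, the `LOCAL_DEFINITIONS` glossary, and the mapping of each sample question to a route are all assumptions for illustration; the real router is not shown in this post and is likely more careful.

```python
from enum import Enum


class Route(Enum):
    STAT_QUERY = "stat query"
    DATABASE_QUESTION = "database question"
    PLAYER_BIO = "player bio"
    EXPLANATION = "explanation"


# Hypothetical local glossary, consulted before any LLM call.
LOCAL_DEFINITIONS = {
    "ops": "On-base plus slugging: on-base percentage plus slugging percentage.",
}


def route(question: str) -> Route:
    """Toy keyword router (an assumption, not the project's logic)."""
    q = question.lower().strip().rstrip("?")
    words = q.split()
    if q.startswith("what is "):
        return Route.EXPLANATION
    if q.startswith(("who was ", "who is ")):
        return Route.PLAYER_BIO
    if "most" in words or "fewest" in words:
        return Route.STAT_QUERY
    # Everything else is treated as a general database question.
    return Route.DATABASE_QUESTION


def explain(term: str) -> tuple[str, str]:
    """Return (text, source). Local definitions win; the LLM is only a fallback."""
    key = term.lower()
    if key in LOCAL_DEFINITIONS:
        return LOCAL_DEFINITIONS[key], "local definition"
    # Placeholder for the LLM path, which is out of scope for this sketch.
    return f"[LLM explanation of {term!r} would go here]", "llm"
```

Note that every path returns its source alongside its text, so the caller can always tell whether an explanation came from the local glossary or from the model.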
That is less magical. Good.
Why this mattered
A vector index is useful when search is the main problem.
Here, authority was the main problem.
If the user asks who had the most RBIs in 1962, I do not want the answer coming from a semantic memory of baseball text. I want it coming from a table, with the SQL and source rows visible.
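Here is a minimal sketch of what "coming from a table, with the SQL and source rows visible" could look like. To keep it runnable with only the standard library, it uses an in-memory `sqlite3` database as a stand-in; DuckDB's Python connection exposes a similar `execute(...).fetchall()` interface. The table and column names follow the Lahman `Batting` table, but the rows are made-up toy data, and `rbi_leader` and the returned dictionary shape are assumptions.

```python
import sqlite3  # stand-in for DuckDB so this sketch runs without extra installs

# Sum across stints, because a player traded mid-season has several rows per year.
SQL = """
SELECT playerID, SUM(RBI) AS rbi
FROM Batting
WHERE yearID = ?
GROUP BY playerID
ORDER BY rbi DESC, playerID
LIMIT 1
"""


def rbi_leader(con, year: int) -> dict:
    """Run the query and return the rows together with the SQL that produced them."""
    rows = con.execute(SQL, [year]).fetchall()
    return {"sql": SQL.strip(), "params": [year], "rows": rows, "source": "lahman"}


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Batting (playerID TEXT, yearID INTEGER, stint INTEGER, RBI INTEGER)"
)
# Toy rows with placeholder IDs; these are NOT real 1962 statistics.
con.executemany(
    "INSERT INTO Batting VALUES (?, ?, ?, ?)",
    [
        ("player_a", 1962, 1, 60),
        ("player_a", 1962, 2, 50),
        ("player_b", 1962, 1, 100),
        ("player_a", 1961, 1, 200),
    ],
)

result = rbi_leader(con, 1962)
print(result["rows"])  # [('player_a', 110)]
```

One caveat with this sketch: `LIMIT 1` silently drops ties, which is exactly the kind of thing that should surface as a warning instead of disappearing.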
If the user asks who Babe Ruth was, I am fine with the model writing prose. But if that prose includes a stat, the system needs to check it.
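A claim check along those lines could be sketched like this. Everything here is an assumption for illustration: the regex, the function names, the three status labels, and the `authority` dictionary, which stands in for career totals looked up from the database. The numbers in the usage example are made up.

```python
import re

# Hypothetical pattern: matches phrases like "300 home runs" in generated prose.
CLAIM_RE = re.compile(r"(\d[\d,]*)\s+(home runs|hits|rbis)", re.IGNORECASE)


def extract_claims(text: str) -> list[tuple[str, int]]:
    """Pull (stat, number) pairs out of free text."""
    claims = []
    for number, stat in CLAIM_RE.findall(text):
        claims.append((stat.lower(), int(number.replace(",", ""))))
    return claims


def check_claims(text: str, authority: dict[str, int]) -> list[dict]:
    """Compare each extracted claim against the authoritative totals."""
    results = []
    for stat, claimed in extract_claims(text):
        expected = authority.get(stat)
        if expected is None:
            status = "unsupported"   # nothing in the authority to check against
        elif expected == claimed:
            status = "verified"
        else:
            status = "contradicted"
        results.append(
            {"stat": stat, "claimed": claimed, "expected": expected, "status": status}
        )
    return results


# Made-up biography text and made-up totals, purely to exercise the three outcomes.
bio = "He hit 300 home runs and collected 2,000 hits, with 1,500 RBIs."
authority = {"home runs": 300, "hits": 1950}
checks = check_claims(bio, authority)
```

With this toy input, the home run claim is verified, the hits claim is contradicted, and the RBI claim is unsupported because the authority has nothing to compare it to. A regex will miss claims phrased in other ways, so a sketch like this only covers the extractable cases.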
The model can explain the evidence. It cannot replace the evidence.
The accessibility angle
This is also an accessibility issue.
A user should not need to reverse engineer the system to understand why an answer appeared. Hidden vector state, stale generated facts, and silent fallbacks all create extra work for the user.
A more accessible AI system should make recovery easier by answering these questions up front:
- What handled this question?
- What source did it use?
- Did it run SQL?
- Were factual claims checked?
- Is anything unsupported?
That is why provenance matters. It is not decoration. It is part of the interface.
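Treating provenance as interface could be as simple as rendering one line per question. This is a hypothetical sketch: the `provenance_lines` function and the metadata keys (`route`, `source`, `sql`, `claims_checked`, `unsupported_claims`) are assumed names, not the project's actual fields.

```python
def provenance_lines(meta: dict) -> list[str]:
    """Turn answer metadata into one plain-language line per user question."""
    unsupported = meta.get("unsupported_claims", [])
    return [
        f"Handled by: {meta.get('route', 'unknown')}",
        f"Source: {meta.get('source', 'unknown')}",
        f"SQL run: {'yes' if meta.get('sql') else 'no'}",
        f"Claims checked: {'yes' if meta.get('claims_checked') else 'no'}",
        f"Unsupported: {', '.join(unsupported) if unsupported else 'none'}",
    ]


# Example metadata with assumed keys and placeholder values.
lines = provenance_lines(
    {
        "route": "player bio",
        "source": "lahman",
        "sql": None,
        "claims_checked": True,
        "unsupported_claims": ["rbis"],
    }
)
print("\n".join(lines))
```

Missing metadata falls back to "unknown" or "no" instead of being hidden, so a gap in provenance is itself visible to the user.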
Takeaway
The fix was not to make retrieval better.
The fix was to remove retrieval from the place where it weakened the system.
A RAG system does not become reliable because it retrieves more text. It becomes reliable when every part of the system knows its job.