The sequence finished as completed_with_failures.
That is the right result: two working canvas artifacts, one rejected artifact, and no network or browser-console failures during validation.
I wanted one simple thing: give three local models the same browser animation prompt and keep the results honest.
The annoying part was not the fireworks. It was everything around them.
Each model needed a different runtime lane:
- GLM 5.2 ran through the GLM-capable llama.cpp server.
- MiniMax M3 needed a MiniMax-specific llama.cpp PR build.
- DeepSeek V4 Flash used the DS4 server lane with its own launch recipe.
That is the part I care about.
A scoreboard only tells me which model looked better. This run tested something more useful: whether the approval system can keep incompatible local runtimes isolated, start them cleanly, stop them cleanly, validate the browser artifact, and report failure without pretending everything passed.
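The start-cleanly, stop-cleanly part can be sketched in a few lines. This is a minimal illustration, not the actual harness: `Lane`, `running`, and `run_sequence` are hypothetical names, and the start and stop callables stand in for whatever launch recipe each runtime really needs.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass
class Lane:
    """One isolated runtime lane (hypothetical shape, for illustration only)."""
    name: str
    start: Callable[[], None]  # launches this lane's server
    stop: Callable[[], None]   # shuts it down again


@contextmanager
def running(lane: Lane) -> Iterator[Lane]:
    """Start a lane and guarantee it is stopped, even if evaluation fails."""
    lane.start()
    try:
        yield lane
    finally:
        lane.stop()


def run_sequence(lanes: List[Lane], evaluate: Callable[[Lane], str]) -> Dict[str, str]:
    """Run lanes one at a time so incompatible runtimes never overlap."""
    results: Dict[str, str] = {}
    for lane in lanes:
        try:
            with running(lane):
                results[lane.name] = evaluate(lane)
        except Exception as exc:
            # Record the failure instead of hiding it or aborting the sequence.
            results[lane.name] = f"error: {exc}"
    return results
```

The design choice that matters here is the `finally`: a lane that crashes mid-evaluation is still stopped before the next one starts, and its failure lands in the report instead of vanishing.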
Result
| Model | Result | Note |
|---|---|---|
| GLM 5.2 | manual_review | Functional fullscreen canvas fireworks. |
| MiniMax M3 | reject | Failed the hard gate for non-canvas visual overlays. |
| DeepSeek V4 Flash | manual_review | Functional fullscreen canvas fireworks. |
The source sequence report is results/admission-eval/model-approval-sequence/2026-06-21T01-50-00-000Z-sequence.json.
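How three per-model verdicts roll up into one sequence status can be sketched as below. The rule is an assumption on my part (any reject marks the whole sequence as completed_with_failures), and the plain "completed" status is a hypothetical name for the all-clear case; the real report may use different logic.

```python
from typing import Dict

# Per-model verdicts copied from the result table above.
VERDICTS: Dict[str, str] = {
    "GLM 5.2": "manual_review",
    "MiniMax M3": "reject",
    "DeepSeek V4 Flash": "manual_review",
}


def sequence_status(verdicts: Dict[str, str]) -> str:
    """Assumed rule: a single reject is enough to flag the whole sequence."""
    if any(verdict == "reject" for verdict in verdicts.values()):
        return "completed_with_failures"
    return "completed"  # hypothetical name for a run with no rejects


def rejected_models(verdicts: Dict[str, str]) -> list:
    """List the models that failed, so the report names them explicitly."""
    return [name for name, verdict in verdicts.items() if verdict == "reject"]
```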
What this proved
The useful result was not “model A beat model B.”
The useful result was that the lane stayed honest.
MiniMax produced something visually interesting, but it violated the contract. The validator rejected it. That matters more than whether it looked cool, because the whole point of this setup is to separate model taste from approval rules.
If I cannot trust the harness to say no, I cannot trust it when it says yes.
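For a sense of what a hard gate for non-canvas visual overlays might look like, here is a rough static check. It is only a sketch under my own assumptions: the allow-list is hypothetical, `find_overlay_tags` is an illustrative name, and a static scan like this cannot see elements that a script creates at runtime, so a real validator would also need to inspect the live page.

```python
from html.parser import HTMLParser
from typing import List

# Hypothetical allow-list: document scaffolding plus the canvas itself.
# Any other element is treated as a possible visual overlay.
ALLOWED_TAGS = {
    "html", "head", "body", "meta", "title",
    "style", "script", "link", "canvas",
}


class OverlayScanner(HTMLParser):
    """Collects every start tag that is not on the allow-list."""

    def __init__(self) -> None:
        super().__init__()
        self.offending: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag not in ALLOWED_TAGS:
            self.offending.append(tag)


def find_overlay_tags(html: str) -> List[str]:
    """Return the non-canvas elements found in the artifact, in document order."""
    scanner = OverlayScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.offending
```

An empty list means the artifact stayed inside the canvas-only contract as far as this check can tell; anything else is grounds for a reject, no matter how good the result looks.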