The fireworks test was really a runtime test

A local approval run across GLM 5.2, MiniMax M3, and DeepSeek V4 Flash, with each model kept in its own llama.cpp lane.

Watch the animations

Click a thumbnail to open the live run.

GLM 5.2	MiniMax M3	DeepSeek V4 Flash

Passed hard gates and landed in `manual_review`.	Lively, but rejected because visible non-canvas overlays helped the scene.	Passed hard gates and landed in `manual_review`.

The sequence finished as completed_with_failures.

That is the right result: two working canvas artifacts, one rejected artifact, and no network or browser-console failures during validation.

I wanted one simple thing: give three local models the same browser animation prompt and keep the results honest.

The annoying part was not the fireworks. It was everything around them.

Each model needed a different runtime lane:

GLM 5.2 ran through the GLM-capable llama.cpp server.
MiniMax M3 needed a MiniMax-specific llama.cpp PR build.
DeepSeek V4 Flash used the DS4 server lane with its own launch recipe.

That is the part I care about.

A scoreboard only tells me which model looked better. This run tested something more useful: whether the approval system can keep incompatible local runtimes isolated, start them cleanly, stop them cleanly, validate the browser artifact, and report failure without pretending everything passed.

Result

Model	Result	Note
GLM 5.2	`manual_review`	Functional fullscreen canvas fireworks.
MiniMax M3	`reject`	Failed the hard gate for non-canvas visual overlays.
DeepSeek V4 Flash	`manual_review`	Functional fullscreen canvas fireworks.

The source sequence report is results/admission-eval/model-approval-sequence/2026-06-21T01-50-00-000Z-sequence.json.

What this proved

The useful result was not “model A beat model B.”

The useful result was that the lane stayed honest.

MiniMax produced something visually interesting, but it violated the contract. The validator rejected it. That matters more than whether it looked cool, because the whole point of this setup is to separate model taste from approval rules.

If I cannot trust the harness to say no, I cannot trust it when it says yes.