by Stewart

The fireworks test was really a runtime test

A local approval run across GLM 5.2, MiniMax M3, and DeepSeek V4 Flash, with each model kept in its own llama.cpp lane.

Watch the animations

  • GLM 5.2: Passed hard gates and landed in manual_review.
  • MiniMax M3: Lively, but rejected because visible non-canvas overlays helped the scene.
  • DeepSeek V4 Flash: Passed hard gates and landed in manual_review.

The sequence finished as completed_with_failures.

That is the right result: two working canvas artifacts, one rejected artifact, and no network or browser-console failures during validation.
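
A minimal sketch of how per-model verdicts could roll up into one sequence status. The verdict names and completed_with_failures come from this run; the "completed" status, the FAILING set, and the function name are assumptions for illustration, not the harness's actual code.

```python
from collections import Counter

# Assumption: "reject" is the only verdict that counts as a failure here.
FAILING = {"reject"}


def sequence_status(verdicts):
    """Summarize per-model verdicts into one sequence-level status."""
    counts = Counter(verdicts.values())
    failures = sum(n for verdict, n in counts.items() if verdict in FAILING)
    # "completed" is a hypothetical name for the all-clear case.
    status = "completed_with_failures" if failures else "completed"
    return {"status": status, "counts": dict(counts)}


verdicts = {
    "GLM 5.2": "manual_review",
    "MiniMax M3": "reject",
    "DeepSeek V4 Flash": "manual_review",
}
report = sequence_status(verdicts)
```

The point of the roll-up is that a single reject changes the headline status, so a mostly good run cannot read as a clean one.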

I wanted one simple thing: give three local models the same browser animation prompt and keep the results honest.

The annoying part was not the fireworks. It was everything around them.

Each model needed a different runtime lane:

  • GLM 5.2 ran through the GLM-capable llama.cpp server.
  • MiniMax M3 needed a MiniMax-specific llama.cpp PR build.
  • DeepSeek V4 Flash used the DS4 server lane with its own launch recipe.
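
One way to keep lanes like these isolated is to give each model its own launch command and never let two servers overlap. The sketch below assumes that shape. The Lane class, the helper names, and the commands are all hypothetical: each command is a sleeping Python process standing in for the real server binary, whose launch recipe is not shown here.

```python
import subprocess
import sys
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    model: str
    command: tuple  # argv used to start this lane's server


def placeholder_server():
    # Stand-in for a real server launch; it only sleeps.
    return (sys.executable, "-c", "import time; time.sleep(60)")


# Hypothetical lane table; the real launch recipes differ per model.
LANES = {
    "GLM 5.2": Lane("GLM 5.2", placeholder_server()),
    "MiniMax M3": Lane("MiniMax M3", placeholder_server()),
    "DeepSeek V4 Flash": Lane("DeepSeek V4 Flash", placeholder_server()),
}


@contextmanager
def running(lane, stop_timeout=10.0):
    """Start a lane's server and guarantee it is stopped on exit."""
    proc = subprocess.Popen(lane.command)
    try:
        yield proc
    finally:
        proc.terminate()
        try:
            proc.wait(timeout=stop_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # escalate if the server ignores terminate
            proc.wait()


def run_sequence(lanes, evaluate):
    """Run lanes one at a time so runtimes never overlap."""
    results = {}
    for name, lane in lanes.items():
        with running(lane) as proc:
            results[name] = evaluate(name, proc)
    return results
```

Because the stop logic lives in the finally branch, a crash inside evaluate still tears the server down before the next lane starts.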

That is the part I care about.

A scoreboard only tells me which model looked better. This run tested something more useful: whether the approval system can keep incompatible local runtimes isolated, start them cleanly, stop them cleanly, validate the browser artifact, and report failure without pretending everything passed.

Result

Model              Result         Note
GLM 5.2            manual_review  Functional fullscreen canvas fireworks.
MiniMax M3         reject         Failed the hard gate for non-canvas visual overlays.
DeepSeek V4 Flash  manual_review  Functional fullscreen canvas fireworks.

The source sequence report is results/admission-eval/model-approval-sequence/2026-06-21T01-50-00-000Z-sequence.json.

What this proved

The useful result was not “model A beat model B.”

The useful result was that the harness stayed honest.

MiniMax produced something visually interesting, but it violated the contract. The validator rejected it. That matters more than whether it looked cool, because the whole point of this setup is to separate model taste from approval rules.
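
To make the non-canvas rule concrete, here is a deliberately simplified sketch of such a hard gate. It is an assumption about how a check like this might work, not the validator from this run: it only scans static markup with Python's html.parser, so it would miss elements created by script at runtime, which a real browser-based check would need to catch. The NON_VISUAL set and the function names are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: tags allowed in a canvas-only artifact. Anything else
# counts as a potential visual overlay in this sketch.
NON_VISUAL = {"html", "head", "body", "meta", "title", "link",
              "script", "style", "canvas"}


class OverlayFinder(HTMLParser):
    """Collect every start tag that is not on the allowed list."""

    def __init__(self):
        super().__init__()
        self.offenders = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this.
        if tag not in NON_VISUAL:
            self.offenders.append(tag)


def overlay_gate(html):
    """Reject if any non-canvas element appears; otherwise pass to review."""
    finder = OverlayFinder()
    finder.feed(html)
    finder.close()
    verdict = "reject" if finder.offenders else "manual_review"
    return {"verdict": verdict, "offenders": finder.offenders}
```

Passing the gate only earns manual_review, not approval. The gate decides whether the artifact followed the contract, and a person still decides whether it looks good.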

If I cannot trust the harness to say no, I cannot trust it when it says yes.