How far can inference-time compute push a frozen, open-source model? We take Google's DiffusionGemma‑26B as‑released and spend compute at inference — denoising depth × best‑of‑N — while a deterministic verifier keeps only answers that pass the official first‑aid protocol. No fine‑tuning, no gradient updates.
Accuracy vs. max denoising steps (single sample). Knee at 16 steps.
Verified accuracy vs. number of candidates sampled, with the answer selected by the verifier.
Best‑of‑N only works if you can pick the right candidate. We use a deterministic, rule‑based verifier (concept‑groups + forbidden actions) — not an LLM judge.
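A minimal sketch of how such a verifier and the best-of-N selection could fit together, assuming a plain substring check. The names `Protocol`, `verify`, `best_of_n`, and the burn example are hypothetical illustrations, not Lifeline's actual code or the official protocol text, and the sampler is a stub standing in for the model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Protocol:
    # Hypothetical structure: each concept group is a set of interchangeable
    # phrases, and at least one phrase from every group must appear.
    concept_groups: tuple[frozenset[str], ...]
    # Phrases that disqualify an answer outright.
    forbidden: frozenset[str]


def verify(answer: str, protocol: Protocol) -> bool:
    """Deterministic check: no forbidden phrase, every concept group covered."""
    text = answer.lower()
    if any(phrase in text for phrase in protocol.forbidden):
        return False
    return all(
        any(phrase in text for phrase in group)
        for group in protocol.concept_groups
    )


def best_of_n(
    sample: Callable[[int], str], n: int, protocol: Protocol
) -> Optional[str]:
    """Draw up to n candidates and return the first one the verifier accepts."""
    for seed in range(n):
        candidate = sample(seed)
        if verify(candidate, protocol):
            return candidate
    return None  # no candidate passed; the caller decides how to fall back


# Illustrative rules only, not the official first-aid protocol.
BURN = Protocol(
    concept_groups=(
        frozenset({"cool running water", "cool water"}),
        frozenset({"emergency services", "call for help"}),
    ),
    forbidden=frozenset({"apply butter", "apply ice"}),
)

# Stub sampler standing in for the model: one fixed answer per seed.
CANDIDATES = [
    "Apply butter to the burn and cover it.",
    "Cover the burn loosely.",
    "Hold the burn under cool running water, then call emergency services.",
]


def fake_sample(seed: int) -> str:
    return CANDIDATES[seed % len(CANDIDATES)]


if __name__ == "__main__":
    print(best_of_n(fake_sample, 3, BURN))
```

Because the check is pure string matching, the same candidate always gets the same verdict, which is what lets accuracy be read as a function of N. Returning `None` when nothing passes keeps an unverified answer from being presented as verified.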
Run python3 -m lifeline.judge_experiment to populate the head‑to‑head on the adversarial test set.
Lifeline · Inference‑Time Compute Hackathon 2026 · Build the machine + Build the future · powered by Google DiffusionGemma