Lifeline · Build the machine · zero training

How far can inference-time compute push a frozen, open-source model? We take Google's DiffusionGemma‑26B as‑released and spend compute at inference — denoising depth × best‑of‑N — while a deterministic verifier keeps only answers that pass the official first‑aid protocol. No fine‑tuning, no gradient updates.

Model: DiffusionGemma‑26B
Training: None
Single‑shot: 79.6%
Best‑of‑N verified: 98.5%
Latency: 0.5–1.9s

Knob 1 — denoising depth

Figure: accuracy vs. max denoising steps (single sample). Knee at 16 steps.

Knob 2 — best‑of‑N (at 16 steps)

Figure: verified accuracy vs. number of candidates sampled, with the winner selected by the verifier.
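As rough intuition for why a handful of candidates can be enough, here is a minimal sketch under a strong assumption: every candidate passes the verifier independently with the same probability p. Real samples from one model are correlated, so this is an idealized estimate and not a measured result. The function names are hypothetical and the inputs are the page's own figures (~68% single‑sample accuracy at the knee, 98.5% target).

```python
import math


def pass_at_n(p: float, n: int) -> float:
    """Chance that at least one of n independent candidates passes."""
    return 1.0 - (1.0 - p) ** n


def candidates_needed(p: float, target: float) -> int:
    """Smallest n with pass_at_n(p, n) >= target, under the same assumption."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# Assumed inputs taken from the page: p ~ 0.68, target 0.985.
n = candidates_needed(0.68, 0.985)
print(n, round(pass_at_n(0.68, n), 4))
```

Under the independence assumption, the failure probability shrinks geometrically with N, which is why the curve flattens quickly once a few candidates are sampled.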

The deterministic verifier vs an LLM‑judge

Best‑of‑N only works if you can pick the right candidate. We use a deterministic, rule‑based verifier (concept‑groups + forbidden actions) — not an LLM judge.

Property: deterministic verifier vs LLM judge
Consistency: deterministic vs stochastic
Reward‑hacking: immune vs fooled by fluency
Cost / candidate: ~0 vs 1 API call
Safety: hard guarantee

Run python3 -m lifeline.judge_experiment to populate the head‑to‑head on the adversarial test set.
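A rule‑based check of the kind described in this section (concept groups plus forbidden actions) could look roughly like the sketch below. It assumes simple lowercase substring matching. The `Protocol` class, the `verify` function, and the example burn rules are all hypothetical illustrations, not the project's actual rule set or the official protocol text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Protocol:
    """Hypothetical spec: each concept group needs at least one matching term."""
    name: str
    concept_groups: Tuple[Tuple[str, ...], ...]
    forbidden: Tuple[str, ...] = ()


def verify(answer: str, protocol: Protocol) -> bool:
    """Return True only if no forbidden phrase appears and every group is hit."""
    text = answer.lower()
    if any(phrase in text for phrase in protocol.forbidden):
        return False
    return all(
        any(term in text for term in group)
        for group in protocol.concept_groups
    )


# Illustrative rules only (assumed for this sketch).
BURN = Protocol(
    name="burn",
    concept_groups=(("cool", "running water"), ("cover", "dressing")),
    forbidden=("butter", "pop the blister"),
)

print(verify("Cool the burn under running water, then cover it loosely.", BURN))
print(verify("Apply butter, then cover it.", BURN))
```

The same input always yields the same verdict, and a fluent answer that omits a required concept still fails. One design consequence of plain substring matching is that it rejects any mention of a forbidden phrase, even inside a warning such as "do not apply butter", which errs on the side of rejecting.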

How to read this: a frozen open model that is right ~80% of the time single‑shot reaches ~98.5% verified accuracy purely by spending inference‑time compute, with no training. Increasing denoising depth lifts a near‑random model to its single‑sample ceiling (~68% at the knee); best‑of‑N then closes the gap to 98.5%. The effort‑manager exits early the moment a candidate verifies, so easy cases stay near‑instant: the "near‑0 latency" side of the bet.
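The early‑exit behaviour can be sketched as a loop that stops sampling as soon as one candidate verifies. This is a minimal illustration, not the project's effort‑manager: `best_of_n_early_exit`, its parameters, and the canned stub generator are hypothetical, and the real system would call the model where the stub is.

```python
from typing import Callable, Optional, Tuple


def best_of_n_early_exit(
    generate: Callable[[], str],
    verify: Callable[[str], bool],
    max_candidates: int = 8,
) -> Tuple[Optional[str], int]:
    """Sample up to max_candidates answers and return the first that verifies.

    Returns (answer, candidates_used); answer is None if nothing verified.
    """
    for used in range(1, max_candidates + 1):
        candidate = generate()
        if verify(candidate):
            return candidate, used  # early exit: no further sampling
    return None, max_candidates


# Stub generator standing in for the model (assumed for this sketch).
canned = iter(["call a friend", "apply pressure to the wound", "unused"])
answer, used = best_of_n_early_exit(
    generate=lambda: next(canned),
    verify=lambda text: "pressure" in text,
)
print(answer, used)
```

Because the loop returns on the first verified candidate, an easy case costs one generation, and only hard cases consume the full budget.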

Honest notes: all numbers are measured on the real model. An earlier verifier under‑counted burns, so the final run re‑validates all protocols. Cross‑checked on Qwen2.5‑7B, where best‑of‑N raises accuracy from 65% to 98.3%.

Lifeline · Inference‑Time Compute Hackathon 2026 · Build the machine + Build the future · powered by Google DiffusionGemma
