The benchmark shows that a thin layer of legal grounding makes a model measurably safer at inference. Fine-tuning asks the next question: can we teach the model to reach for that grounding on its own, internalising the stable reasoning into the weights while the harness keeps supplying the volatile facts? This page covers how we build, vet, organise, and measure that training, and how we guard against a model that memorises shortcuts instead of understanding.
Training and the harness are not rivals; they target different things. The weights should hold the stable substrate (indicator reasoning, grounded-refusal style, evidence-first shape, ILO framing). The harness keeps supplying the volatile facts (current hotlines, fee caps, fresh statutes) so the weights never go stale. We do not assume the two stack — we measure it, in four arms on the same 0–100 benchmark.
| arm | model | harness | what it isolates |
|---|---|---|---|
| A | stock | off | the raw baseline |
| B | stock | on | the inference-time harness lift (B−A) |
| C | trained | off | internalisation — training alone (C−A) |
| D | trained | on | stacking — trained + harness together (D vs B, C) |
The headline questions: what fraction of the harness lift (B−A) does training capture on its own (C−A), and does the harness still help a trained model (D−C > 0)?
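The two headline quantities can be written down directly from the four arm scores. The sketch below is illustrative only: the `ArmScores` class, the `summarise` helper, and the numbers fed to it are assumptions made for the example, not project code or measured results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArmScores:
    """Mean 0-100 benchmark score for each arm (hypothetical container)."""
    a: float  # stock model, harness off
    b: float  # stock model, harness on
    c: float  # trained model, harness off
    d: float  # trained model, harness on


def summarise(s: ArmScores) -> dict:
    harness_lift = s.b - s.a    # B - A
    training_lift = s.c - s.a   # C - A
    # Fraction of the harness lift that training captures on its own.
    # Left undefined (None) when the harness gives no lift to divide by.
    captured: Optional[float] = (
        training_lift / harness_lift if harness_lift > 0 else None
    )
    return {
        "harness_lift": harness_lift,
        "training_lift": training_lift,
        "captured_fraction": captured,
        "harness_gain_on_trained": s.d - s.c,     # D - C: does the harness still help?
        "training_gain_on_harnessed": s.d - s.b,  # D - B: does training add to the harness?
    }


# Made-up numbers, purely to exercise the arithmetic.
example = summarise(ArmScores(a=40.0, b=80.0, c=60.0, d=85.0))
```

With these made-up inputs, training alone captures half of the harness lift, and the harness still adds 25 points on top of the trained model.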
Every benchmark prompt already has a weak baseline reply and a high-scoring harnessed reply. The high-lift pairs are ready-made teaching material — no separate labelling pass. Nothing trains unvetted.
1. Build: high-lift (baseline → harnessed) pairs become SFT targets and DPO preferences (chosen = harnessed, rejected = baseline).
2. Vet: each pair must have a target score ≥ threshold and a clear lift, and must pass a PII / volatile-contact scrub (statute refs kept), a citation-accuracy check, and the OpenClaw gate.
3. Organise: hold out whole typologies, balance, interleave, and dedup, so the data can’t be solved by surface shortcuts.
4. Train: SFT (response-only), then DPO, on a Gemma 4 base; export GGUF / LiteRT for on-device use.
5. Measure: score the trained model in arms C/D on the same prompts; promote only what generalises.
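Part of the vetting step can be sketched in a few lines. The example below covers only the score gate and the volatile-contact scrub; the citation-accuracy check and the OpenClaw gate are not shown. The record field names, the thresholds, and the regular expressions are assumptions chosen for illustration, not the project's actual rules.

```python
import re

# Loose, illustrative patterns for contact details that go stale.
PHONE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub_contacts(text: str) -> str:
    """Replace phone numbers and emails with a placeholder, keeping statute refs."""

    def replace_phone(match: re.Match) -> str:
        # Only long digit runs count as phone numbers, so short
        # statute numbers such as "C181" are left alone.
        digits = sum(ch.isdigit() for ch in match.group())
        return "[CONTACT]" if digits >= 8 else match.group()

    return EMAIL.sub("[CONTACT]", PHONE.sub(replace_phone, text))


def passes_score_gate(record: dict, min_score: float = 80.0, min_lift: float = 20.0) -> bool:
    """Hypothetical thresholds: the target must score well AND clearly beat the baseline."""
    lift = record["harnessed_score"] - record["baseline_score"]
    return record["harnessed_score"] >= min_score and lift >= min_lift


cleaned = scrub_contacts(
    "Write to help@example.org or call +1 555 010 0199; see ILO C181."
)
```

The scrub runs on the target reply before it becomes a training example, so the weights learn the reasoning and the statute reference while the contact details stay with the harness.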
SFT (instruction → harnessed reply) teaches the answer shape; DPO (prefer the grounded reply over the plausible-but-wrong baseline) teaches the boundary. Both are derived from the live benchmark, so the training set grows as the benchmark sweep does.
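As a rough sketch of how one benchmark record could yield both kinds of training example, assuming hypothetical field names (`prompt`, `baseline_reply`, `harnessed_reply`, and the two scores) and an assumed minimum lift:

```python
def build_examples(records, min_lift=20.0):
    """Turn high-lift benchmark records into SFT targets and DPO preferences.

    Field names and the min_lift default are illustrative assumptions.
    """
    sft, dpo = [], []
    for r in records:
        lift = r["harnessed_score"] - r["baseline_score"]
        if lift < min_lift:
            continue  # low-lift pairs teach little, so they are skipped
        # SFT: instruction -> harnessed reply (teaches the answer shape).
        sft.append({"prompt": r["prompt"], "response": r["harnessed_reply"]})
        # DPO: prefer the grounded reply over the baseline (teaches the boundary).
        dpo.append({
            "prompt": r["prompt"],
            "chosen": r["harnessed_reply"],
            "rejected": r["baseline_reply"],
        })
    return sft, dpo


records = [
    {"prompt": "p1", "baseline_reply": "weak", "harnessed_reply": "grounded",
     "baseline_score": 30.0, "harnessed_score": 90.0},
    {"prompt": "p2", "baseline_reply": "fine", "harnessed_reply": "also fine",
     "baseline_score": 80.0, "harnessed_score": 85.0},
]
sft_examples, dpo_examples = build_examples(records)
```

Because both lists come from the same records, rerunning the builder after each benchmark sweep extends the training set without a separate labelling pass.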
Distillation can teach a model to parrot — map a fee-splitting-shaped prompt to “I cannot” + “ILO C181” without understanding the indicator. A shortcut wins on the training distribution and collapses on anything new. We name the shortcuts, organise the data against them, and — the crux — measure which one we got.
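Organising the data against shortcuts might look something like the following. This is a minimal sketch assuming each example carries a hypothetical `typology` field; exact-match dedup on normalised prompt text and round-robin interleaving are stand-ins for whatever the real pipeline does.

```python
import hashlib
import re
from collections import defaultdict
from itertools import zip_longest


def dedup(examples):
    """Drop examples whose prompt matches an earlier one after normalisation."""
    seen, kept = set(), []
    for ex in examples:
        normalised = re.sub(r"\s+", " ", ex["prompt"].lower()).strip()
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


def split_by_typology(examples, held_out):
    """Hold out WHOLE typologies, so the test set probes unseen situations."""
    train = [ex for ex in examples if ex["typology"] not in held_out]
    test = [ex for ex in examples if ex["typology"] in held_out]
    return train, test


def interleave(examples):
    """Round-robin across typologies so no single one dominates a stretch of training."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["typology"]].append(ex)
    ordered = []
    for row in zip_longest(*groups.values()):
        ordered.extend(ex for ex in row if ex is not None)
    return ordered


data = [
    {"prompt": "Fee split A", "typology": "fees"},
    {"prompt": "fee  split a", "typology": "fees"},   # duplicate after normalisation
    {"prompt": "Fee split B", "typology": "fees"},
    {"prompt": "Passport held", "typology": "documents"},
    {"prompt": "Debt bondage", "typology": "debt"},
]
train, test = split_by_typology(dedup(data), held_out={"debt"})
train = interleave(train)
```

A model that only learned a surface pattern for the `fees` typology has nothing to pattern-match on in the held-out `debt` set, which is what makes the held-out score informative.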
The model is a Gemma 4 base, fine-tuned on the vetted data with the same recipe the rest of the project uses. Each run is reproducible from its (git_sha, data_manifest) pair.
Load the base with FastModel; apply a LoRA adapter (get_peft_model). Use the gemma-4-thinking chat template and train_on_responses_only (the prompt is masked, so the loss is computed only on the reply the model should learn). The base weights are never overwritten; the result is a small adapter that ships next to the harness. A CPU-safe validate path checks the data and plan anywhere; the GPU run is one command on Kaggle.
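The idea behind response-only training can be shown without any training library. The sketch below is not the implementation of `train_on_responses_only`; it is a plain-Python illustration of prompt masking, with a hypothetical `mask_prompt` helper and made-up token ids. The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_prompt(prompt_ids, response_ids):
    """Build (input_ids, labels) so that only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Prompt positions get the ignore label; response positions keep their token ids.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Made-up token ids: three prompt tokens followed by two response tokens.
input_ids, labels = mask_prompt([11, 12, 13], [21, 22])
```

The model still sees the full prompt as context, but gradient only flows from the reply tokens, so it is not rewarded for reproducing the question.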
Each trained checkpoint is graded by the same self-family-excluded 0–100 judge panel as the benchmark, in all four arms, on the same prompts.
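One way to read "self-family-excluded" is that a judge from the same model family as the candidate never scores it. The sketch below assumes that reading, and assumes a simple mean over the remaining judges; the function name, family labels, and scores are all hypothetical.

```python
from statistics import mean


def panel_score(candidate_family, judge_scores):
    """Average the 0-100 scores from judges outside the candidate's own family.

    judge_scores is a list of (judge_family, score) pairs. Averaging is an
    assumed aggregation rule, used here only for illustration.
    """
    eligible = [score for family, score in judge_scores if family != candidate_family]
    if not eligible:
        raise ValueError("no judge outside the candidate's family")
    return mean(eligible)


# Made-up scores: the same-family judge is dropped before averaging.
score = panel_score("gemma", [("gemma", 95), ("llama", 70), ("qwen", 80)])
```

Using the same exclusion rule in all four arms keeps the arm-to-arm differences comparable, since every checkpoint is graded by the same set of eligible judges.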
The fine-tune builds on the same evidence base as the live benchmark and the harness-lift study; the training data grows as the benchmark sweeps the full ~74,640-prompt registry.