Phase 3 · methodology

Distilling the harness into the weights.

The benchmark shows that a thin layer of legal grounding makes any model measurably safer at inference. Fine-tuning asks the next question: can we teach the model to reach for that grounding on its own, internalising the stable reasoning into the weights while the harness keeps supplying the volatile facts? This page describes how we build, vet, organise, and measure that training, and how we guard against a model that memorises shortcuts instead of understanding.

01 · Why train, not just harness

Internalise the stable, tool the volatile.

Training and the harness are not rivals; they target different things. The weights should hold the stable substrate (indicator reasoning, grounded-refusal style, evidence-first shape, ILO framing). The harness keeps supplying the volatile facts (current hotlines, fee caps, fresh statutes) so the weights never go stale. We do not assume the two stack — we measure it, in four arms on the same 0–100 benchmark.

arm   model     harness   what it isolates
A     stock     off       the raw baseline
B     stock     on        the inference-time harness lift (B−A)
C     trained   off       internalisation: training alone (C−A)
D     trained   on        stacking: trained + harness together (D vs B, C)

The headline questions: what fraction of the harness lift (B−A) does training capture on its own (C−A), and does the harness still help a trained model (D−C > 0)?
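The two headline questions reduce to simple arithmetic on the four arm means. The sketch below is illustrative only: `ArmScores` and `decompose` are hypothetical names, and the scores in the usage note are made-up numbers, not results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArmScores:
    """Mean 0-100 benchmark score per arm (hypothetical container)."""
    a: float  # stock model, harness off
    b: float  # stock model, harness on
    c: float  # trained model, harness off
    d: float  # trained model, harness on


def decompose(s: ArmScores) -> dict:
    """Split the four arm means into the quantities the page asks about."""
    harness_lift = s.b - s.a            # B-A: inference-time harness lift
    internalisation = s.c - s.a         # C-A: what training captures alone
    residual_harness_lift = s.d - s.c   # D-C: does the harness still help?
    # The fraction is undefined when the harness gives no lift to begin with.
    captured: Optional[float] = (
        internalisation / harness_lift if harness_lift > 0 else None
    )
    return {
        "harness_lift": harness_lift,
        "internalisation": internalisation,
        "captured_fraction": captured,
        "residual_harness_lift": residual_harness_lift,
        "harness_still_helps": residual_harness_lift > 0,
    }
```

With placeholder means of A=40, B=70, C=55, D=75, `decompose` reports a harness lift of 30, internalisation of 15 (half the lift captured), and a residual harness lift of 20.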

02 · The training data

The benchmark’s proven lift becomes the training signal.

Every benchmark prompt already has a weak baseline reply and a high-scoring harnessed reply. The high-lift pairs are ready-made teaching material — no separate labelling pass. Nothing trains unvetted.

Generate · Distil the lift

High-lift (baseline → harnessed) pairs become SFT targets and DPO preferences (chosen = harnessed, rejected = baseline).

Vet · Quality + privacy

Target score ≥ threshold, clear lift, PII / volatile-contact scrub (statute refs kept), citation accuracy, OpenClaw gate.

Organise · Anti-shortcut

Hold out whole typologies, balance, interleave, dedup — so the data can’t be solved by surface shortcuts.

Train · Unsloth LoRA

SFT (response-only), then DPO, on a Gemma 4 base; export to GGUF / LiteRT for on-device use.

Evaluate · Four arms

Score the trained model in arms C/D on the same prompts; promote only what generalises.
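As a rough illustration of the Organise step, the sketch below deduplicates prompts, withholds whole typologies, and interleaves the rest in balanced round-robin order. Everything here is an assumption made for the example: the record fields (`prompt`, `typology`), the function names, and the normalisation used for deduplication are not taken from the project code.

```python
import random
from collections import defaultdict


def dedup(records):
    """Drop records whose prompt matches an earlier one after case/whitespace folding."""
    seen, unique = set(), []
    for rec in records:
        key = " ".join(rec["prompt"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def split_by_typology(records, held_out):
    """Withhold every record of the held-out typologies from training."""
    train = [r for r in records if r["typology"] not in held_out]
    unseen = [r for r in records if r["typology"] in held_out]
    return train, unseen


def balance_and_interleave(records, per_typology, seed=0):
    """Cap each typology at per_typology items, then emit them round-robin."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["typology"]].append(rec)
    capped = {}
    for name in sorted(buckets):          # sorted for a reproducible order
        items = buckets[name][:]
        rng.shuffle(items)
        capped[name] = items[:per_typology]
    mixed = []
    for i in range(per_typology):
        for name in capped:               # one item per typology per round
            if i < len(capped[name]):
                mixed.append(capped[name][i])
    return mixed
```

Splitting by typology rather than by individual prompt is what makes the held-out set a test of generalisation: no near-paraphrase of a held-out case can leak into training.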

SFT (instruction → harnessed reply) teaches the answer shape; DPO (prefer the grounded reply over the plausible-but-wrong baseline) teaches the boundary. Both are derived from the live benchmark, so the training set grows as the benchmark sweep does.
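A minimal sketch of how one benchmark pair could become both an SFT target and a DPO preference. The `BenchPair` type, the two thresholds, and the output field names (`prompt`/`completion`, `prompt`/`chosen`/`rejected`) are assumptions for illustration; the project's actual thresholds and schema are not stated on this page. The PII scrub, citation check, and OpenClaw gate are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchPair:
    """One benchmark prompt with its two scored replies (hypothetical type)."""
    prompt: str
    baseline_reply: str
    harnessed_reply: str
    baseline_score: float   # 0-100
    harnessed_score: float  # 0-100


MIN_TARGET_SCORE = 80.0  # assumed value for "target score >= threshold"
MIN_LIFT = 20.0          # assumed value for "clear lift"


def is_high_lift(p: BenchPair) -> bool:
    """Keep only pairs with a strong target and a clear improvement."""
    lift = p.harnessed_score - p.baseline_score
    return p.harnessed_score >= MIN_TARGET_SCORE and lift >= MIN_LIFT


def to_sft(p: BenchPair) -> dict:
    # SFT: instruction -> harnessed reply (teaches the answer shape).
    return {"prompt": p.prompt, "completion": p.harnessed_reply}


def to_dpo(p: BenchPair) -> dict:
    # DPO: chosen = harnessed, rejected = baseline (teaches the boundary).
    return {
        "prompt": p.prompt,
        "chosen": p.harnessed_reply,
        "rejected": p.baseline_reply,
    }


def build_training_sets(pairs):
    """Return (sft_records, dpo_records) built from the high-lift pairs only."""
    kept = [p for p in pairs if is_high_lift(p)]
    return [to_sft(p) for p in kept], [to_dpo(p) for p in kept]
```

Because both record types come from the same filtered pairs, the SFT and DPO sets stay aligned as new benchmark results arrive.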

03 · Understanding, not shortcuts

The failure mode we design against.

Distillation can teach a model to parrot — map a fee-splitting-shaped prompt to “I cannot” + “ILO C181” without understanding the indicator. A shortcut wins on the training distribution and collapses on anything new. We name the shortcuts, organise the data against them, and — the crux — measure which one we got.

How we measure understanding. The headline anti-shortcut metric is the held-out-typology generalisation gap: whole typologies are withheld from training, and we compare internalisation (C−A) on trained typologies against internalisation on unseen ones. A large gap indicates memorisation; a small gap indicates understanding. Alongside it we report perturbation robustness (does the score survive paraphrase and obfuscation?), counterfactual consistency (does the model correctly flip on minimal pairs?), and the over-refusal rate on benign controls. A variant counts as understanding only if it holds up on all three: held-out typologies, perturbation, and counterfactuals.
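The gap and the three-way conjunction can be written down directly. This is a sketch under stated assumptions: the input layout (per-typology arm A and arm C means), the function names, and every threshold are hypothetical, and the page does not specify how the gap is aggregated across typologies; an unweighted mean is used here.

```python
from statistics import mean


def generalisation_gap(per_typology, held_out):
    """Internalisation (C-A) on trained typologies minus C-A on unseen ones.

    per_typology maps a typology name to (mean_arm_a, mean_arm_c).
    """
    seen = [c - a for name, (a, c) in per_typology.items() if name not in held_out]
    unseen = [c - a for name, (a, c) in per_typology.items() if name in held_out]
    if not seen or not unseen:
        raise ValueError("need at least one trained and one held-out typology")
    return mean(seen) - mean(unseen)


def counts_as_understanding(gap, perturbation_drop, counterfactual_accuracy,
                            *, max_gap, max_drop, min_cf_accuracy):
    """All three checks must pass; one failing check is enough to reject."""
    return (
        gap <= max_gap
        and perturbation_drop <= max_drop
        and counterfactual_accuracy >= min_cf_accuracy
    )
```

For instance, trained typologies lifting by 30 and 20 points against a held-out typology lifting by only 5 gives a gap of 20 points, which would point towards memorisation under any small `max_gap`.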

04 · The training run

The canonical Unsloth recipe, reproducibly.

A Gemma 4 base, fine-tuned with the same recipe the rest of the project uses, on the vetted data — reproducible from (git_sha, data_manifest).
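One way to make "reproducible from (git_sha, data_manifest)" concrete is to derive a run identifier from exactly those two inputs. The sketch below is an assumption about how that might look, not the project's implementation: `manifest_digest`, `run_id`, and the manifest layout are hypothetical.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """SHA-256 of the manifest, serialised canonically so key order is irrelevant."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_id(git_sha: str, manifest: dict) -> str:
    """Identifier that changes whenever the code or the data changes."""
    return f"{git_sha[:12]}-{manifest_digest(manifest)[:12]}"
```

Two runs with the same identifier used the same commit and the same vetted data; any change to either input produces a different identifier.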

The base weights are never overwritten — the result is a small adapter that ships next to the harness. A CPU-safe validate path checks the data and plan anywhere; the GPU run is one command on Kaggle.
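A CPU-safe validation pass needs nothing beyond the standard library. As a hedged illustration, the check below inspects preference records before any GPU time is spent; the field names and the specific checks are assumptions, and the project's real validate path may cover more (or different) conditions.

```python
REQUIRED_DPO_FIELDS = ("prompt", "chosen", "rejected")  # assumed schema


def validate_dpo_record(rec: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record is fine."""
    problems = []
    for field in REQUIRED_DPO_FIELDS:
        value = rec.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A preference pair carries no signal if both replies are the same text.
    if not problems and rec["chosen"].strip() == rec["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


def validate_dataset(records) -> dict:
    """Map record index -> problems, listing only the records that fail."""
    report = {}
    for index, rec in enumerate(records):
        problems = validate_dpo_record(rec)
        if problems:
            report[index] = problems
    return report
```

An empty report means every record passed; anything else can be fixed on a laptop before the training job is launched.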

05 · The four-arm result

Promote what generalises, not what scores.

Each trained checkpoint is graded by the same self-family-excluded 0–100 judge panel as the benchmark, in all four arms, on the same prompts.
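Read literally, "self-family-excluded" suggests that a judge from the same model family as the candidate does not grade it. The sketch below encodes that reading; the family labels, the `panel_score` name, and the use of a plain mean to aggregate judges are all assumptions, since the page does not describe the aggregation.

```python
from statistics import mean


def panel_score(candidate_family: str, judge_scores) -> float:
    """Average the 0-100 scores of every judge outside the candidate's family.

    judge_scores is an iterable of (judge_family, score) pairs.
    """
    eligible = [score for family, score in judge_scores if family != candidate_family]
    if not eligible:
        raise ValueError("no judge left after excluding the candidate's own family")
    return mean(eligible)
```

For example, a candidate labelled `"gemma"` scored 95 by a `"gemma"` judge, 60 by a `"llama"` judge, and 70 by a `"qwen"` judge would receive 65: the same-family score is dropped before averaging.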

The selection rule is deliberate: we promote the variant with the smallest held-out generalisation gap and no over-refusal regression — never the one with the highest training-set score. A checkpoint that aced the training typologies but stumbles on held-out ones has memorised, not learned, and is rejected.
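The selection rule can be sketched as a filter followed by a minimum. The `Checkpoint` fields, the `select` function, and the `tolerance` parameter are hypothetical names for illustration; the numbers in the usage note are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    """Summary metrics for one trained variant (hypothetical type)."""
    name: str
    train_score: float        # score on trained typologies; never used to select
    held_out_gap: float       # generalisation gap on withheld typologies
    over_refusal_rate: float  # refusal rate on benign controls


def select(checkpoints, baseline_over_refusal: float,
           tolerance: float = 0.0) -> Optional[Checkpoint]:
    """Pick the smallest held-out gap among checkpoints with no over-refusal regression."""
    eligible = [
        c for c in checkpoints
        if c.over_refusal_rate <= baseline_over_refusal + tolerance
    ]
    if not eligible:
        return None  # nothing is promoted if every variant regresses
    return min(eligible, key=lambda c: c.held_out_gap)
```

Given three variants, one with the best training score but a 25-point gap, one with a 2-point gap but a higher over-refusal rate than the baseline, and one with a 6-point gap and no regression, `select` promotes the third.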

Honest framing. We do not claim training fully replaces the harness. The harness’s retrieval and tools supply facts the weights cannot memorise without going stale, so we expect, and report, a residual harness lift on the trained model (arm D > C). The standing limitation is the same as the benchmark’s: the judges are language models, not anti-trafficking professionals. A blinded human-expert validation is the precondition for any peer-reviewed claim.

06 · Reproduce + read more

Every step is committed code.

The fine-tune builds on the same evidence base as the live benchmark and the harness-lift study; the training data grows as the benchmark sweeps the full ~74,640-prompt registry.