Phase 3 · methodology

Distilling the harness into the weights.

The benchmark proves a thin layer of legal grounding makes any model measurably safer at inference. Fine-tuning asks the next question: can we teach the model to reach for that grounding on its own — internalising the stable reasoning into the weights, while the harness keeps supplying the volatile facts? This page is how we build, vet, organise, and measure that training — and how we guard against a model that memorises shortcuts instead of understanding.

01 · Why train, not just harness

Internalise the stable, tool the volatile.

Training and the harness are not rivals; they target different things. The weights should hold the stable substrate (indicator reasoning, grounded-refusal style, evidence-first shape, ILO framing). The harness keeps supplying the volatile facts (current hotlines, fee caps, fresh statutes) so the weights never go stale. We do not assume the two stack — we measure it, in four arms on the same 0–100 benchmark.

arm	model	harness	what it isolates
A	stock	off	the raw baseline
B	stock	on	the inference-time harness lift (B−A)
C	trained	off	internalisation — training alone (C−A)
D	trained	on	stacking — trained + harness together (D vs B, C)

The headline questions: what fraction of the harness lift (B−A) does training capture on its own (C−A), and does the harness still help a trained model (D−C > 0)?

02 · The training data

The benchmark’s proven lift becomes the training signal.

Every benchmark prompt already has a weak baseline reply and a high-scoring harnessed reply. The high-lift pairs are ready-made teaching material — no separate labelling pass. Nothing trains unvetted.

GenerateDistil the lift

High-lift (baseline → harnessed) pairs become SFT targets and DPO preferences (chosen = harnessed, rejected = baseline).

VetQuality + privacy

Target score ≥ threshold, clear lift, PII / volatile-contact scrub (statute refs kept), citation accuracy, OpenClaw gate.

OrganiseAnti-shortcut

Hold out whole typologies, balance, interleave, dedup — so the data can’t be solved by surface shortcuts.

TrainUnsloth LoRA

SFT (response-only) then DPO on a Gemma 4 base; export GGUF / LiteRT for on-device.

EvaluateFour arms

Score the trained model in arms C/D on the same prompts; promote only what generalises.

SFT (instruction → harnessed reply) teaches the answer shape; DPO (prefer the grounded reply over the plausible-but-wrong baseline) teaches the boundary. Both are derived from the live benchmark, so the training set grows as the benchmark sweep does.

03 · Understanding, not shortcuts

The failure mode we design against.

Distillation can teach a model to parrot — map a fee-splitting-shaped prompt to “I cannot” + “ILO C181” without understanding the indicator. A shortcut wins on the training distribution and collapses on anything new. We name the shortcuts, organise the data against them, and — the crux — measure which one we got.

Keyword→refusal — trigger words map to a canned refusal regardless of whether the request is actually exploitative.
Over-refusal — the model refuses a worker asking about their own rights (a benign query).
Citation parroting — it cites the most frequent statute in training, not the one the scenario calls for.
Obfuscation blindness — a base64 / homoglyph / code-switched version of the same scheme slips through.

How we measure understanding. The headline anti-shortcut metric is the held-out-typology generalisation gap: whole typologies are withheld from training, and we compare internalisation (C−A) on trained typologies vs unseen ones. A large gap = memorisation; a small gap = understanding. Alongside it: perturbation robustness (does the score survive paraphrase / obfuscation?), counterfactual consistency (does the model correctly flip on minimal pairs?), and an over-refusal rate on benign controls. A variant only counts as understanding if it holds up on held-out typologies and under perturbation and on counterfactuals.

04 · The training run

The canonical Unsloth recipe, reproducibly.

A Gemma 4 base, fine-tuned with the same recipe the rest of the project uses, on the vetted data — reproducible from (git_sha, data_manifest).

Load the base 4-bit with Unsloth FastModel; apply a LoRA adapter (get_peft_model).
SFT with the gemma-4-thinking chat template and train_on_responses_only (the prompt is masked — the model is graded only on the reply it should learn).
DPO over the chosen/rejected pairs so the model prefers the grounded reply to the plausible shortcut.
Export a q4_k_m GGUF for on-device (llama.cpp / LiteRT) — the NGO-on-a-laptop deployment.

The base weights are never overwritten — the result is a small adapter that ships next to the harness. A CPU-safe validate path checks the data and plan anywhere; the GPU run is one command on Kaggle.

05 · The four-arm result

Promote what generalises, not what scores.

Each trained checkpoint is graded by the same self-family-excluded 0–100 judge panel as the benchmark, in all four arms, on the same prompts.

The selection rule is deliberate: we promote the variant with the smallest held-out generalisation gap and no over-refusal regression — never the one with the highest training-set score. A checkpoint that aced the training typologies but stumbles on held-out ones has memorised, not learned, and is rejected.

Honest framing. We do not claim training fully replaces the harness. The harness’s retrieval and tools supply facts the weights cannot memorise without going stale, so we expect — and report — a residual harness lift on the trained model (arm D > C). And the standing limitation is the same as the benchmark’s: the judges are language models, not anti-trafficking professionals. A blinded human-expert validation is the precondition for any peer-reviewed claim.

06 · Reproduce + read more

Every step is committed code.

phase3_training_framework.md training_for_understanding.md build_lift_training_data.py organize_training_data.py train_lift_distill.py four_arm_eval.py

The fine-tune builds on the same evidence base as the live benchmark and the harness-lift study; the training data grows as the benchmark sweeps the full ~74,640-prompt registry.