Reference architectures

One system, four ways to deploy it.

There is no single architecture diagram that covers every use case honestly. Platform safety moderation, NGO caseworker support, anonymized data sharing, and public chatbot deployments each have different inputs, privacy boundaries, and rules for what can or cannot leave the runtime. This page documents each one separately, plus a note on how the underlying model is tuned.

Deployment 01: Platform safety moderation. Runs in customer infra · own data plane
Deployment 02: NGO caseworker implementation. Runs on caseworker device · no data leaves
Deployment 03: NGO with anonymized data sharing. Local + opt-in anonymized signal queue
Deployment 04: Chatbot endpoint. Hosted endpoint · channel adapter

Platform safety moderation

A platform that already handles user-generated content runs the DueCare runtime inside its own data plane. The runtime classifies and explains content against a curated rule set; the platform's existing systems decide what to do with the result.

Where it runs: Platform VPC, on a self-hosted GPU.
What's exposed: Nothing leaves the platform environment.
What it returns: A risk score and a suggested action (delete, escalate, notify). The platform's enforcement systems make the call.

Input: Platform content stream. Messages, posts, listings, and attachments flowing through the platform's existing pipeline.

Runtime · in platform VPC: DueCare moderation runtime. Model + safety guidance + rule packs. Classifies, explains, and cites.

Output: Risk score + suggested action. A per-item risk score plus a suggested action (e.g. delete, escalate to review, notify reporting authority). The platform's existing systems decide what to enforce.

Inside the runtime (platform VPC)

Model: Gemma 4, local inference

Guidance: Persona, GREP rules, RAG

Packs: Platform-specific rule packs

Tests: Quality testing on every pack update

No content leaves the platform environment. The runtime never calls back to DueCare.

Stays on platform side

All user-generated content
All risk scores & suggested actions
All audit logs

Never leaves

Raw posts, messages, or attachments
User identifiers
Enforcement decisions
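
The split between what the runtime returns and what the platform enforces can be sketched as follows. This is a minimal illustration, not DueCare's actual API: the `ModerationResult` fields, the 0.0 to 1.0 score range, the `review_threshold` default, and the "never auto-delete" policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class SuggestedAction(Enum):
    DELETE = "delete"
    ESCALATE = "escalate"
    NOTIFY = "notify"


@dataclass(frozen=True)
class ModerationResult:
    """Hypothetical shape of one runtime output; field names are assumed."""
    item_id: str                      # the platform's own identifier
    risk_score: float                 # assumed to lie in [0.0, 1.0]
    suggested_action: SuggestedAction
    explanation: str
    cited_rules: Tuple[str, ...]


def enforce(result: ModerationResult, review_threshold: float = 0.5) -> str:
    """Platform-side policy: the runtime only suggests, this function decides."""
    if result.risk_score < review_threshold:
        return "allow"
    if result.suggested_action is SuggestedAction.DELETE:
        # Assumed platform policy: deletions always go through a human first.
        return "queue_for_human_review"
    return result.suggested_action.value


high_risk = ModerationResult(
    "post-123", 0.82, SuggestedAction.DELETE,
    "Matches a rule in the platform rule pack", ("RULE-EXAMPLE-01",),
)
low_risk = ModerationResult("post-124", 0.10, SuggestedAction.ESCALATE, "", ())
decision_high = enforce(high_risk)
decision_low = enforce(low_risk)
```

Because `enforce` lives in the platform's code, the enforcement decision never has to be reported back to the runtime.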

NGO caseworker implementation

An NGO caseworker runs the DueCare runtime on their own laptop or workstation. They paste, type, or summarize a situation and the system produces a cited draft against the relevant signed knowledge packs. No case content leaves the device.

Where it runs: The caseworker's own laptop or workstation (local GPU recommended).
What's exposed: Nothing leaves the device.
What it returns: A cited draft for the caseworker to review.

Input: Caseworker UI on device. Caseworker types or summarizes a situation. No upload.

Runtime · on device: DueCare local runtime. Model + guidance + vetted packs. All inference is local.

Output: Cited draft for caseworker. A draft answer with public-source citations + pack version stamp. Caseworker decides.

Inside the runtime (caseworker's device)

Model: Gemma 4, local inference

Guidance: Persona, GREP rules, RAG

Packs: Signed corridor & jurisdiction packs

Updates: Pack pulls only; no case push

No case content ever leaves the device. Pack updates are signed and flow one way: they are pulled onto the device, and nothing is pushed back.

Stays on device

Anything the caseworker types
All drafts, edits, and decisions
All local logs

Never leaves

Worker / claimant identifying details
Free-text case content
Anything that could re-identify a person
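
Checking a pulled pack before loading it could look like the sketch below. The page does not say how packs are signed, so this uses HMAC-SHA256 with a shared key purely as a standard-library stand-in; a real deployment would more plausibly use a public-key signature scheme. The function name `verify_pack` and the pack fields are hypothetical.

```python
import hashlib
import hmac
import json


def verify_pack(pack_bytes: bytes, signature_hex: str, key: bytes) -> dict:
    """Parse a pack only if its signature matches; otherwise refuse to load it.

    HMAC-SHA256 is an assumption for illustration, not the documented scheme.
    """
    expected = hmac.new(key, pack_bytes, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(expected, signature_hex):
        raise ValueError("pack signature mismatch; refusing to load")
    return json.loads(pack_bytes)


# Demo values only; the key and pack contents are made up.
demo_key = b"demo-key-not-for-production"
pack = json.dumps({"pack_id": "corridor-example", "version": "1.0.0"}).encode()
signature = hmac.new(demo_key, pack, hashlib.sha256).hexdigest()

loaded = verify_pack(pack, signature, demo_key)
```

The check runs entirely on the device, so verifying a pack does not require sending anything upstream.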

NGO implementation with anonymized data sharing

An NGO that runs DueCare locally (as in deployment 02) can also opt in to contributing anonymized patterns, not cases, to a shared insights server. The local anonymization module is the gate; only k-anonymous, identifier-stripped signals can leave the device.

Where it runs: On-device runtime, plus an opt-in signal channel to a shared insights server.
What's exposed: Only k-anonymous, identifier-stripped signals; never case content.
What it returns: The same local draft as deployment 02; sharing is purely additive.

Input: Caseworker UI on device. Caseworker types the situation locally; nothing leaves the device unprompted.

Runtime · on device: DueCare local runtime. Same model + guidance + packs. Produces draft.

Output: Cited draft for caseworker. Local draft with rule citations. The caseworker decides what to do with it.

Opt-in anonymized side-channel

Same runtime · on device: DueCare local runtime. Identifies anonymized patterns (e.g. fee anomaly type X in jurisdiction Y).

Local Gemma 4 anonymization · on device: Local anonymization module. Anonymizes sensitive PII before submission, enforces the k-anonymity floor, and drops anything that fails.

Off-device: Shared insights server. Aggregated, anonymized signals only. No raw cases.

Opt-in. Only signals that pass the local anonymization module ever leave the device.

What may leave (opt-in only)

Anonymized pattern type (no free text)
Jurisdiction-level location (no precise location)
k-anonymous bucket counts

Never leaves, even with opt-in

Worker / claimant names or IDs
Employer or recruiter names
Free-text case descriptions
Anything that fails the k-anonymity floor
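
The k-anonymity floor described above can be sketched as a filter over bucket counts. This is an illustration under stated assumptions: the `Signal` fields, the default `k = 5`, and the choice to bucket by (pattern type, jurisdiction) are not taken from the DueCare implementation.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Signal(NamedTuple):
    """Hypothetical coded signal; it deliberately has no free-text field."""
    pattern_type: str   # coded category, e.g. "fee_anomaly_X"
    jurisdiction: str   # jurisdiction-level only, no precise location


def releasable_buckets(
    signals: Iterable[Signal], k: int = 5
) -> Dict[Tuple[str, str], int]:
    """Return only the bucket counts that meet the k-anonymity floor.

    Buckets with fewer than k signals are dropped, not rounded or padded.
    """
    counts = Counter((s.pattern_type, s.jurisdiction) for s in signals)
    return {bucket: n for bucket, n in counts.items() if n >= k}


signals = (
    [Signal("fee_anomaly_X", "jurisdiction_Y")] * 5
    + [Signal("contract_swap", "jurisdiction_Z")] * 2
)
released = releasable_buckets(signals, k=5)
```

In this example only the first bucket is released; the two-signal bucket fails the floor and stays on the device.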

Chatbot endpoint

This deployment is for partners that want to expose DueCare through an existing chat surface (a regulator hotline web widget, a partner-operated messenger bot, or a labour-rights helpline). The endpoint is hosted on a GPU-enabled server; the channel adapter handles the chat surface. The system still drafts; the partner still decides.

Where it runs: A partner-hosted endpoint with a channel adapter (e.g. WhatsApp, web).
What's exposed: The hosted endpoint sees inbound questions; raw case intake is not accepted.
What it returns: A drafted response; the partner still decides what to send.

External channel: Partner chat surface. Web widget, messenger app, or partner-operated chatbot.

Adapter · partner side: Channel adapter. Normalizes message in/out. Strips channel-specific identifiers before forwarding.

Runtime · partner-hosted GPU: DueCare endpoint. Same runtime as 01–03. Drafts cited replies. Returns through adapter.

Endpoint posture

Auth: Partner-issued tokens · per-channel

Logs: Hashed inputs · drop after retention window

Rate: Per-channel quotas · partner controls

Refusal: Out-of-scope & no-pack cases handled explicitly

Endpoint refuses raw case intake. Free-text identifiers are stripped at the adapter.
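
A channel adapter's normalization step might look like the sketch below. It is an assumption-laden illustration: the inbound field names, the allow-list approach, and the two regular expressions are hypothetical, and pattern-based redaction is approximate (it will not catch names or unusual identifier formats).

```python
import re
from typing import Any, Dict

# Rough patterns for illustration only; real identifier detection needs more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def normalize(raw: Dict[str, Any]) -> Dict[str, str]:
    """Forward only an allow-listed view of the inbound channel message.

    Channel-specific fields (sender IDs, display names, and so on) are
    dropped because they are never copied into the outgoing dict.
    """
    text = str(raw.get("text", ""))
    text = EMAIL.sub("[email]", text)
    text = PHONE.sub("[phone]", text)
    return {"channel": str(raw.get("channel", "unknown")), "text": text}


inbound = {
    "channel": "web",
    "sender_id": "u-991",  # hypothetical channel-specific identifier
    "text": "Call me on +1 555 010 9999 or a.b@example.org about the fee",
}
forwarded = normalize(inbound)
```

An allow-list is used rather than a deny-list so that a new channel field added upstream is dropped by default instead of forwarded by accident.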

Handled at endpoint

Cited drafts based on vetted packs
Out-of-scope refusals with reason
Pack-version stamps on every reply

Endpoint will not do

File complaints, send messages, or call services
Store raw user messages past retention
Answer without a relevant vetted pack
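
The endpoint behaviours listed above (pack-version stamps, explicit refusals, hashed logs) can be sketched together. Everything named here is hypothetical: the `Reply` type, the `PACK_INDEX` lookup, the pack version string, and the assumption that a topic has already been determined upstream. Retention-window cleanup is left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Reply:
    text: str
    refused: bool
    pack_version: Optional[str] = None  # stamped on every drafted reply
    reason: Optional[str] = None        # stated on every refusal


# Hypothetical index: topic -> (pack version stamp, citation label).
PACK_INDEX: Dict[str, Tuple[str, str]] = {
    "recruitment_fees": ("fees-pack@1.2.0", "public source A"),
}

audit_log: List[str] = []  # holds SHA-256 digests only, never message text


def handle(topic: str, message: str) -> Reply:
    """Draft a cited reply, or refuse explicitly when no vetted pack applies."""
    audit_log.append(hashlib.sha256(message.encode("utf-8")).hexdigest())

    pack = PACK_INDEX.get(topic)
    if pack is None:
        return Reply(text="", refused=True, reason="no relevant vetted pack")

    version, citation = pack
    draft = f"[draft for partner review] See {citation}."
    return Reply(text=draft, refused=False, pack_version=version)


drafted = handle("recruitment_fees", "Is this fee allowed?")
refused = handle("tax_advice", "How do I file taxes?")
```

A plain hash of a short message can be guessed by brute force, so a production log would more likely use a keyed or salted digest; the unkeyed hash here only keeps the sketch short.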

Cost-per-token (estimates) for platform safety implementations

This is an order-of-magnitude comparison of the per-million-token cost of using a hosted commercial LLM API versus running DueCare's self-hosted Gemma 4 runtime for a moderation-style workload. The numbers are placeholders pending our own benchmark; see the caveat below.

Option	Input cost (per 1M tokens)	Output cost (per 1M tokens)	Privacy posture	Latency profile
Frontier hosted API (e.g. top-tier commercial chat model)	~$3–15	~$10–60	Data leaves customer perimeter unless contractual carve-outs apply	Network round-trip · subject to provider
Mid-tier hosted API (e.g. general-purpose hosted SLM)	~$0.50–2	~$1.50–8	Varies by tier and contract	Network round-trip · usually faster
DueCare · self-hosted Gemma 4 (customer / partner GPU · in-VPC)	~$0.05–0.30 (amortized)	~$0.05–0.30 (amortized)	Stays in customer / partner environment	Local inference · no external round-trip
DueCare · on-device (caseworker laptop / workstation)	Marginal cost ≈ device electricity per request	Marginal cost ≈ device electricity per request	No network egress for inference	Bound by device capability

Why this matters for platform safety: moderation workloads are high-throughput. At platform-scale token volumes, the gap between a hosted API and a self-hosted small model compounds quickly, and an in-VPC deployment removes the need for the privacy carve-outs that hosted APIs require for sensitive content streams.

Caveat: these ranges are illustrative, not authoritative. Hosted API prices change frequently and vary by tier, region, and contract. Self-hosted amortized cost depends heavily on GPU choice, batch size, utilization, and electricity. We will publish a benchmark with specific assumptions, request shapes, and measured throughput alongside the v1 release; this section will be updated to reflect that. Treat the numbers above as orientation, not quotation.
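
To see how the gap compounds with volume, the placeholder ranges from the table can be multiplied out for a sample workload. The workload below (1,000M input tokens and 100M output tokens per month) is a hypothetical chosen for round numbers, and the prices are the same illustrative ranges, so the results carry the same caveat.

```python
from typing import Dict, Tuple

# (low, high) USD per 1M tokens, copied from the illustrative table above.
PRICES: Dict[str, Dict[str, Tuple[float, float]]] = {
    "frontier_hosted": {"in": (3.0, 15.0), "out": (10.0, 60.0)},
    "mid_tier_hosted": {"in": (0.50, 2.0), "out": (1.50, 8.0)},
    "self_hosted_gemma": {"in": (0.05, 0.30), "out": (0.05, 0.30)},
}


def monthly_cost_range(
    option: str, in_mtok: float, out_mtok: float
) -> Tuple[float, float]:
    """Return (low, high) monthly cost in USD for volumes given in millions of tokens."""
    price = PRICES[option]
    low = in_mtok * price["in"][0] + out_mtok * price["out"][0]
    high = in_mtok * price["in"][1] + out_mtok * price["out"][1]
    return round(low, 2), round(high, 2)


# Hypothetical workload: 1,000M input and 100M output tokens per month.
for name in PRICES:
    low, high = monthly_cost_range(name, in_mtok=1000, out_mtok=100)
    print(f"{name}: ${low:,.0f} to ${high:,.0f} per month")
```

Under that assumed volume the sketch gives roughly $4,000 to $21,000 per month for the frontier hosted range, $650 to $2,800 for the mid-tier range, and $55 to $330 for the self-hosted range.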

How the model is tuned (and when it's not)

For most deployments, no fine-tuning is required: the base Gemma 4 model is steered by the safety guidance layer (persona, GREP rules, RAG over vetted packs). Fine-tuning is reserved for cases where the safety guidance layer cannot reach acceptable behavior on its own, and is always followed by the full quality testing framework.

Input · public only: Curated public material. Laws, advisories, judgments, passed through the local anonymization module before training.

Dataset: Tuning corpus. Versioned. Public-source only. Reviewed before release.

Module: Fine-tuning module. Reads the curated corpus, applies the tuning recipe (targeted updates via LoRA-style adapters or a full fine-tune, depending on goal), and emits a candidate model artifact.

Gate: Quality testing framework. Behavioural + regression evals · LLM-judge spot-checks · refusal coverage.

Artifact: Signed model artifact. Versioned. Pinned to the pack range it was tested against. Distributable.

No worker case data is ever used for tuning. Tuning corpora are public-source only.
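
Pinning a model artifact to the pack range it was tested against could be enforced with a check like the one below. This is a sketch under assumptions: the `ModelArtifact` fields, the three-part dotted version format, and the inclusive range semantics are hypothetical and not taken from the DueCare release process.

```python
from dataclasses import dataclass
from typing import Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse 'major.minor.patch' into a tuple so comparison is numeric."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


@dataclass(frozen=True)
class ModelArtifact:
    """Hypothetical manifest for a tuned model artifact."""
    model_version: str
    min_pack: str  # lowest pack version covered by quality testing
    max_pack: str  # highest pack version covered by quality testing

    def supports(self, pack_version: str) -> bool:
        """True if the pack falls inside the tested range (inclusive)."""
        low = parse_version(self.min_pack)
        high = parse_version(self.max_pack)
        return low <= parse_version(pack_version) <= high


artifact = ModelArtifact("tuned-0.1", min_pack="1.2.0", max_pack="1.10.0")
inside = artifact.supports("1.9.3")
outside = artifact.supports("1.11.0")
```

Versions are compared as integer tuples rather than strings, because a plain string comparison would sort "1.10.0" before "1.9.3" and wrongly reject a pack inside the tested range.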

▸ Default posture: no fine-tuning. The safety guidance layer is the first lever; fine-tuning is the last.