There is no single architecture diagram that covers every use case honestly. Platform safety moderation, NGO caseworker support, anonymized data sharing, and public chatbot deployments each have different inputs, privacy boundaries, and rules for what can or cannot leave the runtime. This page documents each one separately, plus a note on how the underlying model is tuned.
A platform that already handles user-generated content runs the DueCare runtime inside its own data plane. The runtime classifies and explains content against a curated rule set; the platform's existing systems decide what to do with the result.
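The split between "runtime classifies and explains" and "platform decides" can be sketched as follows. This is a minimal illustration, not DueCare's actual interface: the `Rule`, `Finding`, `classify`, and `platform_decide` names, and the two example rules, are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    pattern: str        # regular expression (hypothetical example)
    explanation: str


@dataclass(frozen=True)
class Finding:
    rule_id: str
    explanation: str


# Hypothetical rules for illustration only; not DueCare's curated rule set.
RULES = [
    Rule("fee-001", r"\b(placement|recruitment) fee\b",
         "Mentions a recruitment or placement fee."),
    Rule("doc-001", r"\b(keep|hold|surrender) (your|the) passport\b",
         "Mentions passport retention."),
]


def classify(text, rules=RULES):
    """Return every matching rule with its explanation. Takes no action."""
    return [
        Finding(rule.rule_id, rule.explanation)
        for rule in rules
        if re.search(rule.pattern, text, re.IGNORECASE)
    ]


def platform_decide(findings):
    """Stand-in for the platform's own policy; the runtime never calls this."""
    return "queue_for_review" if findings else "allow"


findings = classify("We will keep your passport until the placement fee is paid.")
decision = platform_decide(findings)
```

The point of the separation is that `classify` only returns findings and explanations; anything that changes what a user sees lives in code the platform owns.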
An NGO caseworker runs the DueCare runtime on their own laptop or workstation. They paste, type, or summarize a situation and the system produces a cited draft against the relevant signed knowledge packs. No case content leaves the device.
An NGO that runs DueCare locally (as in deployment 02) can also opt in to contributing anonymized patterns, not cases, to a shared insights server. The local anonymization module is the gate; only k-anonymous, identifier-stripped signals can leave the device.
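One way to picture the anonymization gate is as a two-step filter: drop direct identifiers, then release only patterns shared by at least k records. The sketch below assumes suppression of small groups as the k-anonymity mechanism; the field names, the `IDENTIFIER_FIELDS` set, and the function names are hypothetical and not taken from the DueCare module.

```python
from collections import Counter

# Hypothetical direct identifiers; the real module's list is not documented here.
IDENTIFIER_FIELDS = {"name", "phone", "passport_number"}


def strip_identifiers(record):
    """Drop direct identifiers from a single record."""
    return {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}


def k_anonymous_signals(records, quasi_identifiers, k):
    """Aggregate records by quasi-identifier values and keep only groups
    with at least k members. Individual records are never returned."""
    stripped = [strip_identifiers(r) for r in records]
    keys = [tuple(r.get(q) for q in quasi_identifiers) for r in stripped]
    counts = Counter(keys)
    return [
        {"pattern": dict(zip(quasi_identifiers, key)), "count": n}
        for key, n in counts.items()
        if n >= k
    ]


# Hypothetical local records; only the aggregate would be eligible to leave the device.
records = [
    {"name": "A", "corridor": "X-Y", "issue": "fee"},
    {"name": "B", "corridor": "X-Y", "issue": "fee"},
    {"name": "C", "corridor": "X-Y", "issue": "fee"},
    {"name": "D", "corridor": "X-Z", "issue": "passport"},
]
signals = k_anonymous_signals(records, ["corridor", "issue"], k=3)
```

In this example the single `X-Z` record is suppressed because its group is smaller than k, so only the aggregated `X-Y` pattern and its count remain.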
This deployment is for partners that want to expose DueCare through an existing chat surface (a regulator hotline web widget, a partner-operated messenger bot, or a labour-rights helpline). The endpoint is hosted on a GPU-enabled server; the channel adapter handles the chat surface. The system still drafts; the partner still decides.
This is an order-of-magnitude comparison of the per-million-token cost of using a hosted commercial LLM API versus running DueCare's self-hosted Gemma 4 runtime for a moderation-style workload. Numbers are placeholders pending our own benchmark; see the caveat below.
| Option | Input cost | Output cost | Privacy posture | Latency profile |
|---|---|---|---|---|
| Frontier hosted API (e.g. top-tier commercial chat model) | ~$3–15/1M tokens (in) | ~$10–60/1M tokens (out) | Data leaves customer perimeter unless contractual carve-outs | Network round-trip · subject to provider |
| Mid-tier hosted API (e.g. general-purpose hosted SLM) | ~$0.50–2/1M tokens (in) | ~$1.50–8/1M tokens (out) | Varies by tier and contract | Network round-trip · usually faster |
| DueCare · self-hosted Gemma 4 (customer / partner GPU · in-VPC) | ~$0.05–0.30/1M tokens (amortized) | ~$0.05–0.30/1M tokens (amortized) | Stays in customer / partner environment | Local inference · no external round-trip |
| DueCare · on-device (caseworker laptop / workstation) | Marginal cost ≈ device electricity per request | Marginal cost ≈ device electricity per request | No network egress for inference | Bound by device capability |
Caveat: these ranges are illustrative, not authoritative. Hosted API prices change frequently and vary by tier, region, and contract. Self-hosted amortized cost depends heavily on GPU choice, batch size, utilization, and electricity. We will publish a benchmark with specific assumptions, request shapes, and measured throughput alongside the v1 release; this section will be updated to reflect that. Treat the numbers above as orientation, not quotation.
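To show why self-hosted amortized cost moves so much with GPU choice, throughput, and utilization, here is a back-of-envelope calculation. The function name and all input values are hypothetical and are not measurements; electricity is assumed to be folded into the hourly GPU cost, and batch size is assumed to show up only through the throughput figure.

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second, utilization):
    """Amortized cost in dollars per 1M tokens.

    gpu_cost_per_hour: all-in hourly cost of the GPU (hardware or rental, power).
    tokens_per_second: sustained throughput while the GPU is busy.
    utilization: fraction of each hour the GPU is actually serving requests.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Made-up inputs for illustration only.
busy = cost_per_million_tokens(gpu_cost_per_hour=1.0, tokens_per_second=2000, utilization=0.5)
idle = cost_per_million_tokens(gpu_cost_per_hour=1.0, tokens_per_second=2000, utilization=0.1)
```

With these made-up inputs, the half-utilized GPU works out to roughly $0.28 per 1M tokens, and the same GPU at 10% utilization costs five times as much per token, which is why the benchmark needs to state its utilization assumption.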
For most deployments, no fine-tuning is required: the base Gemma 4 model is steered by the safety guidance layer (persona, GREP rules, RAG over vetted packs). Fine-tuning is reserved for cases where the safety guidance layer cannot reach acceptable behavior on its own, and is always followed by the full quality testing framework.
▸ Default posture: no fine-tuning. The safety guidance layer is the first lever; fine-tuning is the last.