How operator AI suggests, never decides; how every prompt is journaled; how a regulator pulls the input hash, the output, and the human override seven years later.
The hard part of AI in regulated operations is not the model. It is the loop. Where the model sits, what it sees, what it can do, who confirms its output, and how a regulator reconstructs any single decision a year later. We call the answer to that bundle of questions the bounded perimeter.
Coreal’s operator AI runs in a per-tenant, per-task sandbox. It receives a structured input: the case data the analyst is looking at, the relevant policy excerpts, the schema of the action. It returns a structured suggestion: a recommended action with a confidence score and a written rationale. It does not execute the action. The analyst executes the action, after reading the suggestion.
This is not a limitation; it is the design. The model is a research assistant — fast, cheap, never tired, occasionally wrong. The analyst is the decision-maker — accountable, named, journaled. The split is unbreakable.
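In sketch form, the contract between agent and analyst looks roughly like this; the class and field names are illustrative, not the actual schema:

```python
# Illustrative request/response shapes; these names are assumptions, not
# Coreal's actual schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentInput:
    case_id: str
    case_data: dict             # the structured case the analyst is looking at
    policy_excerpts: list[str]  # relevant policy text, pre-selected per task
    action_schema: dict         # schema of the action under consideration

@dataclass(frozen=True)
class AgentSuggestion:
    recommended_action: dict    # must conform to action_schema
    confidence: float           # 0.0 to 1.0
    rationale: str              # written justification the analyst reads

# The agent only ever returns an AgentSuggestion. Execution happens in the
# analyst's confirm step, never here.
```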
Every agent invocation is a journaled event with: input hash (sha256 of the structured input), prompt template version, model identifier and version, output, confidence score, and the analyst’s decision (accept / modify / override) with timestamp. The journal is append-only and immutable, and it records nothing else.
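A minimal sketch of what that record could look like, assuming the journal is an append-only store; the function and field names are illustrative:

```python
# Sketch of the journaled invocation event; names are assumptions.
import hashlib
import json
from datetime import datetime, timezone

def journal_invocation(journal, structured_input: dict, output: dict, confidence: float,
                       decision: str, prompt_template_version: str,
                       model_id: str, model_version: str) -> dict:
    """Append one record per agent invocation to an append-only journal."""
    assert decision in ("accept", "modify", "override")
    canonical = json.dumps(structured_input, sort_keys=True).encode()
    event = {
        "input_hash": hashlib.sha256(canonical).hexdigest(),
        "prompt_template_version": prompt_template_version,
        "model": f"{model_id}:{model_version}",
        "output": output,
        "confidence": confidence,
        "analyst_decision": decision,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    journal.append(event)  # records are never updated or deleted
    return event
```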
A regulator pulling case 12345 from May 2023 gets: the input the agent saw, the model that ran, the output it produced, the analyst who confirmed it, the action that resulted, and the posting that landed in the ledger. Same input → same output, deterministically, because the model version is pinned. If we changed the model since, the regulator sees that too — and the new model can be run against the same input to compare.
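Roughly, the reconstruction looks like this; the journal and registry interfaces here are hypothetical stand-ins, not a real API:

```python
# Sketch of the reconstruction a regulator request triggers. The journal and
# model registry methods are hypothetical interfaces.
def reconstruct_decision(journal, model_registry, case_id: str) -> dict:
    event = journal.get_invocation(case_id)               # journaled record for the case
    structured_input = journal.get_input(event["input_hash"])
    pinned_model = model_registry.load(event["model"])    # the exact version that ran
    replayed_output = pinned_model.suggest(structured_input)
    return {
        "input": structured_input,
        "model": event["model"],
        "original_output": event["output"],
        "replayed_output": replayed_output,                # pinned version: must match
        "analyst_decision": event["analyst_decision"],
    }
```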
When a model is replaced (and they are replaced, often), the previous decisions can be re-evaluated against the new model. Disagreement rate is a KPI. A new model that disagrees with > 10% of the previous model’s confirmed decisions is a model risk event — it gets a 2L review before it ships.
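As a sketch, the gate is a few lines; the 10% threshold and the 2L-review requirement come from the policy above, the helper names are assumed:

```python
# Sketch of the disagreement KPI for a candidate replacement model.
def disagreement_rate(candidate_model, confirmed_decisions) -> float:
    """Fraction of previously confirmed decisions the candidate disagrees with."""
    disagreements = sum(
        1 for d in confirmed_decisions
        if candidate_model.suggest(d["structured_input"]) != d["confirmed_action"]
    )
    return disagreements / len(confirmed_decisions)

def is_model_risk_event(rate: float, threshold: float = 0.10) -> bool:
    # Above the threshold, the rollout is blocked pending second-line (2L) review.
    return rate > threshold
```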
What the agent does: case triage; surfacing related cases; drafting an analyst’s case-closure note; preparing the structured KYT case from the raw transaction graph; ranking the merchants in a velocity alert by risk-similarity to past confirmed-fraud merchants; generating a first-draft SAR narrative from the structured case data.
These are research-assistant tasks. The agent compresses analyst time by 30–40% on observed workloads, with an override rate around 7%. The override rate is the meaningful KPI: it tells you the analyst is reading and disagreeing, not rubber-stamping. If the override rate goes below 3%, we audit; either the model has become silently wrong or the analyst has stopped paying attention.
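The check itself is trivial, which is the point; a sketch, reusing the journaled event shape from earlier and the 3% floor stated above:

```python
# Sketch of the override-rate monitor; event shape matches the journal sketch.
def override_rate(events) -> float:
    overrides = sum(1 for e in events if e["analyst_decision"] == "override")
    return overrides / len(events)

def needs_audit(events, floor: float = 0.03) -> bool:
    # Too few overrides means either the model has drifted and nobody is
    # catching it, or analysts have stopped reading. Either way, audit.
    return override_rate(events) < floor
```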
Each model version has a model risk file in the repo: the use case it was approved for, the inputs it sees, the actions it can recommend, the test set it was evaluated on, the disagreement rate against the previous version, the named 2L approver. A model that does not have this file does not run in production. The CI checks for the file’s presence and the validity of the named approver before deploy.
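A sketch of such a CI gate, assuming the model risk file is YAML and the 2L approver roster is available to the pipeline; every name here is illustrative:

```python
# Sketch of the pre-deploy check on the model risk file; names are assumptions.
import sys
import yaml  # PyYAML

REQUIRED_FIELDS = {
    "approved_use_case", "inputs", "recommendable_actions",
    "evaluation_test_set", "disagreement_rate_vs_previous", "approver_2l",
}

def check_model_risk_file(path: str, valid_2l_approvers: set[str]) -> None:
    try:
        with open(path) as f:
            risk_file = yaml.safe_load(f)
    except FileNotFoundError:
        sys.exit(f"no model risk file at {path}: model may not run in production")
    missing = REQUIRED_FIELDS - risk_file.keys()
    if missing:
        sys.exit(f"model risk file incomplete, missing: {sorted(missing)}")
    if risk_file["approver_2l"] not in valid_2l_approvers:
        sys.exit(f"{risk_file['approver_2l']!r} is not a valid 2L approver")
    # Any failure above blocks the deploy.
```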
A regulator asking "what model was running for case 12345" gets: the model version from the journal, the model risk file from the repo at that commit, the test set artefacts that supported the approval, the named approver. Five minutes of work, not five days.
Autonomous agents that take actions without human confirmation. Real-time customer-facing chat that commits to outcomes. Cross-tenant model fine-tuning. Reading raw PII for any reason. These are out of scope, and they should be — for years. The compliance ROI of a bounded design is enormous; the marginal product gain from autonomy is small. We will revisit when the regulatory clarity catches up.