Why 89% Is the Ceiling for Label-Based Compliance Models
Every supervised machine-learning model in compliance shares a fundamental limitation: it can only be as good as its training labels. If a model is trained to predict whether a compliance document package is "adequate" or "inadequate," the ceiling for that model is set by the consistency of the human labellers who created the training data. And in compliance, that ceiling is remarkably low.
The Inter-Annotator Agreement Problem
In our research at DeepHavn, we conducted a study where 40 experienced compliance analysts were asked to evaluate the same set of 2,000 compliance document packages. Each package was reviewed by at least three analysts, who independently classified it as adequate or inadequate for a hypothetical correspondent banking relationship.
The inter-annotator agreement rate was 89.2 percent. In other words, compliance professionals -- people with years of experience making these exact judgements -- disagreed with each other on roughly one in ten packages. This is not a failure of competence. It reflects the inherent subjectivity in compliance assessments, where reasonable professionals can reach different conclusions based on the same evidence.
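To make the metric concrete, here is a minimal sketch of how a pairwise agreement rate like the 89.2 percent figure can be computed. The `reviews` data and package ids are illustrative, not drawn from the DeepHavn study:

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """Mean pairwise percent agreement across packages.

    `annotations` maps a package id to the list of labels its
    reviewers independently assigned ("adequate" / "inadequate").
    """
    agree = total = 0
    for labels in annotations.values():
        # Compare every pair of reviewers who saw the same package.
        for a, b in combinations(labels, 2):
            agree += (a == b)
            total += 1
    return agree / total

reviews = {
    "pkg-001": ["adequate", "adequate", "adequate"],
    "pkg-002": ["adequate", "inadequate", "adequate"],
    "pkg-003": ["inadequate", "inadequate", "inadequate"],
}
print(round(pairwise_agreement(reviews), 3))  # 0.778
```

Raw percent agreement is the simplest choice; a study of this kind would typically also report a chance-corrected statistic such as Cohen's or Fleiss' kappa, since two coin-flipping annotators already agree 50 percent of the time on a binary task.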
What This Means for Supervised Models
If humans agree only 89 percent of the time, a supervised model trained on those labels cannot meaningfully exceed 89 percent accuracy when evaluated against them. It can learn to replicate the majority opinion, but the 11 percent of cases where labellers disagree represent irreducible noise in the training signal. Any accuracy gains beyond 89 percent on a held-out test set are more likely overfitting to the idiosyncratic preferences of specific labellers than learning genuine compliance principles.
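The ceiling can be demonstrated with a small simulation. Even a perfect oracle that always knows the true quality of a package scores only about 89 percent when its predictions are checked against labels that carry 11 percent disagreement noise (the noise rate and uniform-flip model here are simplifying assumptions):

```python
import random

random.seed(0)

N = 100_000
NOISE = 0.11  # labellers disagree on roughly 11% of packages

# True (unobservable) package quality: adequate or not.
true_quality = [random.random() < 0.5 for _ in range(N)]

# Observed training labels: a labeller flips the true call ~11% of the time.
labels = [q != (random.random() < NOISE) for q in true_quality]

# An oracle that always predicts true quality still measures ~89% accurate
# against the noisy labels -- no model trained on them can reliably do better.
oracle_accuracy = sum(p == l for p, l in zip(true_quality, labels)) / N
print(f"{oracle_accuracy:.3f}")
```

Any model that scores above this on held-out labelled data is, by construction, matching the labellers' mistakes rather than the underlying quality.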
This is the labeller ceiling, and it affects every label-based compliance ML system in production today. Whether the model uses gradient-boosted trees, LSTMs, or transformer architectures, the training signal is bottlenecked by human agreement rates.
The OGEE Framework: Training on Outcomes
AICIL takes a fundamentally different approach through the OGEE (Observable, Generalizable, Experiential Elements) framework developed by DeepHavn. Instead of training on human labels, we train on outcomes -- the actual results of compliance decisions as they propagate through the correspondent banking network.
When a compliance package is submitted to a correspondent bank, one of four things happens: the payment is approved, delayed for additional information, returned, or flagged for regulatory review. These outcomes are objective, unambiguous, and directly tied to the quality of the compliance package. A package that gets approved by five different correspondent banks across three jurisdictions is objectively adequate -- no human labeller needed.
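The four outcomes above can be turned into a training label without any human annotation. The following sketch shows one way to do it; the thresholds (`min_banks`, `approval_rate`) and the function name are illustrative assumptions, not AICIL's actual rules:

```python
from collections import Counter

# The four network outcomes named in the text.
APPROVED, DELAYED, RETURNED, FLAGGED = "approved", "delayed", "returned", "flagged"

def outcome_label(outcomes, min_banks=3, approval_rate=0.8):
    """Derive a training label from correspondent-bank outcomes.

    A package is labelled adequate only when enough independent banks
    have processed it and a high share approved it outright. Thresholds
    are illustrative placeholders.
    """
    if len(outcomes) < min_banks:
        return None  # not enough network signal yet; leave unlabelled
    counts = Counter(outcomes)
    # A return or regulatory flag anywhere is treated as disqualifying.
    if counts[RETURNED] or counts[FLAGGED]:
        return "inadequate"
    share_approved = counts[APPROVED] / len(outcomes)
    return "adequate" if share_approved >= approval_rate else "inadequate"

print(outcome_label([APPROVED] * 5))                  # adequate
print(outcome_label([APPROVED, RETURNED, APPROVED]))  # inadequate
print(outcome_label([APPROVED, APPROVED]))            # None
```

The design choice worth noting is the `None` case: packages the network has barely seen are withheld from training rather than guessed at, so the label set only contains decisions the network has actually adjudicated.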
Breaking Through the Ceiling
Outcome-based training sidesteps the labeller ceiling entirely. The training signal is not "did a human think this was adequate?" but "did the correspondent banking network accept this?" The distinction matters because the network aggregates the judgement of thousands of compliance officers across hundreds of institutions, producing a signal that is far more robust than any individual labeller.
Our preliminary models trained on outcome data achieve 94 percent accuracy in predicting package acceptance, with a target of 97 to 99 percent as the training corpus grows. The key insight is that compliance expertise is tacit knowledge -- it lives in the collective behavior of the banking network, not in the explicit labels of individual analysts. OGEE captures that tacit knowledge by observing outcomes at scale.
Implications for the Industry
The labeller ceiling has implications beyond AICIL. Any financial institution relying on label-based ML for compliance decisions should understand that their models have a hard accuracy cap determined by human agreement rates. The path to higher accuracy is not better models or more data -- it is better training signals. Outcome-based training is one path. The industry needs to explore others.