The Science of Calibrated Trust

The Overconfidence Problem

Large language models are systematically overconfident. When a model says 95% confidence, the answer is right about half the time. A gap that size breaks every system downstream that trusts those numbers.

The gap is worse than people think. GPT-4o mini self-reports 90-100% confidence on entity extractions where actual accuracy sits around 35%. Single judge models assign confidence 1.0 to 86% of entities they extract, but only 40% of those entities are correct. The error is structural, not noise.

Published guard models are no better. Llama Guard, Shield Guard, and Wild Guard, the standard open-source safety classifiers, have an Expected Calibration Error (ECE) of 14-28% [2]. ECE measures the average gap between predicted confidence and observed accuracy: an ECE of 20% means the model's confidence is, on average, 20 points away from reality. With $n$ predictions grouped into $B$ confidence bins, where $\text{acc}(B_b)$ is the accuracy and $\text{conf}(B_b)$ the mean confidence within bin $B_b$:

$$\text{ECE} = \sum_{b=1}^{B} \frac{|B_b|}{n} |\text{acc}(B_b) - \text{conf}(B_b)|$$
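The ECE sum can be sketched in a few lines of Python. This is a minimal illustration that assumes equal-width bins over [0, 1]; the function name, the bin count, and the toy data are all made up for the example.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins over [0, 1] (an assumed binning)."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing to the sum
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy data: ten predictions at 0.95 confidence, only half of them correct.
confs = [0.95] * 10
labels = [True] * 5 + [False] * 5
print(round(expected_calibration_error(confs, labels), 2))  # 0.45
```

The toy case mirrors the overconfidence pattern described earlier: stated confidence of 95% against 50% accuracy gives an ECE of 0.45.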

The standard remedy is a threshold: accept outputs above some confidence level and send the rest to review. But if your model's confidence scores bear no relationship to correctness, you are filtering on noise. You will miss errors the model is confidently wrong about and flag correct outputs it happens to be uncertain about. The threshold gives the appearance of quality control while providing none.

There is also a selection bias that compounds the problem. Extraction models perform implicit top-K filtering before confidence is ever estimated: they only output entities they believe are correct. The confidence model never sees clearly wrong extractions. It evaluates a pre-filtered distribution where most inputs are plausible. A perfect extractor paired with a perfect confidence model would just output "correct" for everything, trivially calibrated but completely uninformative. The practical consequence: precision can be improved after the fact through calibration, but recall cannot. If the extraction model suppresses uncertain outputs before they reach the confidence estimator, those potential corrections are lost permanently.

RLHF makes this worse. The GPT-4 technical report documents calibration degradation after human feedback training [5]. RLHF optimizes for outputs that sound confident and authoritative, because human annotators reward those qualities. The model learns to express certainty as rhetoric, decoupled from actual accuracy. That is RLHF working as designed: the direct consequence of optimizing for human preference rather than calibrated truthfulness.

Three Methods for Confidence Estimation

The methods divide by access level: how much of the model's internals you can see. Each level trades deployment complexity for calibration quality.

Black-box: LLM-as-Judge Ensemble

Multiple diverse small LLMs vote independently on whether an extraction is correct. Confidence comes from the frequentist average of their votes. No training required. You use the models off the shelf [10]. The key is ensemble diversity: models with different architectures, training data, and failure modes. With a diverse ensemble, ECE drops from 0.6 to 0.4, and AUC improves from 58% to 68%. The models disagree on different things, and disagreement is a genuine signal of uncertainty [6].
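As a rough sketch of the vote-averaging step, confidence can be computed as the fraction of judges that voted "correct". The judge names and votes below are hypothetical placeholders; calling real models is left out.

```python
def ensemble_confidence(votes):
    """Fraction of judges that voted 'correct' (True) for one extraction."""
    if not votes:
        raise ValueError("need at least one vote")
    return sum(1 for v in votes if v) / len(votes)


# Hypothetical votes from five different judge models on a single extraction.
votes = {
    "judge_a": True,
    "judge_b": True,
    "judge_c": False,
    "judge_d": True,
    "judge_e": False,
}
confidence = ensemble_confidence(list(votes.values()))
print(confidence)  # 0.6
```

A split vote like this one is exactly the disagreement signal the ensemble is meant to expose; a single judge would have reported only its own yes or no.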

Gray-box: Feature-Based Model

Construct a feature vector from everything the model exposes short of its internal weights: token log-probabilities, softmax entropy, decoding statistics (beam width, number of tokens considered). A lightweight regression model, logistic regression or a small gradient-boosted tree, predicts correctness from these features. One practical trick: GLiNER's threshold-zero hack produces continuous confidence scores from models that would otherwise output binary judgments. This level requires API access to log-probs but no access to hidden states.
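One way the feature vector might be assembled from log-probabilities is sketched below. The input layout (a list of per-token log-probs plus a list of top-k alternatives per position) is an assumption for illustration, not any particular API's response format, and the entropy is only an approximation because it is computed over the renormalized top-k alternatives rather than the full vocabulary.

```python
import math


def logprob_features(token_logprobs, top_logprobs):
    """Summarize one extracted span as gray-box features.

    token_logprobs: log-prob of each generated token in the span.
    top_logprobs:   per position, log-probs of the top-k candidate tokens.
    """
    entropies = []
    for alternatives in top_logprobs:
        probs = [math.exp(lp) for lp in alternatives]
        total = sum(probs)
        probs = [p / total for p in probs]  # renormalize over the top-k only
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return {
        "mean_logprob": sum(token_logprobs) / len(token_logprobs),
        "min_logprob": min(token_logprobs),
        "mean_entropy": sum(entropies) / len(entropies),
    }


# Two made-up spans: one decoded confidently, one with near-tied alternatives.
confident = logprob_features([-0.01, -0.02], [[-0.01, -5.0], [-0.02, -4.5]])
uncertain = logprob_features([-0.7, -1.1], [[-0.7, -0.7], [-1.1, -0.9]])
print(confident)
print(uncertain)
```

The resulting dictionaries would be the inputs to the logistic regression or gradient-boosted tree; fitting that model is omitted here.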

White-box: Linear Probes

Train a linear probe on the LLM's hidden representations, roughly a 4000-dimensional vector per token at each layer [4]. On MMLU Pro, probes achieve ECE of approximately 0.02. That is near-perfect calibration. The key discovery: prompting the LLM to verbalize its uncertainty ("How confident are you?") improves the probe's performance even though the verbalized number is not used as a feature. The act of introspecting changes the hidden state in ways that make the probe's job easier. This is "latent feature engineering": you modify the model's internal representations by changing the prompt, then read off the improved signal.

Knowledge Graph Uncertainty Propagation

Extraction produces entities and relations, each with a confidence score. These scores are not independent. A relation between two entities depends on both entities being correct. Treating the two entity judgments as independent of each other, the joint probability that a relation $r$ and both of its entities are correct follows a simple chain: $P(r, e_i, e_j) = P(r \mid e_i, e_j) \cdot P(e_i) \cdot P(e_j)$. If either entity is wrong, the relation built on them is almost certainly wrong too.
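A small numeric illustration of the chain, with made-up probabilities and a hypothetical helper name:

```python
def relation_joint_probability(p_rel_given_entities, p_head, p_tail):
    """P(relation and both entities correct), assuming the two entity
    judgments are independent of each other."""
    return p_rel_given_entities * p_head * p_tail


# A relation the model is 90% sure of, built on entities at 80% and 70%.
print(round(relation_joint_probability(0.9, 0.8, 0.7), 3))  # 0.504
```

Even with a confident relation score, uncertainty in the two endpoints pulls the joint probability down to roughly a coin flip.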

Factor graphs [8][9] provide the framework for modeling these dependencies. A factor graph is a bipartite graph connecting variable nodes (entities, relations) to factor nodes (the functions defining how variables depend on each other). It generalizes Bayesian networks to handle arbitrary dependency structures without requiring a directed acyclic graph.

Our extraction confidence work uses evaluation metrics (CEAF [7]) and datasets from the AdaIE framework [1], which spans 15 datasets, 497 entity types, and 747 relation types, as the testing ground for calibration methods.

Why extraction is different from other NLP tasks

The obvious question: how is confidence for information extraction different from translation, sentiment analysis, or any other NLP task? The answer is the topology of dependencies. In question answering, questions are independent. In translation, dependencies are autoregressive. In information extraction, the dependency structure is a graph. Relations depend on entities. Entities depend on each other through relations. Corrections propagate non-locally through the graph in ways that sequential dependencies never do.

Entity types function as predicates over the universe of extracted mentions. "Software engineer" is a function: pass in a text span, it returns a probability. This connects extraction to Laplace's framework where probability extends logic. The predicate returns not true or false but a value between 0 and 1. Schema-constrained extraction makes this tractable: with known entity types, extraction decomposes into individual classification decisions (does this span have type X?) rather than unconstrained generation.

Belief propagation

When a human corrects one entity, confirming it or marking it wrong, messages propagate through the factor graph to update neighboring beliefs. Confirm entity A and every relation involving it gets a confidence boost. Reject entity A and every relation involving it drops. The update is local: you only re-evaluate the Markov blanket of the corrected node, not the entire graph.
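The effect of a single correction can be illustrated on the smallest possible graph: two entities and one relation. The sketch below uses brute-force enumeration rather than message passing (on a tree-structured graph this small, sum-product would give the same marginals), and every probability in it is an invented placeholder.

```python
from itertools import product

P_A, P_B = 0.8, 0.7             # prior beliefs that each entity is correct
P_R_BOTH, P_R_ELSE = 0.9, 0.02  # P(relation correct | both entities correct / otherwise)


def joint(a, b, r):
    """Joint probability of one full assignment (a, b, r)."""
    p = (P_A if a else 1 - P_A) * (P_B if b else 1 - P_B)
    p_r = P_R_BOTH if (a and b) else P_R_ELSE
    return p * (p_r if r else 1 - p_r)


def relation_belief(evidence=None):
    """P(relation correct | evidence); evidence maps 'a' or 'b' to True/False."""
    evidence = evidence or {}
    num = den = 0.0
    for a, b, r in product([True, False], repeat=3):
        state = {"a": a, "b": b}
        if any(state[k] != v for k, v in evidence.items()):
            continue  # assignment contradicts the human correction
        p = joint(a, b, r)
        den += p
        if r:
            num += p
    return num / den


print(round(relation_belief(), 3))              # 0.513  before any correction
print(round(relation_belief({"a": True}), 3))   # 0.636  reviewer confirms entity A
print(round(relation_belief({"a": False}), 3))  # 0.02   reviewer rejects entity A
```

Confirming the entity lifts the relation's belief and rejecting it collapses the belief, which is the boost-or-drop behavior described above.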

Active correction ordering

The order in which you correct entities matters. Correcting a highly-connected entity first propagates information to more neighbors. Same insight as active learning and Bayesian optimization: choose the query that maximizes expected information gain across the graph, not just locally. A single well-chosen correction can shift confidence scores for dozens of downstream relations.

This is an open research gap: to our knowledge, no existing work combines probabilistic knowledge graphs with Graph RAG [11]. The RAG community builds knowledge graphs but treats them as deterministic. The probabilistic graphical models community handles uncertainty but does not connect to retrieval pipelines. The intersection is largely unexplored.

Temporal knowledge graphs add another dimension. When documents are versioned (contract drafts, regulatory filings, policy updates), the knowledge graph changes over time. Tracking how entities and relations evolve across versions, which appeared, which were modified, which relationships were severed, turns the graph into a historical record. For merger agreements, you could ask: "From version 1 to the final signed agreement, what changed in the indemnification structure?" The shape of the graph across drafts becomes a first-class queryable artifact.

Evaluation methodology

Calibration quality cannot be assessed with a single metric. Three are required, each measuring a different property:

Discrimination (AUROC): Can the confidence score separate correct from incorrect outputs? A model that assigns higher confidence to correct outputs has good discrimination, even if the absolute scores are poorly calibrated. AUROC measures this ranking quality.

Calibration (ECE and reliability diagrams): Does 70% confidence actually mean 70% accuracy? ECE computes the weighted average gap between predicted confidence and observed accuracy across binned confidence ranges. But ECE alone is misleading: a model that always outputs the dataset's base accuracy achieves zero ECE while being completely uninformative. Reliability diagrams (predicted confidence on X-axis, actual accuracy on Y-axis, target = diagonal) reveal the full picture.

Selective prediction (risk-coverage curves): Does discarding low-confidence items improve final accuracy? Given a budget of N human reviews, does prioritizing the lowest-confidence items beat random selection? This directly simulates the production use case where confidence scores route work to human reviewers.

The Brier score is not one of the three. It penalizes uncertainty itself (it reaches zero only with perfect 0/1 predictions), which makes it inappropriate for tasks with irreducible ambiguity. Binned ECE averages out that inherent randomness, and AUROC is insensitive to calibration but captures ranking quality. AUROC, ECE with reliability diagrams, and risk-coverage curves together give a complete picture.
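The discrimination and selective-prediction metrics can be sketched without any libraries. The code below is an illustrative implementation on toy data: AUROC is computed by direct pairwise comparison (fine for small samples, quadratic in general), and the risk-coverage curve is traced by keeping items from highest to lowest confidence.

```python
def auroc(confidences, correct):
    """Probability that a random correct item is scored above a random incorrect one."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect items")
    # Ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def risk_coverage(confidences, correct):
    """(coverage, error rate) pairs as items are kept in order of confidence."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    points, errors = [], 0
    for kept, (_, ok) in enumerate(ranked, start=1):
        errors += 0 if ok else 1
        points.append((kept / len(ranked), errors / kept))
    return points


# Toy data: six extractions with their confidence and correctness.
confs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5]
labels = [True, True, False, True, False, False]

print(round(auroc(confs, labels), 3))  # 0.889
for coverage, risk in risk_coverage(confs, labels):
    print(f"coverage={coverage:.2f} risk={risk:.2f}")
```

Reading the curve from the bottom up shows what discarding low-confidence items buys: at full coverage the error rate is 0.50, while the top third of items has no errors at all.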

Active Learning and Human-in-the-Loop

You cannot query a human for every extraction. Humans are expensive and slow. The problem is a multi-objective contextual bandit: minimize calibration error while minimizing the number of human queries. These objectives conflict. More human feedback generally improves calibration, but you have a fixed budget.

The acquisition function is expected information gain. For each candidate entity, estimate how much total uncertainty reduction you would get from a human label. Entities with confidence near 0.5 are not automatically the best candidates. An entity at 0.5 confidence with 50 downstream relations is worth more than an entity at 0.5 confidence with zero downstream relations. The graph structure determines query value.

If you have a human budget of 10 minutes, you pick the 20 things you are most uncertain of. But uncertainty here means more than the confidence score: it is the confidence score weighted by downstream impact in the knowledge graph.
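A crude stand-in for this acquisition function is sketched below. It is a heuristic proxy, not a true expected-information-gain computation: the score is simply the entity's binary entropy multiplied by one plus its number of downstream relations. The scoring rule, field names, and candidates are all assumptions made for the example.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no belief held with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def query_value(confidence, num_downstream_relations):
    # Hypothetical score: the entity's own uncertainty, counted once for the
    # entity itself and once for every relation that depends on it.
    return binary_entropy(confidence) * (1 + num_downstream_relations)


def pick_queries(candidates, budget):
    """Return the ids of the `budget` highest-value entities to send to a human."""
    ranked = sorted(
        candidates,
        key=lambda c: query_value(c["confidence"], c["degree"]),
        reverse=True,
    )
    return [c["id"] for c in ranked[:budget]]


candidates = [
    {"id": "leaf", "confidence": 0.5, "degree": 0},
    {"id": "hub", "confidence": 0.5, "degree": 50},
    {"id": "confident_hub", "confidence": 0.95, "degree": 10},
    {"id": "shaky", "confidence": 0.6, "degree": 3},
]
print(pick_queries(candidates, budget=2))  # ['hub', 'shaky']
```

The isolated entity at 0.5 confidence ranks last despite being maximally uncertain, because a label for it changes nothing else in the graph.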

Even at high confidence, the system should occasionally verify for systematic drift. A model that was 95% accurate last month may have degraded. Periodic spot-checks on high-confidence outputs detect this before it compounds. This is the exploration-exploitation tradeoff: mostly exploit your confidence estimates to focus human effort on uncertain items, but explore enough to detect when the estimates themselves have gone stale.

The Bayesian extension replaces point estimates with full distributions. Instead of "this entity has confidence 0.74," the system maintains a $\text{Beta}(14, 5)$ distribution with mean 0.74 and 95% credible interval [0.55, 0.89]. A distribution captures both the model's best guess and how much evidence backs it. An entity with confidence 0.74 from three observations is genuinely less certain than one with confidence 0.74 from three hundred observations, even though the point estimates are identical.
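The point about evidence can be made concrete with the closed-form mean and standard deviation of a Beta distribution. The sketch compares the $\text{Beta}(14, 5)$ from the text with a hypothetical $\text{Beta}(140, 50)$ that has the same mean but ten times the evidence.

```python
import math


def beta_mean_std(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


for a, b in [(14, 5), (140, 50)]:
    mean, std = beta_mean_std(a, b)
    print(f"Beta({a}, {b}): mean={mean:.3f} std={std:.3f}")
# Beta(14, 5): mean=0.737 std=0.098
# Beta(140, 50): mean=0.737 std=0.032
```

Both beliefs report the same best guess, but the second is about three times narrower, which is the information a bare point estimate throws away.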

The Safety Application

The same calibration architecture applies directly to agent safety. "Is this entity correct?" and "Is this action safe?" have identical mathematical structure. Both require a classifier, both suffer from overconfidence, both benefit from ensembles, and both need calibrated confidence to make threshold decisions meaningful.

Multi-specialist ensemble

Instead of a single monolithic safety model, the system deploys per-category specialists: one for privacy violations, one for financial risk, one for security threats, one for ethical concerns. Each specialist is a small model fine-tuned on category-specific data. This mirrors the black-box ensemble from confidence estimation, but with domain specialization.

Specialist diversity improved detection recall from 63% (single general judge) to 73% (four specialists). The improvement comes from more perspectives, not more parameters. A general safety model treats "agent emails SSH keys to an external address" and "agent provides unsolicited investment advice" as the same class of problem. A specialist architecture treats them as fundamentally different failure modes requiring different training data, evaluation criteria, and confidence thresholds.

Turn off individual specialists and you see their value directly. Without the finance specialist, unauthorized purchase actions pass through undetected. Without the privacy specialist, credential exfiltration slips by. Each specialist catches failures the others miss, because each is trained on a different slice of the risk taxonomy.

Privacy Specialist

Detects PII leakage, data exposure, consent violations. Trained on privacy-specific incident data. Flags outputs that contain or could reveal personal information. Catches credential and personal data exfiltration that general models miss.

Finance Specialist

Detects unauthorized transactions, regulatory violations, financial advice without disclaimers. Calibrated against financial compliance benchmarks. Catches subtle regulatory violations like investment guidance in customer-facing contexts.

Security Specialist

Detects prompt injection, code execution risks, credential exposure, system access attempts. Trained on adversarial attack patterns and the R-Judge benchmark's 27 risk scenarios across software, IoT, web, finance, and program categories [3].

Ethics Specialist

Detects harmful content generation, bias amplification, manipulation attempts. Calibrated on harm taxonomy benchmarks. Handles the deployment safety category where an agent's action could cause irreversible real-world consequences.

Benchmark results

On the R-Judge benchmark [3], approximately 500 human-annotated instances across five categories (finance, IoT, software, web, program) with 27 distinct risk scenarios, the multi-specialist ensemble achieves 66% recall for unsafe actions and 62% specificity, with ECE of 9%. This matches GPT-4's recall on the same benchmark using only small open-weight models. The 9% ECE means confidence scores are off by about 9 percentage points on average across confidence bins, compared to 14-28% for published guard models [2]. Because ECE is an average, individual bins can still deviate by more.

The ensemble used four models (Llama, Qwen, Gemma, Mistral) with two prompt variants each, producing eight independent judgments per input. Confidence scores exist only in the 0.5-1.0 range by construction, since majority vote is always at least 50%. Variation comes from model architecture, prompt formulation, and chain-of-thought vs. direct prompting.

LoRA adapter distillation

Cloud-based specialist judges are expensive to run in production. The distillation pipeline compresses them into LoRA adapters on a single small backbone model (Qwen 0.6B or Llama 1B). Each adapter is 5-25MB. Multiple adapters share one base deployment, making per-specialist inference cheap.

The pipeline has four phases: (1) synthetic data generation from 10-50 seed examples per category using frontier models, (2) teacher evaluation to validate the task is solvable at the target accuracy, (3) LoRA fine-tuning on the small backbone, (4) integration into the gateway. This pipeline is task-agnostic. The same four phases apply whether the target is safety detection, entity extraction, PII redaction, or any classification task.

Published benchmarks from comparable distillation approaches (Distill Labs, Berlin) show that 1-4B parameter models can match 120B model accuracy on targeted classification tasks. Their PII redaction benchmark shows a 1B model matching GPT-4. Turnaround is approximately 8 hours at approximately 100 USD per distillation.

Conservative aggregation

The aggregation rule is deliberately asymmetric. If any specialist flags an action, the action is flagged. The highest confidence among flagging specialists is reported. If all specialists say safe, the minimum confidence is reported. This is conservative by design. False positives (blocking safe actions) are recoverable: the user can override or the system can escalate. False negatives (allowing unsafe actions) are not recoverable. The cost function is asymmetric, so the aggregation must be.
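The rule translates almost directly into code. In this sketch each specialist verdict is a `(flagged, confidence)` pair, which is an assumed representation chosen for the example.

```python
def aggregate(verdicts):
    """Combine specialist verdicts conservatively.

    verdicts: list of (flagged: bool, confidence: float), one per specialist.
    Returns (decision, confidence).
    """
    if not verdicts:
        raise ValueError("need at least one specialist verdict")
    flagged = [conf for is_flagged, conf in verdicts if is_flagged]
    if flagged:
        # Any single flag is enough; report the most confident flag.
        return "flagged", max(flagged)
    # Everyone says safe; report the least confident "safe".
    return "safe", min(conf for _, conf in verdicts)


print(aggregate([(False, 0.9), (True, 0.7), (False, 0.95), (False, 0.8)]))
# ('flagged', 0.7)
print(aggregate([(False, 0.9), (False, 0.8), (False, 0.95)]))
# ('safe', 0.8)
```

In the first call three specialists out of four say safe, yet the action is still flagged: one dissenting specialist is sufficient by design.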

When multiple specialists flag the same action, combined unsafety scores climb rapidly via the product-of-complements formula: $P(\text{unsafe}) = 1 - \prod_i (1 - P(\text{unsafe}_i))$. Individual scores of 0.8, 0.9, and 0.85 yield a combined score of 0.997. The independence assumption is imperfect (deployment safety and secret leakage are correlated), but it works in practice for the same reason naive Bayes works for spam: the model's errors are tolerable because the aggregation is conservative.
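The worked numbers can be checked in a couple of lines (the function name is illustrative; `math.prod` requires Python 3.8 or later):

```python
import math


def combined_unsafe(scores):
    """Product-of-complements combination, assuming independent specialists."""
    return 1.0 - math.prod(1.0 - s for s in scores)


print(round(combined_unsafe([0.8, 0.9, 0.85]), 3))  # 0.997
```

The complements multiply to 0.2 × 0.1 × 0.15 = 0.003, so even three moderately confident flags leave very little probability that the action is safe.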

For more principled handling of policy dependencies, you can apply the chain rule over a directed acyclic graph of policy relationships: $P(A, B \mid X) = P(A \mid X) \cdot P(B \mid A, X)$. An LLM can generate the dependency graph given a set of policies, enabling joint probability computation that accounts for correlations. This is the standard approach in probabilistic graphical models [8], but is unnecessary for initial deployment where naive independence suffices.

The mixture-of-opinions architecture

A related approach, inspired by Sanskrit epistemological traditions, organizes reasoning dimensions into specialized perspectives. The concept draws from a framework of approximately 208 reasoning dimensions derived from classical Indian schools of thought. For a given domain or task, a subset of these dimensions (typically 8-12, grouped into clusters) is compiled into specialized critics. Each dimension acts as a thinking hat: fallacy detection, irreversibility checks, sequencing analysis, causal reasoning, temporal consistency, logical coherence.

The architectural distinction from the ensemble is how the critics communicate. In the standard ensemble, each judge produces a text output and votes are aggregated. In the mixture-of-opinions architecture, critics share an embedding space. Queries flow to all critics as vectors in a common latent space, critics debate at the embedding level rather than through text, and synthesis happens in high-dimensional space before decoding back. This reduces context loss compared to text-based inter-model communication, where nuance is compressed through the tokenizer at every exchange.

The multi-stage synthesis: parallel debate (all critics analyze simultaneously), reconciliation (common conclusions drawn), contradiction resolution (disagreements adjudicated), then final answer generation. The underlying models can be open-source LLMs with LoRA adapters for each dimension, making the approach feasible on commodity hardware.

The critical difference: the mixture-of-opinions architecture does not produce calibrated confidence scores. It produces better answers, but cannot tell you how confident it is in those answers. The ensemble approach produces calibrated confidence as a first-class output. Combining the two, diverse perspectives communicating in latent space with calibrated uncertainty estimation on top, is an open problem worth solving.

Per-organization calibration

Safety thresholds are not universal. What counts as unsafe depends on the organization, the domain, the regulatory environment, and the specific workflow. A marketing agency and a healthcare provider have fundamentally different risk profiles. Calibration must be per-organization.

Every approve/deny decision by a human reviewer becomes training data for that organization's calibration model. Over time, the system learns their specific risk tolerance, common edge cases, and policy interpretations. The longer an organization uses the system, the better it calibrates to their needs. The data dependency that initially looks like a weakness (you need organization-specific data to calibrate) turns into a lock-in effect. No competitor can replicate another organization's accumulated calibration data.

Information Geometry and Fine-Tuning

Model knowledge can be understood as a high-dimensional manifold in parameter space. Points on this manifold represent probability distributions over outputs. The geometry of this manifold, its curvature and geodesics, determines how fine-tuning moves through distribution space.

Two kinds of geodesics matter. E-geodesics (exponential) are straight lines in the natural parameters, which amounts to interpolating log-probabilities. M-geodesics (mixture) are straight lines in the mean parameters, which amounts to interpolating the probabilities themselves. The choice between them determines what is conserved during fine-tuning and what drifts. Standard gradient descent follows neither. It follows Euclidean straight lines in parameter space, which are generally curved paths in distribution space.

Natural gradient descent operates in Fisher information space rather than Euclidean parameter space. The Fisher information matrix measures how sensitive the model's output distribution is to each parameter. Moving in the natural gradient direction makes equal-sized changes in distribution space, regardless of how the parameters happen to be scaled. This connects directly to catastrophic forgetting: standard fine-tuning can destroy previously learned distributions because Euclidean steps in parameter space correspond to wildly unequal steps in distribution space.
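The mismatch between parameter space and distribution space shows up even in a one-parameter model. The sketch below is an illustration on a Bernoulli distribution with a logit parameter $\theta$, not a natural gradient optimizer: it measures the KL divergence caused by the same Euclidean step at two different points, then repeats the measurement with steps rescaled by the Fisher information $F(\theta) = p(1 - p)$ using the second-order approximation $\text{KL} \approx \tfrac{1}{2} F \delta^2$. The step size and KL budget are arbitrary choices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def fisher(theta):
    """Fisher information of a Bernoulli with logit parameter theta."""
    p = sigmoid(theta)
    return p * (1 - p)


def kl_after_step(theta, delta):
    return bernoulli_kl(sigmoid(theta), sigmoid(theta + delta))


euclidean_step = 0.5  # identical step in parameter space at both points
kl_budget = 0.001     # target change in distribution space

for theta in (0.0, 4.0):
    scaled_step = math.sqrt(2 * kl_budget / fisher(theta))
    print(
        f"theta={theta}: "
        f"KL(euclidean step)={kl_after_step(theta, euclidean_step):.5f}  "
        f"KL(Fisher-scaled step)={kl_after_step(theta, scaled_step):.5f}"
    )
```

With the fixed step, the change in the output distribution differs by more than an order of magnitude between the two points; with Fisher-scaled steps it stays close to the chosen budget at both.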

If optimization followed the geodesics of this manifold instead of straight lines in parameter space, the gain would carry over to everything built on top of it. Every loss function, every optimizer, every training schedule is implicitly navigating this manifold. Get the navigation right and you converge faster, forget less, and waste less data.

Programs as Weights

Natural language instructions can be compiled into LoRA adapters: small weight modifications (approximately 5-25MB) that encode behavioral programs into a base model. The base model can be as small as GPT-2 124M or Qwen 0.6B. These adapters run in the browser via WebAssembly with no server round-trip.

This is policy enforcement at the weight level. Instead of prompting a model with safety rules at inference time (which can be bypassed), you compile the rules into the model's weights. The policy is part of the model itself, not an input that can be jailbroken.

The gap is calibration. Programs-as-weights can enforce behaviors but cannot currently report how confident they are in their enforcement. A compiled policy adapter can block an action but cannot say "I am 73% sure this action violates the policy." Connecting the calibration methods described above to compiled policy adapters is an open problem. Solve it and you get tamper-resistant policies that can also tell you how sure they are.

References

[1] Xu, J. et al. "AdaIE: Guidelines for Universal Information Extraction." Anonymous ACL submission, 2026. 15 datasets, 497 entity types, 747 relation types, CEAF-based evaluation metrics.
[2] "On the Calibration of LLM-based Guard Models for Content Moderation." 2026. Llama Guard, Shield Guard, Wild Guard ECE of 14-28%.
[3] R-Judge benchmark. ~500 human-annotated instances, 5 categories (finance, IoT, software, web, program), 27 risk scenarios.
[4] "Calibrating LLM Judges: A Linear Probe for Fast Reliable Uncertainty Estimation." Linear probe achieving ~0.02 ECE on MMLU Pro.
[5] OpenAI. "GPT-4 Technical Report." 2023. Documents calibration degradation after RLHF.
[6] Schellhammer & Blazer, Vector Institute. "Asymmetric Viewless Sidekicks Improve Uncertainty."
[7] Luo, X. "On Coreference Resolution Performance Metrics." 2005. CEAF metric used in AdaIE evaluation.
[8] Bishop, C. Pattern Recognition and Machine Learning. Chapter 8: Graphical Models. Factor graphs, belief propagation.
[9] Frey, B. et al. "Factor graphs and the sum-product algorithm." Co-inventor of factor graphs, founder of Deep Genomics.
[10] "High Fidelity Information Extraction from Black Box LLMs" (Safe Passage). Three-step pipeline with fuzzy alignment.
[11] "Towards Trustworthy Knowledge Graph Reasoning." Conformal prediction for knowledge graph retrieval paths.