Confidence Thresholds Explained
What AI confidence scores mean, how to set thresholds, and how to calibrate them over time.
Most AI systems attach a confidence score to their predictions. Most firms ignore these scores or set arbitrary thresholds without understanding the consequences. This creates two problems: you either automate decisions the AI isn't confident about (causing errors), or you send too many decisions to human review (wasting the automation investment).
Confidence thresholds determine which AI outputs get automated, which get reviewed, and which get rejected. Set them wrong and you'll either drown your team in false positives or miss critical errors. Set them right and you'll automate 70-80% of routine work while catching edge cases before they become problems.
What Confidence Scores Actually Measure
A confidence score is the AI model's probability estimate that its output is correct. A score of 0.87 means the model believes there's an 87% chance its classification, extraction, or prediction is accurate.
These scores come from the model's internal probability distribution. For classification tasks (like document routing or contract clause identification), the score represents the probability assigned to the highest-ranked category. For extraction tasks (like pulling dates or dollar amounts from invoices), it reflects the model's certainty about the extracted value.
Critical point: Confidence scores are calibrated differently across models. A 0.90 from GPT-4 doesn't mean the same thing as a 0.90 from a custom-trained document classifier. You must calibrate thresholds separately for each model and use case.
Confidence scores answer one question: "How much should I trust this output?" Without them, you're flying blind. With them, you can build intelligent routing rules that balance automation rate against error rate.
The Accuracy-Coverage Tradeoff
Every confidence threshold creates a tradeoff between two metrics:
Accuracy: The percentage of accepted AI outputs that are actually correct.
Coverage: The percentage of total cases the AI handles without human intervention.
Raise your threshold and accuracy goes up while coverage drops. Lower it and coverage increases while accuracy falls. There's no free lunch.
Here's what this looks like in practice for a contract review system:
| Threshold | Accuracy | Coverage | What This Means |
|-----------|----------|----------|-----------------|
| 0.95 | 98% | 45% | AI handles 45% of contracts with 98% accuracy. 55% go to human review. |
| 0.90 | 95% | 68% | AI handles 68% of contracts with 95% accuracy. 32% go to human review. |
| 0.85 | 91% | 82% | AI handles 82% of contracts with 91% accuracy. 18% go to human review. |
| 0.80 | 86% | 91% | AI handles 91% of contracts with 86% accuracy. 9% go to human review. |
The right threshold depends on the cost of errors versus the cost of human review. For routine NDA reviews, 0.85 might be perfect. For M&A due diligence, you want 0.95 or higher.
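The tradeoff can be measured directly from a labeled sample. A minimal sketch, assuming `results` is a list of `(confidence, was_correct)` pairs produced by human review of AI outputs (all names here are illustrative):

```python
def tradeoff(results, thresholds):
    """Accuracy and coverage at each candidate auto-approve threshold."""
    rows = []
    for t in thresholds:
        accepted = [correct for conf, correct in results if conf >= t]
        coverage = len(accepted) / len(results)
        accuracy = sum(accepted) / len(accepted) if accepted else 0.0
        rows.append((t, accuracy, coverage))
    return rows

# Toy sample; a real calibration set should have 500+ reviewed cases.
sample = [(0.97, True), (0.93, True), (0.91, False), (0.88, True),
          (0.84, True), (0.82, False), (0.78, True), (0.71, False)]
for t, acc, cov in tradeoff(sample, [0.95, 0.90, 0.85, 0.80]):
    print(f"threshold {t:.2f}: accuracy {acc:.0%}, coverage {cov:.0%}")
```

Sweeping candidate thresholds like this makes the "no free lunch" visible: each row trades accuracy against coverage, and you pick the row that matches your error tolerance.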
Setting Your Initial Thresholds
Start with a three-tier system: auto-approve, human review, and auto-reject.
Step 1: Calculate your error cost
What does a missed error cost you? For invoice processing, maybe $500 in payment disputes. For legal document review, potentially $50,000+ in liability. For client intake screening, a lost client relationship.
Divide your error cost by your average transaction value. If you process $2,000 invoices and errors cost $500 to fix, each error costs 25% of an invoice's value. At 96% accuracy, the expected error cost is 4% error rate × $500 = $20 per invoice, or 1% of invoice value.
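A worked version of this arithmetic, using the invoice numbers above:

```python
# Illustrative numbers from the invoice example in the text.
transaction_value = 2000   # average invoice value, $
error_cost = 500           # cost to remediate one missed error, $
error_rate = 0.04          # error rate at a 96%-accurate threshold

error_tolerance = error_cost / transaction_value    # each error costs 25% of value
expected_cost = error_rate * error_cost             # expected error cost per invoice
share_of_value = expected_cost / transaction_value  # as a fraction of invoice value
print(f"expected error cost: ${expected_cost:.0f} per invoice "
      f"({share_of_value:.0%} of value)")
```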
Step 2: Run a calibration test
Take 500-1,000 representative examples. Run them through your AI system and record the confidence score for each output. Have humans review all outputs and mark which ones are correct.
Plot accuracy against confidence score in 5-point buckets:
- 0.95-1.00: What percentage were actually correct?
- 0.90-0.95: What percentage were actually correct?
- 0.85-0.90: What percentage were actually correct?
- And so on...
This tells you the real-world accuracy at each confidence level for your specific use case.
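The bucketing in Step 2 can be sketched as a small routine, assuming `reviewed` is the list of `(confidence, human_verdict)` pairs from the calibration test (all names illustrative):

```python
def bucket_accuracy(reviewed, width=5, floor=70):
    """Real accuracy per confidence bucket (keys are percent lower bounds)."""
    buckets = {}
    for conf, correct in reviewed:
        pct = int(round(conf * 100))   # work in percent to avoid float bucket keys
        if pct < floor:
            continue
        lo = floor + width * ((pct - floor) // width)   # e.g. 0.87 -> bucket 85
        buckets.setdefault(lo, []).append(correct)
    return {lo: sum(v) / len(v) for lo, v in sorted(buckets.items())}

reviewed = [(0.97, True), (0.92, True), (0.91, False),
            (0.87, True), (0.86, True), (0.83, False)]
print(bucket_accuracy(reviewed))   # {80: 0.0, 85: 1.0, 90: 0.5, 95: 1.0}
```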
Step 3: Set your thresholds
Based on your calibration data and error cost calculation:
Auto-approve threshold: Set this where accuracy meets or exceeds your required level. If you need 96% accuracy and your calibration shows 96% accuracy at 0.88 confidence, set your auto-approve at 0.90 (adding a 2-point safety buffer).
Human review threshold: Set this 10-15 points below auto-approve. Cases between 0.75-0.90 go to human review. This catches borderline cases before they become errors.
Auto-reject threshold: Anything below 0.75 gets rejected immediately. The AI isn't confident enough to be useful, so don't waste human time reviewing it.
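One way to turn calibration results into thresholds, following the safety-buffer and review-band rules above. This is a sketch, not a definitive implementation: `bucket_acc` is assumed to map each bucket's lower bound (in percent) to its measured accuracy.

```python
def pick_thresholds(bucket_acc, target, buffer=0.02, review_band=0.15):
    """Auto-approve at the lowest bucket meeting the target, plus a safety buffer."""
    qualifying = [lo for lo, acc in sorted(bucket_acc.items()) if acc >= target]
    if not qualifying:
        raise ValueError("no confidence bucket meets the accuracy target")
    auto_approve = round(qualifying[0] / 100 + buffer, 2)
    return {
        "auto_approve": auto_approve,
        "review_floor": round(auto_approve - review_band, 2),  # auto-reject below this
    }

# 96% accuracy first appears in the 0.88 bucket -> approve at 0.90, review down to 0.75.
print(pick_thresholds({80: 0.86, 85: 0.91, 88: 0.96, 95: 0.98}, target=0.96))
```

The `ValueError` branch matters: if no bucket hits your required accuracy, the honest answer is not to automate at all, not to lower the target.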
Step 4: Document your decision
Create a one-page threshold specification:
AI System: Contract Clause Extraction
Model: GPT-4 with custom prompt
Use Case: NDA review automation
Thresholds:
- Auto-approve: ≥0.90 (expected accuracy: 96%)
- Human review: 0.75-0.89 (expected accuracy: 88%)
- Auto-reject: <0.75
Rationale:
- Error cost: $5,000 (average cost to remediate missed clause)
- Transaction value: $50,000 (average contract value)
- Required accuracy: 95% (error cost = 10% of transaction value)
- Calibration date: 2024-01-15
- Sample size: 847 contracts
Review schedule: Monthly for first 3 months, then quarterly
Building Escalation Workflows
Thresholds are useless without clear routing rules. Here's how to operationalize them:
For auto-approve cases (≥0.90):
- Process automatically
- Log decision and confidence score
- Sample 5% for spot-check audits
- Flag for review if downstream systems reject the output
For human review cases (0.75-0.89):
- Route to review queue with priority based on confidence (lower confidence = higher priority)
- Show the AI's output and confidence score to the reviewer
- Require explicit approve/reject decision
- Track reviewer agreement rate with AI output
- If agreement rate >95% for 3 consecutive months, consider lowering auto-approve threshold
For auto-reject cases (<0.75):
- Return to sender with specific error message
- Do not waste human review time
- Log rejection reason and confidence score
- If rejection rate >20%, investigate root cause (bad input data, model drift, prompt issues)
Example routing rule in pseudocode:
    if confidence >= 0.90:
        auto_approve()
        log_decision(confidence, output)
        if random() < 0.05:
            add_to_audit_queue()
    elif confidence >= 0.75:
        route_to_human_review(priority = 1 - confidence)
        show_ai_output_to_reviewer()
        require_explicit_decision()
    else:
        auto_reject()
        log_rejection(confidence, reason)
        notify_sender("Insufficient confidence for processing")
Calibrating Over Time
Confidence thresholds drift. Model updates, changing input data, and evolving business requirements all affect the accuracy-confidence relationship. Recalibrate quarterly at minimum.
Monthly monitoring (15 minutes):
Track three metrics in your dashboard:
- Accuracy by confidence bucket: Is the 0.90+ bucket still hitting 96% accuracy?
- Coverage rate: What percentage of cases are auto-approved vs. reviewed vs. rejected?
- Reviewer agreement rate: How often do humans agree with AI outputs in the review queue?
If accuracy drops 2+ percentage points in any bucket, trigger a recalibration. If coverage drops 10+ percentage points, investigate for input data quality issues.
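The monthly bucket check can be automated. A sketch, assuming `baseline` and `current` map bucket lower bounds to measured accuracy (names illustrative):

```python
def drift_alerts(baseline, current, max_drop=0.02):
    """Buckets whose accuracy fell 2+ percentage points since calibration."""
    alerts = []
    for bucket, base_acc in sorted(baseline.items()):
        cur_acc = current.get(bucket)
        if cur_acc is not None and base_acc - cur_acc >= max_drop:
            alerts.append((bucket, base_acc, cur_acc))
    return alerts

baseline = {90: 0.96, 95: 0.98}
current = {90: 0.93, 95: 0.98}   # the 0.90+ bucket slipped 3 points
print(drift_alerts(baseline, current))   # [(90, 0.96, 0.93)]
```

Any non-empty alert list is the trigger for the recalibration procedure below, rather than an automatic threshold change.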
Quarterly recalibration (2-3 hours):
- Pull 500 recent cases with confidence scores and human review decisions
- Recalculate accuracy by confidence bucket
- Compare to your original calibration data
- Adjust thresholds if accuracy has shifted 3+ percentage points
- Update your threshold specification document
- Communicate changes to all users
Example calibration shift:
Original calibration (January 2024):
- 0.90+ confidence = 96% accuracy
Q2 recalibration (April 2024):
- 0.90+ confidence = 93% accuracy (3-point drop)
Action: Raise auto-approve threshold from 0.90 to 0.93 to maintain 96% accuracy target. Update documentation. Notify review team that coverage will drop from 68% to 61% temporarily while investigating root cause of accuracy decline.
Common causes of threshold drift:
- Model updates from vendor (GPT-4 to GPT-4.5, for example)
- Changes in input data distribution (new contract types, different client mix)
- Prompt modifications that affect output confidence
- Seasonal patterns in case complexity
Real-World Threshold Examples
Invoice processing (accounting firm):
- Auto-approve: 0.92 (handles 73% of invoices, 97% accuracy)
- Human review: 0.80-0.91 (handles 21% of invoices)
- Auto-reject: <0.80 (6% of invoices, usually missing data or poor scan quality)
Legal document classification (law firm):
- Auto-approve: 0.95 (handles 58% of documents, 98% accuracy)
- Human review: 0.85-0.94 (handles 35% of documents)
- Auto-reject: <0.85 (7% of documents, usually non-standard formats)
Client intake screening (consulting firm):
- Auto-approve: 0.88 (handles 81% of inquiries, 94% accuracy)
- Human review: 0.75-0.87 (handles 16% of inquiries)
- Auto-reject: <0.75 (3% of inquiries, usually incomplete submissions)
The pattern: higher-risk use cases need higher thresholds. Lower-risk use cases can tolerate more automation at lower confidence levels.
Set your thresholds based on calibration data, not gut feel. Monitor them monthly. Recalibrate quarterly. Adjust when accuracy drifts. This is how you maintain reliable AI automation over time.

Reviewed by Revenue Institute
This guide is actively maintained and reviewed by the implementation experts at Revenue Institute. As the creators of The AI Workforce Playbook, we test and deploy these exact frameworks for professional services firms scaling without new headcount.
Revenue Institute
Need help turning this guide into reality? Revenue Institute builds and implements the AI workforce for professional services firms.