Have you not heard the tale of the woman
who promised to change the world with a drop of blood,
who built a nine-billion-dollar company on a test that never worked?
Palo Alto, 2003
STANFORD UNIVERSITY
A nineteen-year-old dropped out with a vision: hundreds of blood tests from a single drop.

Investors believed. Walgreens believed. The Pentagon believed.

They invested hundreds of millions and valued her company at $9 billion.

But the tests gave wrong results. Patients were told they had HIV when they didn't. Patients were told their blood was normal when they were dying.
Carreyrou J. Bad Blood. 2018
The Decision Tree of Deception

What Theranos Did vs. What Should Happen

New Diagnostic Test
SHOULD DO
Validate Against Gold Standard
Publish TP/FP/FN/TN
FDA Approval
THERANOS DID
Skip Validation
Hide Failures
Harm Patients
"And the test lied,
and the lie was dressed in certainty,
and no one asked for the 2×2 table."

This is why we study Diagnostic Test Accuracy.

When a test speaks,
there are only four possible truths.

Two are blessings. Two are curses.
The Tree of Outcomes

Every Test Result Has a Reality Behind It

Patient Tested
What is the TRUTH?
Has Disease
D+
Test + → TP
Test - → FN
No Disease
D-
Test + → FP
Test - → TN
The Sacred 2×2 Table

HIV Rapid Test Example (Real Data)

           HIV+    HIV-    Total
Test +       98       3      101
Test -        2     895      897
Total       100     898      998
FROM THIS TABLE COMES ALL TRUTH
Sensitivity = 98/100 = 98%
Specificity = 895/898 = 99.7%
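The two headline numbers fall straight out of the table. A minimal sketch in Python, using the counts above (the helper function names are mine, not from any library):

```python
# Sensitivity and specificity from the HIV rapid test 2x2 table.
# Counts from the table: TP=98, FN=2, FP=3, TN=895.

def sensitivity(tp: int, fn: int) -> float:
    """Of all the sick, how many did the test catch?"""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Of all the healthy, how many did the test spare?"""
    return tn / (tn + fp)

tp, fn, fp, tn = 98, 2, 3, 895

print(f"Sensitivity = {sensitivity(tp, fn):.1%}")   # 98.0%
print(f"Specificity = {specificity(tn, fp):.1%}")   # 99.7%
```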
"Two outcomes save. Two outcomes harm.
TP, TN: the test spoke true.
FP, FN: the test lied.
Know them by name, for they determine fate."
Have you not heard of the blood that was tested,
found clean,
and given to thousands—
while death swam within it?
The Blood Supply Crisis, 1985
UNITED STATES
When HIV testing began, doctors celebrated: they could now screen the blood supply.

But the test had a window period—weeks after infection when the virus was present but undetectable.

Blood was tested. Blood was "negative." Blood was transfused.

8,000-12,000 Americans were infected through transfusions before better tests closed the window.
CDC. MMWR. 1987;36(49):833-840
The Window Period Decision Tree

Why False Negatives Are Deadly

Person Recently Infected
Time Since Infection?
< 2 weeks
Test NEGATIVE (virus present!)
Blood Donated → Others infected
> 4 weeks
Test POSITIVE (correctly detected)
Blood Discarded → Supply safe
Sensitivity Changes Over Time
Day 1-7 (eclipse period): 0%
Day 14 (seroconversion): ~50%
Day 21 (most detected): ~95%
Day 45+ (window closed): 99.9%
THE LESSON
Sensitivity is not fixed. It depends on when you test. A "99% sensitive" test may be 0% sensitive in early infection.
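The lesson can be put in numbers with Bayes' theorem: the chance that a "negative" unit of blood is actually infected depends on sensitivity at the moment of testing. A sketch, assuming a hypothetical donor prevalence of 0.1% and a fixed specificity of 99.9% (both illustrative, not historical figures):

```python
# How the window period erodes a negative result:
# P(infected | test negative) via Bayes' theorem.

def p_infected_given_negative(prev: float, sens: float, spec: float) -> float:
    fn = prev * (1 - sens)        # infected but test-negative
    tn = (1 - prev) * spec        # healthy and test-negative
    return fn / (fn + tn)

prev, spec = 0.001, 0.999         # assumed prevalence and specificity
for day, sens in [("Day 1-7", 0.0), ("Day 14", 0.50),
                  ("Day 21", 0.95), ("Day 45+", 0.999)]:
    p = p_infected_given_negative(prev, sens, spec)
    print(f"{day:8s} sens={sens:5.1%}  P(infected | negative) = {p:.4%}")
```

At 0% sensitivity the negative result carries no information at all: the post-test risk equals the prevalence.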
"And the test said 'clean,'
for the virus had not yet shown its face.
And the blood was shared,
and the infection spread to the innocent."
Have you not heard of the pill given to mothers
to protect their pregnancies,
that planted cancer in their daughters
twenty years before it bloomed?
The DES Tragedy, 1938-1971
UNITED STATES & EUROPE
Diethylstilbestrol (DES) was given to millions of pregnant women to prevent miscarriage.

No proper clinical trial was ever conducted. Doctors assumed it worked because it seemed reasonable.

Decades later, their daughters developed a rare cancer: clear cell adenocarcinoma of the vagina. A cancer so rare it was a diagnostic signal in itself.

5-10 million women were exposed. The harm crossed generations.
Herbst AL et al. N Engl J Med. 1971;284:878-881
The Validation Decision Tree

What Should Have Happened

New Medical Intervention
Was It Properly Tested?
YES
Randomized Trial
Long-term Follow-up
Know True Effects: benefits AND harms
NO (DES)
Assumption Only
Widespread Use
Hidden Harm: discovered too late
The Diagnostic Signal
WHEN RARITY BECOMES EVIDENCE
Clear cell adenocarcinoma of the vagina was so rare in young women that 7 cases in one hospital triggered an investigation.

The cluster itself was the diagnostic test:
Predictive value for DES exposure: nearly 100%
If you have this cancer at this age, you were almost certainly exposed.
1:1000 risk of clear cell cancer in DES daughters
5-10M women exposed worldwide
"And the mothers took the pill in hope,
and the daughters grew in shadow,
and twenty years later the cancer bloomed—
a diagnosis that indicted a generation of medicine."
A test has two virtues and two vices.

Sensitivity: Can it find the sick?

Specificity: Can it spare the healthy?
Sensitivity: The Hunter
THE FORMULA
Sensitivity = TP / (TP + FN)
"Of all the sick, how many did we catch?"

Worked Example: COVID PCR Test

Given: 200 infected patients tested
TP = 196 (correctly positive), FN = 4 (missed)
Sensitivity = 196 / (196 + 4) = 196/200 = 98%
Interpretation: Test catches 98 of every 100 infected people
Specificity: The Guardian
THE FORMULA
Specificity = TN / (TN + FP)
"Of all the healthy, how many did we spare?"

Worked Example: Same COVID PCR Test

Given: 1000 uninfected people tested
TN = 999 (correctly negative), FP = 1 (false alarm)
Specificity = 999 / (999 + 1) = 999/1000 = 99.9%
Interpretation: Test correctly clears 999 of every 1000 healthy people
The Memory Rules

When to Use Which Test

What Do You Need?
RULE OUT disease
Use HIGH SENSITIVITY
SnNout: Sensitive test, Negative result = rules OUT
RULE IN disease
Use HIGH SPECIFICITY
SpPin: Specific test, Positive result = rules IN
"Sensitivity catches the sick.
Specificity spares the well.
But no test masters both perfectly—
this is the burden we bear."
Have you not seen the physician
who saw "99% accurate"
and believed a positive result meant 99% certainty?

This is the deadliest error in medicine.
The Base Rate Fallacy
THE PUZZLE
A disease affects 1 in 1000 people.
A test is 99% sensitive and 99% specific.
A patient tests positive.

What is the probability they have the disease?

Most doctors say ~99%. The real answer is about 9%.
The Math Revealed

Testing 100,000 People (Prevalence 1/1000)

Step 1: 100 have disease, 99,900 healthy
Step 2: Of 100 sick: 99 test positive (TP), 1 negative (FN)
Step 3: Of 99,900 healthy: 999 test positive (FP), 98,901 negative (TN)
Step 4: Total positives = 99 + 999 = 1,098
PPV = TP / All Positives = 99 / 1,098 = 9%
91% of positive results are FALSE POSITIVES!
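The 100,000-person walkthrough above is just Bayes' theorem. A sketch that reproduces it, and shows how the same test behaves at the prevalences used later in this section (function name is mine):

```python
# Positive predictive value from prevalence, sensitivity, and specificity.

def ppv(prev: float, sens: float, spec: float) -> float:
    tp = prev * sens                  # true positives per person tested
    fp = (1 - prev) * (1 - spec)      # false positives per person tested
    return tp / (tp + fp)

sens = spec = 0.99
for prev in (0.001, 0.10, 0.50):      # general pop, high-risk, confirmatory
    print(f"prevalence {prev:6.1%} -> PPV = {ppv(prev, sens, spec):.0%}")
```

At 0.1% prevalence the PPV is about 9%; the same test reaches 92% in a high-risk clinic and 99% in confirmatory use.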
Interactive Base Rate Calculator

See How Prevalence Changes PPV

Prevalence: 0.1%
Sensitivity: 99%
Specificity: 99%
Positive Predictive Value (PPV): 9%
91% of positives are false alarms
The Decision Tree of Prevalence

Same Test, Different Settings

Test: 99% Sens, 99% Spec
Where Is Testing Done?
General Pop (prevalence 0.1%): PPV = 9%, 91% false +
High-Risk (prevalence 10%): PPV = 92%, 8% false +
Confirmatory (prevalence 50%): PPV = 99%, 1% false +
"And the doctor said '99% accurate,'
and the patient heard '99% certain,'
and both were deceived—
for they forgot to ask: How rare is this disease?"
Have you not heard of the machine
that could find TB in two hours,
that was called revolutionary
but missed the drug-resistant strains?
The GeneXpert Story, South Africa
CAPE TOWN, 2010
For a century, TB diagnosis required growing bacteria for weeks. Then came GeneXpert: results in 2 hours.

South Africa deployed it nationwide. The WHO endorsed it.

But in patients with low bacterial loads—often HIV co-infected—sensitivity dropped to 67%. One in three cases missed.

And for detecting rifampicin resistance, it missed 5% of resistant cases. Those patients received the wrong treatment. The resistant TB spread.
Steingart KR et al. Cochrane Database Syst Rev. 2014;1:CD009593
TB Diagnosis Decision Tree

When GeneXpert Is Not Enough

Suspected TB Patient
GeneXpert Test
Positive
Rifampicin?
Sensitive → Standard Tx
Resistant → MDR-TB Tx
Negative
HIV+ or High Suspicion?
Yes → Culture needed
No → Likely negative
Sensitivity by Patient Type
Smear-positive (high bacterial load): 98%
Smear-negative (low bacterial load): 67%
HIV co-infected (immune suppressed): 61%
THE LESSON
A test's sensitivity in clinical trials may not match its sensitivity in your patients. Know your population.
"And the machine said 'negative,'
and the doctor believed the machine,
and the patient went home with TB in their lungs,
coughing resistance into the world."
Have you not heard of the test for men
that found cancers that would never kill,
and led to treatments that destroyed lives?
The PSA Screening Tragedy
UNITED STATES, 1990s-2010s
PSA (Prostate-Specific Antigen) could detect prostate cancer early.

Doctors screened millions of men. Cancers were found. Prostates were removed.

But many of these "cancers" would never have caused symptoms. The surgery caused impotence and incontinence in men who would have died of old age, not cancer.
Moyer VA. Ann Intern Med. 2012;157:120-134
The Numbers of Harm
1 life saved from prostate cancer per 1,000 screened
30-40 men made impotent or incontinent per 1,000 screened
100+ false positives (biopsies, anxiety) per 1,000 screened
THE REVERSAL
In 2012, the US Preventive Services Task Force recommended against routine PSA screening. The test was finding too much that didn't need finding.
Patient Decision Aid: PSA Screening

If 1,000 Men Age 55-69 Are Screened for 13 Years

Deaths from prostate cancer prevented
1-2 men
Men who will have false positive requiring biopsy
100-120 men
Men diagnosed with cancer that would never harm them
20-50 men
Men left impotent or incontinent from treatment
30-40 men
Is this trade-off acceptable to you?
"And the test found the shadow,
and the surgeon cut,
and the man lived—impotent, incontinent—
from a cancer that would never have woken."
Have you not heard of the man with chest pain
whose first troponin was normal,
who was sent home—
and died before morning?
The Troponin Timing Problem
EMERGENCY DEPARTMENTS WORLDWIDE
Troponin is the gold standard for heart attack diagnosis. But it takes 3-6 hours to rise after myocardial injury.

A patient arrives one hour after chest pain begins. Troponin is tested: normal. "You're fine. Go home."

The heart was dying. The protein hadn't leaked yet.

Studies show 2-5% of MI patients sent home from ED die within 30 days.
Pope JH et al. N Engl J Med. 2000;342:1163-1170
Serial Testing Decision Tree

The Two-Troponin Protocol

Chest Pain Patient
First Troponin
Elevated
Treat as MI
Normal
When Did Pain Start?
<6 hrs
Wait 3 hrs → Repeat troponin
>6 hrs
Low risk → Consider discharge
High-Sensitivity Troponin
Conventional troponin, sensitivity at 0 hrs: ~70%
hs-Troponin, sensitivity at 0 hrs: ~95%
hs-Troponin, serial at 3 hrs: 99%
THE TRADE-OFF
High-sensitivity troponin catches more heart attacks early. But it also has more false positives—elevated in kidney disease, heart failure, sepsis, and marathon runners.
"And the test said 'normal,'
for the heart had just begun to die.
And the patient was reassured,
and went home to finish dying."
Sensitivity describes the test.
Specificity describes the test.

But the patient asks:
"I tested positive. What are MY chances?"
Likelihood Ratios
POSITIVE LIKELIHOOD RATIO
LR+ = Sensitivity / (1 - Specificity)
How much more likely is a + result in sick vs healthy?
NEGATIVE LIKELIHOOD RATIO
LR- = (1 - Sensitivity) / Specificity
How much more likely is a - result in sick vs healthy?
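Both ratios are one-line computations from sensitivity and specificity. A sketch, applied to the HIV rapid test from the 2×2 table earlier (function names are mine):

```python
# Likelihood ratios from sensitivity and specificity.

def lr_positive(sens: float, spec: float) -> float:
    # How much a positive result multiplies the odds of disease
    return sens / (1 - spec)

def lr_negative(sens: float, spec: float) -> float:
    # How much a negative result multiplies the odds of disease
    return (1 - sens) / spec

sens, spec = 0.98, 0.997          # HIV rapid test from the 2x2 table
print(f"LR+ = {lr_positive(sens, spec):.0f}")    # ~327: strong rule-in
print(f"LR- = {lr_negative(sens, spec):.3f}")    # ~0.020: strong rule-out
```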
The Fagan Nomogram

From Pre-Test to Post-Test Probability

[Nomogram with three parallel scales: pre-test probability (1% to 99%), likelihood ratio (0.01 to 100), and post-test probability (1% to 99%)]
Draw a line from pre-test through LR to find post-test probability
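The nomogram is simply the odds form of Bayes' theorem drawn as a straight line: pre-test odds × LR = post-test odds. A sketch of the same arithmetic (function name is mine):

```python
# The Fagan nomogram as arithmetic: pre-test odds x LR = post-test odds.

def post_test_probability(pre_test_prob: float, lr: float) -> float:
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Example: 5% pre-test probability, then a positive test with LR+ = 10
p = post_test_probability(0.05, 10)
print(f"post-test probability = {p:.0%}")   # 34%
```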
Interpreting Likelihood Ratios

How Powerful Is This Test?

LR+ Value?
LR+ > 10: strong rule-in
5-10: moderate
2-5: weak
1-2: useless
LR- Value?
< 0.1: strong rule-out
0.1-0.2: moderate
0.2-0.5: weak
0.5-1: useless
"Sensitivity tells of the sick.
Specificity tells of the well.
But the likelihood ratio answers:
What does this result mean for THIS patient?"
Have you not seen the child with fever in the village,
the rapid test that said negative,
and the Plasmodium that kept multiplying?
The Malaria RDT Problem
SUB-SAHARAN AFRICA
Malaria kills 600,000 people yearly, mostly children under 5.

Rapid Diagnostic Tests were meant to guide treatment in remote areas without microscopes or laboratories.

But when parasitemia is low—the RDT misses cases. And when P. falciparum deletes the HRP2 gene— the RDT sees nothing at all.
WHO. Malaria RDT Performance. 2022
The Clinical Decision Tree

Child with Fever in Malaria-Endemic Area

Febrile Child
Perform RDT
RDT Positive
Treat for Malaria
RDT Negative
Clinical Suspicion?
High
Treat Anyway, or Microscopy
Low
Look for Other Cause
Sensitivity Varies by Parasitemia
High parasitemia (>200/μL): 95%
Low parasitemia (100-200/μL): 75%
Very low (<100/μL): 50%
THE CLINICAL LESSON
A negative RDT does not rule out malaria in endemic areas. Clinical judgment must override the test when suspicion is high.
"And the test said 'negative,'
and the child was sent home,
and the parasites multiplied in the dark,
and by morning the child could not wake."
In the year of pestilence,
the world needed a test that was fast.

But fast is not the same as accurate.
The Cochrane Verdict

COVID-19 Rapid Antigen Tests (155 Studies)

Population      Sensitivity   Missed
Symptomatic         73%         27%
Asymptomatic        55%         45%
First 7 days        80%         20%

Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705

The False Security Decision Tree

Thanksgiving 2020: What Happened

Family Member Tests Negative
Truly Negative?
55% if asymptomatic
True Negative → Safe to gather
45% if asymptomatic
FALSE Negative → Infectious!
Gathers with Family → Grandparents infected
"And the test said 'negative,'
and the family embraced,
and by winter's end,
the grandfather was buried."
Have you not heard of the screening
that found cancers that would never kill,
and led to treatments that caused more harm than the disease?
The Overdiagnosis Problem
3-4 lives saved per 10,000 screened
~15 overdiagnosed (treated unnecessarily)
~500 false alarms (anxiety, biopsies)
THE QUESTION
To save 3-4 lives, ~15 women receive surgery, radiation, and chemotherapy for cancers that would never have harmed them.

Is this trade-off worth it?
Patient Decision Aid: Mammography

If 10,000 Women Age 50-69 Are Screened for 10 Years

Deaths from breast cancer prevented
3-4 women
Women called back for false alarms
~500 women
Unnecessary biopsies
~200 women
Women treated for cancer that would never harm them
~15 women
Is screening right for you?
The Screening Cascade Decision Tree

10,000 Women Screened Over 10 Years

10,000 Women
~1,000 Recalled (abnormal)
→ ~500 False Alarm
→ ~500 Biopsy → ~50 cancer
~9,000 Cleared
Of ~50 Cancers Found
~35 Would Kill → 3-4 deaths prevented
~15 Would Never Kill → Overdiagnosed
"And the test found the shadow,
and called it cancer,
and the woman was cut and burned—
for a shadow that would never have darkened her days."
Have you not heard of the scan
that finds the plaques in the brain,
but cannot tell you
if the mind will fade?
The Amyloid Paradox
ALZHEIMER'S RESEARCH, 2010s-2020s
PET scans can now detect amyloid plaques—the hallmark of Alzheimer's.

But 30% of cognitively normal elderly have amyloid plaques. They may never develop dementia.

And 10-20% of people with dementia have no amyloid.

The test finds the plaques. But plaques are not the disease. We are testing for a surrogate, not the outcome.
Jack CR et al. Lancet Neurol. 2018;17:760-773
Surrogate vs. Outcome Decision Tree

What Are We Really Testing For?

Diagnostic Test
What Does It Detect?
Outcome itself
Direct Diagnosis (e.g., biopsy for cancer)
High clinical value
Surrogate marker
Indirect Signal (e.g., amyloid for dementia)
Validated link?
Yes → Use cautiously
No → Limited value
"And the scan found the plaques,
and the doctor named it Alzheimer's,
and the patient lived in terror—
of a forgetting that might never come."
Not all studies are created equal.

Some are biased.
Some are poorly designed.
Some should not be trusted.

How do we separate the wheat from the chaff?
QUADAS-2: The Quality Checklist

Four Domains of Risk of Bias

1
Patient Selection

Was a consecutive or random sample enrolled? Was a case-control design avoided?

2
Index Test

Was the test interpreted without knowledge of the reference standard? Was the threshold pre-specified?

3
Reference Standard

Is the reference standard likely to correctly classify the condition? Was it interpreted blind?

4
Flow and Timing

Was there appropriate interval between tests? Did all patients receive the same reference standard?

QUADAS-2 Decision Tree

Should You Trust This Study?

DTA Study
Check All 4 Domains
All Low Risk
High Quality → Trust results
Some Unclear
Moderate → Use with caution
Any High Risk
Low Quality → Results may be biased
Common Biases in DTA Studies

Verification Bias: only positive tests get the reference standard → inflates sensitivity

Spectrum Bias: study population differs from clinical reality → results don't generalize

Incorporation Bias: index test is part of the reference standard → artificially high accuracy

Review Bias: index test interpreted knowing the reference result → inflates both metrics

"Before you trust the numbers,
ask: How were they gathered?
A biased study speaks with confidence—
but its confidence is a lie."
One study may deceive.
One study may flatter.

But when you gather all the evidence
the truth becomes harder to hide.
Why DTA Meta-Analysis Is Different
THE PROBLEM
Sensitivity and specificity are correlated. When one goes up, the other tends to go down.

You cannot pool them separately like treatment effects. You need the bivariate model.
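The trade-off driving that correlation is easy to demonstrate: sweep the positivity threshold over a simulated biomarker and watch sensitivity and specificity move in opposite directions. The Gaussian distributions and cutoffs below are illustrative assumptions, not real assay data:

```python
# Why sensitivity and specificity trade off: vary the positivity
# threshold on a simulated biomarker (sick patients run higher).
import random

random.seed(0)
sick    = [random.gauss(2.0, 1.0) for _ in range(10_000)]
healthy = [random.gauss(0.0, 1.0) for _ in range(10_000)]

for cutoff in (0.5, 1.0, 1.5, 2.0):
    sens = sum(x >= cutoff for x in sick) / len(sick)
    spec = sum(x < cutoff for x in healthy) / len(healthy)
    print(f"cutoff {cutoff:.1f}: sens={sens:.2f}  spec={spec:.2f}")

# Lowering the cutoff raises sensitivity but lowers specificity --
# studies using different thresholds therefore trace out a curve,
# which is what the bivariate model and the SROC capture.
```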
The SROC Curve

Summary Receiver Operating Characteristic

[SROC plot: sensitivity against 1 - specificity (false positive rate); individual studies as points, summary estimate as a curve]
Reading the SROC

What Does the Curve Tell You?

SROC Curve Position
Top-Left Corner
Excellent Test: high sens + spec
Near Diagonal
Useless Test: no better than chance
Points Scattered
High Heterogeneity: investigate sources
"One study may deceive.
Many studies, weighed together,
trace the path of truth—
the SROC curve that reveals what the test can truly do."
But what if the studies disagree?

One says sensitivity is 95%.
Another says 60%.

Which truth do you believe?
Sources of Heterogeneity

Why Studies Disagree

Same Test, Different Results?
Threshold: different cutoffs
Population: severity, age
Setting: primary vs specialist care
Quality: bias, blinding
Measuring Disagreement: I²
I² < 25%: low, studies agree
I² 25-75%: moderate, some variation
I² > 75%: high, major disagreement
THE WARNING
When I² > 75%, the pooled estimate may be meaningless. Explain the disagreement before averaging.
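I² is computed from Cochran's Q statistic: the share of observed between-study variation beyond what chance alone would produce. A minimal sketch (the Q value below is hypothetical, chosen only to illustrate the formula):

```python
# I-squared from Cochran's Q: I^2 = max(0, (Q - df) / Q),
# where df = number of studies - 1.

def i_squared(q: float, n_studies: int) -> float:
    df = n_studies - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0

# Hypothetical example: Q = 40 across 11 studies (df = 10)
print(f"I^2 = {i_squared(40, 11):.0%}")   # 75%
```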
"When the studies disagree,
do not silence the dissent.
Ask: Why do they see differently?
The disagreement itself teaches."
Your DTA Toolkit
The essential measures and when to use them
The Checklist

Was there a valid reference standard?

Gold standard applied to ALL patients?

Were interpreters blinded?

Test readers unaware of diagnosis?

Was the spectrum appropriate?

Patients similar to your population?

Was the threshold pre-specified?

Or chosen to maximize results?

When Results Don't Match Suspicion

The Clinical Override Decision Tree

Test Negative, High Suspicion
What Is the LR-?
LR- < 0.1
Strong rule-out → Accept negative
LR- 0.1-0.5
Repeat test, or a different test
LR- > 0.5
Trust judgment: the test is weak
Sequential Testing Decision Tree

When One Test Isn't Enough

Initial Screening Test
Positive
Confirmatory Test (high specificity)
Positive → Diagnose
Negative → False alarm
Negative
Likely negative, if the screen was high-sensitivity
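Sequential testing can be computed by chaining likelihood ratios through the odds form of Bayes' theorem. This sketch assumes the two tests are conditionally independent, a strong assumption that is often only approximately true; the LR of 99 corresponds to a 99%-sensitive, 99%-specific test:

```python
# Sequential testing: each result multiplies the disease odds by its LR.

def update(prob: float, lr: float) -> float:
    odds = prob / (1 - prob) * lr
    return odds / (1 + odds)

prev = 0.001                    # pre-test probability in screening
p1 = update(prev, 99)           # positive screen, LR+ = 99
p2 = update(p1, 99)             # positive confirmatory test, LR+ = 99
print(f"after screen:       {p1:.1%}")
print(f"after confirmation: {p2:.1%}")
```

One positive result moves a rare disease from 0.1% to about 9%; a second independent positive moves it past 90%. This is why screening and confirmation belong together.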
"Armed with sensitivity, specificity, likelihood,
armed with the SROC and the measure of agreement,
you can see through the lie of the test—
and judge its truth for yourself."
Have you not heard of the patient
who received the wrong blood,
not because the test was wrong,
but because no one performed it?
The Test That Wasn't Done
HOSPITALS WORLDWIDE
ABO blood typing is nearly 100% accurate when performed.

Yet transfusion reactions still kill—not from test failure, but from human failure:

• Wrong blood drawn from wrong patient
• Labels switched in the lab
• Bedside check skipped in emergency

In the UK, 1 in 13,000 transfusions goes to the wrong patient. The test worked. The system failed.
Bolton-Maggs PHB. Transfus Med. 2016;26:303-311
Test vs. System Decision Tree

Where Can Things Go Wrong?

Diagnostic Process
Error Source?
Test itself
Analytical Error: sens/spec issue
Better test needed
Pre-analytical
Wrong sample, ID error
System fix needed
Post-analytical
Wrong action, reporting error
Process fix needed
"The perfect test means nothing
if the wrong blood is drawn,
the wrong label is applied,
the wrong bag is hung."

DTA studies measure test accuracy. They do not measure system accuracy.

Have you not seen the algorithm
that learned from biased data,
and spread that bias
to every patient it touched?
The AI Diagnostic Revolution
STANFORD & BEYOND, 2017-PRESENT
Deep learning algorithms now match dermatologists at detecting skin cancer.

But the training data was predominantly light skin. On dark skin, performance dropped significantly.

The algorithm learned the patterns—but also the biases.

And when deployed without external validation, it performed worse than expected because the training population didn't match the clinical population.
Esteva A et al. Nature. 2017;542:115-118; Adamson AS. JAMA Dermatol. 2018
AI Validation Decision Tree

Is This AI Ready for Clinical Use?

AI Diagnostic Tool
Validation Type?
Internal only
High Risk: overfitting likely
Not ready
External validation
Better, but check the population
Matches your patients?
Yes → Consider use
No → Caution
Prospective RCT
Gold Standard: patient outcomes
AI Calibration: The Hidden Problem
DISCRIMINATION VS. CALIBRATION
Discrimination (AUC/ROC): Can the AI rank patients by risk?

Calibration: When the AI says "80% risk," do 80% actually have disease?

Many AI tools have good AUC but poor calibration. This is the base rate fallacy in algorithmic form.
AUC: can it rank? (usually reported)
Calibration: is the probability accurate? (often ignored)
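The distinction is easy to show on toy numbers. In this sketch (the predictions and labels are invented for illustration), the model ranks patients well yet its stated probabilities run systematically high:

```python
# Discrimination vs. calibration on hypothetical predictions.

preds = [0.9, 0.8, 0.8, 0.7, 0.3, 0.3, 0.2, 0.1]   # model "risk" outputs
truth = [1,   1,   0,   1,   0,   0,   0,   0]      # actual disease status

# Discrimination (AUC): fraction of sick/healthy pairs ranked correctly
pairs = [(p, q) for p, s in zip(preds, truth) if s
                for q, h in zip(preds, truth) if not h]
auc = sum((p > q) + 0.5 * (p == q) for p, q in pairs) / len(pairs)

# Calibration: does the mean prediction match the observed event rate?
print(f"AUC            = {auc:.2f}")                       # good ranking
print(f"mean predicted = {sum(preds)/len(preds):.2f}")     # overconfident
print(f"observed rate  = {sum(truth)/len(truth):.2f}")
```

Here the AUC is 0.90, but the model predicts an average risk of 0.51 against an observed rate of 0.38: good discrimination, poor calibration.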
"And the algorithm learned from the data,
and the data was biased,
and the bias spread to every prediction—
and no one asked: Who was missing from the training set?"
The patient asks: "Is my test positive?"

But what they mean is:
"Do I have the disease?"

How do you bridge this gap?
Communication Scripts
SCRIPT 1: EXPLAINING A POSITIVE RESULT
"Your test came back positive. But I want to explain what that means."

"This test is good at finding people with the condition, but it also has false alarms."

"Based on your risk factors, there's about a [X]% chance this is a true positive."

"We'll do a confirmatory test to be certain before any treatment."
Communication Scripts
SCRIPT 2: EXPLAINING A NEGATIVE RESULT (HIGH SUSPICION)
"Your test came back negative, but I'm still concerned."

"This test can miss cases, especially early in the illness."

"Given your symptoms, I'd like to either repeat the test in a few days, or try a different test."

"A negative test doesn't always mean you're clear—your symptoms matter too."
Communication Decision Tree

How to Explain Test Results

Test Result
Positive
PPV?
>90% → "Very likely true"
<90% → "Need to confirm"
Negative
NPV?
>95% → "Very reassuring"
<95% → "Still watch symptoms"
Questions to Ask Your Doctor
1. "How accurate is this test?" Ask for sensitivity and specificity in plain language.

2. "What if the result is wrong?" Understand the consequences of false positives and negatives.

3. "What happens next?" Will there be a confirmatory test? A repeat test? Treatment?

4. "What if I don't get tested at all?" Understand the trade-offs of testing vs. not testing.

"The test speaks in numbers.
The patient hears in fears and hopes.
The healer's task is translation—
to bridge the gap between statistic and soul."
A test may be accurate.
But is it worth it?

What does it cost—in money,
in anxiety, in harm?
The Test-Treatment Threshold

When Is Testing Worthwhile?

Pre-Test Probability
Very Low
Below Test Threshold → Don't test, reassure
Intermediate
Testing Zone → Test will change management
Very High
Above Treat Threshold → Don't test, treat
THE PRINCIPLE
Test only when the result will change what you do. If you'd treat regardless, or not treat regardless—why test?
GRADE Evidence Quality

Grading DTA Evidence

⊕⊕⊕⊕
HIGH

Multiple high-quality studies, consistent results, directly applicable

⊕⊕⊕○
MODERATE

Some limitations in study quality, consistency, or applicability

⊕⊕○○
LOW

Serious limitations—may need to downgrade recommendations

⊕○○○
VERY LOW

Very serious limitations—evidence uncertain

Cost-Consequence Analysis

Example: Universal vs. Targeted Screening

Cost per case detected (universal)
$50,000
Cost per case detected (high-risk only)
$5,000
Cases missed by targeted approach
~10%
False positives avoided by targeted
~90%
Which approach is right for your population?
"A test is not just accurate or inaccurate.
It has costs—in money, in worry, in harm.
The wise clinician weighs all of these—
and tests only when testing serves the patient."
The SROC curve shows where the test performs.

But how certain are we?
And how much will it vary in practice?
Confidence vs. Prediction Regions

Two Types of Uncertainty

95% CI (summary estimate)
95% Prediction (future studies)
What Each Region Tells You
CI

Confidence Region (smaller ellipse)

Where we're 95% confident the true average sensitivity/specificity lies. Uncertainty about the summary estimate.

PI

Prediction Region (larger ellipse)

Where we expect 95% of future studies to fall. Accounts for heterogeneity between studies.

CLINICAL IMPLICATION
If the prediction region is large, the test may perform very differently in your setting than the average suggests. Wide prediction = high heterogeneity = investigate sources.
Bivariate Model Interpretation

Reading Meta-Analysis Results

Summary Sens/Spec
Check Regions
CI narrow, PI narrow
Consistent → Trust the average
CI narrow, PI wide
Heterogeneous → The average may not apply
CI wide
Uncertain → Need more studies
"The confidence region tells you: How sure are we?
The prediction region tells you: How much will it vary?
Both questions matter—
for the test you use tomorrow may not be the average."
References

Key Sources

  1. Carreyrou J. Bad Blood. Knopf, 2018. [Theranos]
  2. CDC. MMWR. 1987;36(49):833-840. [HIV blood supply]
  3. Herbst AL et al. N Engl J Med. 1971;284:878-881. [DES]
  4. Moyer VA. Ann Intern Med. 2012;157:120-134. [PSA]
  5. Pope JH et al. N Engl J Med. 2000;342:1163-1170. [Troponin]
  6. Steingart KR et al. Cochrane Database Syst Rev. 2014;1:CD009593. [GeneXpert]
  7. Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705. [COVID RAT]
  8. Independent UK Panel on Breast Cancer Screening. Lancet. 2012;380:1778-1786. [Mammography]
  9. Jack CR et al. Lancet Neurol. 2018;17:760-773. [Amyloid]
  10. WHO. Malaria RDT Performance. 2022.
  11. Reitsma JB et al. J Clin Epidemiol. 2005;58:982-990. [Bivariate]
  12. Whiting PF et al. Ann Intern Med. 2011;155:529-536. [QUADAS-2]
  13. Bolton-Maggs PHB. Transfus Med. 2016;26:303-311.
A test is 99% sensitive and 99% specific. Disease prevalence is 1/1000. A patient tests positive. What is the probability they have the disease?
99%
90%
About 9%
50%
What does "SnNout" mean?
A highly Sensitive test, when Negative, rules OUT disease
A highly Specific test, when Negative, rules OUT disease
Sensitivity should be used for screening
Specificity should be above 90%
Why did the blood supply become contaminated with HIV despite testing?
The tests had low specificity
Tests had a window period with zero sensitivity in early infection
The tests were not performed correctly
The tests were too expensive
Which QUADAS-2 domain assesses whether the test was interpreted without knowing the diagnosis?
Patient Selection
Index Test
Reference Standard
Flow and Timing
Course Complete
"Now you know the four outcomes,
the two virtues of a test,
the fallacy of the base rate,
the art of pooling evidence,
and the biases that hide the truth.

When the next test lies to you—
you will know."