Have you not heard the tale of the woman
who promised to change the world with a drop of blood,
who built a nine-billion-dollar company on a test that never worked?
Palo Alto, 2003
STANFORD UNIVERSITY
A nineteen-year-old dropped out of college with a vision: a device that could run hundreds of blood tests from a single drop, pricked painlessly from a finger.

No more needles. No more vials. No more waiting.

Investors believed. Walgreens believed. The Pentagon believed.

They poured in more than $700 million and valued her company at $9 billion.
Carreyrou J. Bad Blood: Secrets and Lies in a Silicon Valley Startup. 2018
But the tests lied.
$9 Billion
A valuation built on a test that didn't work
WHAT HAPPENED
The Theranos machines gave wrong results. Patients were told they had HIV when they didn't. Patients were told their blood was normal when they were dying. The company hid the truth for more than a decade.
The Victims
PHOENIX, ARIZONA
A pregnant woman was told her baby would be born with a chromosomal abnormality. She was preparing for the worst. She was grieving.

The test was wrong. The baby was healthy.

But how many women, receiving the same news, made different decisions?
THE QUESTION
How do we know when a test is telling the truth? How do we measure a test's honesty?
"And the test lied,
and the lie was dressed in certainty,
and no one questioned the numbers."

This is why we study Diagnostic Test Accuracy.

When a test gives a result,
there are only four possible truths.

Two are blessings. Two are curses.
The Tree of Outcomes

Every Test Result Has a Reality Behind It

Patient Tested
  Test Says: POSITIVE ("You have it")
    Has disease → TP (True Positive)
    No disease → FP (False Positive)
  Test Says: NEGATIVE ("You're clear")
    Has disease → FN (False Negative)
    No disease → TN (True Negative)
The Four Truths

True Positive (TP)

Sick person correctly identified.
The test told the truth.

False Positive (FP)

Healthy person wrongly alarmed.
The test lied.

False Negative (FN)

Sick person wrongly reassured.
The deadliest lie.

True Negative (TN)

Healthy person correctly cleared.
The test told the truth.

The Sacred Table

The 2x2 Confusion Matrix

                  Disease Present        Disease Absent
Test Positive     TP (True Positive)     FP (False Positive)
Test Negative     FN (False Negative)    TN (True Negative)
REMEMBER
Every diagnostic study must report these four numbers. Without them, you cannot judge a test.
"Two outcomes save. Two outcomes harm.
Know them by name.
TP, TN: the test spoke true.
FP, FN: the test lied."
A test has two virtues and two vices.

Sensitivity asks: Can it find the sick?

Specificity asks: Can it spare the healthy?
Sensitivity
"Of all the sick, how many did the test catch?"
THE FORMULA
Sensitivity = TP / (TP + FN)
True Positives divided by All Diseased
IN SIMPLE WORDS
If 100 people have cancer, and the test finds 90 of them, sensitivity = 90%.

High sensitivity = few false negatives = few missed cases.
Specificity
"Of all the healthy, how many did the test spare?"
THE FORMULA
Specificity = TN / (TN + FP)
True Negatives divided by All Non-Diseased
IN SIMPLE WORDS
If 100 people are healthy, and the test correctly says "negative" for 95 of them, specificity = 95%.

High specificity = few false positives = few false alarms.
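IN CODE
A minimal sketch of both formulas, reusing the worked numbers above (illustrative counts, not study data):

```python
# Sensitivity and specificity from the four cells of the 2x2 table.
# Counts reuse the worked examples above: 100 diseased people (90 caught),
# 100 healthy people (95 correctly cleared).

TP, FN = 90, 10   # the diseased: caught vs. missed
TN, FP = 95, 5    # the healthy: cleared vs. falsely alarmed

sensitivity = TP / (TP + FN)   # of all the sick, how many did the test catch?
specificity = TN / (TN + FP)   # of all the healthy, how many did it spare?

print(f"Sensitivity: {sensitivity:.0%}")   # 90%
print(f"Specificity: {specificity:.0%}")   # 95%
```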
The Cruel Trade-off
THE DILEMMA
A test cannot have both perfect sensitivity AND perfect specificity.

Lower the threshold to catch more sick people? You'll alarm more healthy people.

Raise the threshold to spare healthy people? You'll miss more sick people.

This is the threshold effect—the seesaw of diagnosis.
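IN CODE
A small sketch of the seesaw, using made-up biomarker values and hypothetical cutoffs (nothing here comes from a real study): lowering the cutoff raises sensitivity and lowers specificity, and raising it does the reverse.

```python
# The threshold seesaw on invented biomarker values (purely illustrative).
# Diseased patients tend to have higher values, but the groups overlap,
# so every cutoff trades sensitivity against specificity.

diseased = [4.2, 5.1, 5.8, 6.3, 7.0, 7.7, 8.4, 9.1]   # hypothetical values
healthy  = [2.1, 2.9, 3.4, 3.8, 4.5, 5.0, 5.6, 6.2]   # hypothetical values

def sens_spec(cutoff):
    tp = sum(x >= cutoff for x in diseased)   # sick people flagged positive
    tn = sum(x < cutoff for x in healthy)     # healthy people cleared
    return tp / len(diseased), tn / len(healthy)

for cutoff in (4.0, 5.5, 7.0):   # low, middle, high thresholds
    sens, spec = sens_spec(cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.0%}, specificity {spec:.0%}")
# cutoff 4.0: sensitivity 100%, specificity 50%
# cutoff 5.5: sensitivity  75%, specificity 75%
# cutoff 7.0: sensitivity  50%, specificity 100%
```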
The Memory Rules
Sn

SnNout: Sensitive tests rule OUT

A highly sensitive test, when negative, rules out disease. If it didn't find it, it's probably not there.

Sp

SpPin: Specific tests rule IN

A highly specific test, when positive, rules in disease. If it says you have it, you probably do.

REMEMBER
SnNout: Sensitive Negative rules OUT
SpPin: Specific Positive rules IN
"Sensitivity catches the sick.
Specificity spares the well.
But no test masters both perfectly—
This is the burden we must bear."
In the year 2020, when pestilence swept the earth,
the world needed a test that could find the infected quickly.

But what if the rapid test missed too many?
The Promise of Rapid Testing
15 min
Results Time
$5
Cost per Test
Home
Self-Administered
THE HOPE
Rapid antigen tests could be used at home, at work, before gatherings. They could stop outbreaks before they started.
The Reality
COCHRANE SYSTEMATIC REVIEW, 2022
When researchers pooled data from 155 studies on COVID-19 rapid antigen tests:

In people WITH symptoms:
Sensitivity: 73% (missed 27% of cases)

In people WITHOUT symptoms:
Sensitivity: 55% (missed 45% of cases)

Nearly half of infected asymptomatic people were told they were clear.
Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705
The False Negatives Spread
1

Thanksgiving Dinners

Families tested negative in the morning, gathered indoors, unknowingly infected grandparents

2

Workplace Outbreaks

Workers tested negative, came to work, infected colleagues in the break room

3

Hospital Transmission

Patients tested negative, admitted to wards, infected vulnerable patients

THE LESSON
A sensitivity of 55% means a 45% false-negative rate among the infected. False negatives are invisible. They feel safe. They spread disease.
"And the test said 'negative,'
and the family gathered,
and the grandfather embraced his grandchildren,
and by winter's end, he was gone."
Sensitivity and specificity describe the test.

But the patient asks a different question:

"I tested positive. What are my chances?"
The Disconnect
A DOCTOR'S DILEMMA
A test has 99% sensitivity and 95% specificity. Sounds excellent.

Your patient tests positive for a rare disease (prevalence 1 in 1000).

Question: What is the probability they actually have the disease?

Most doctors say 95%. The real answer? About 2%.
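IN CODE
The answer is just Bayes' theorem. A short sketch with the exact numbers from the dilemma above shows where "about 2%" comes from:

```python
# Positive predictive value for a rare disease:
# sensitivity 99%, specificity 95%, prevalence 1 in 1,000.

sensitivity = 0.99
specificity = 0.95
prevalence  = 1 / 1000

p_pos_if_disease    = sensitivity       # chance of a positive when disease is present
p_pos_if_no_disease = 1 - specificity   # chance of a false alarm when it is absent

# Bayes: P(disease | positive) =
#   P(+|D) * P(D)  /  [ P(+|D) * P(D)  +  P(+|no D) * P(no D) ]
ppv = (p_pos_if_disease * prevalence) / (
    p_pos_if_disease * prevalence + p_pos_if_no_disease * (1 - prevalence)
)

print(f"Probability of disease after a positive test: {ppv:.1%}")   # about 1.9%
```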
Likelihood Ratios
How much a test result changes the odds
POSITIVE LIKELIHOOD RATIO
LR+ = Sensitivity / (1 - Specificity)
How much more likely is a positive result in sick vs healthy?
NEGATIVE LIKELIHOOD RATIO
LR- = (1 - Sensitivity) / Specificity
How much less likely is a negative result in sick vs healthy? (Values below 1 argue against disease.)
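IN CODE
A minimal sketch of both ratios, plus the odds arithmetic that turns a pre-test probability into a post-test probability. The 90%/95% test and the 30% pre-test probability are assumed numbers for illustration only:

```python
# Likelihood ratios, and how a result moves pre-test odds to post-test odds.

sensitivity, specificity = 0.90, 0.95   # assumed test characteristics

lr_pos = sensitivity / (1 - specificity)   # LR+ = 18 (strong rule-in)
lr_neg = (1 - sensitivity) / specificity   # LR- ~ 0.11 (near strong rule-out)

def post_test_probability(pre_test_prob, lr):
    pre_odds  = pre_test_prob / (1 - pre_test_prob)   # probability -> odds
    post_odds = pre_odds * lr                         # odds updated by the LR
    return post_odds / (1 + post_odds)                # odds -> probability

pre = 0.30   # assumed pre-test probability for this particular patient
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
print(f"After a positive result: {post_test_probability(pre, lr_pos):.0%}")   # ~89%
print(f"After a negative result: {post_test_probability(pre, lr_neg):.0%}")   # ~4%
```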
What Good LRs Look Like
LR+ > 10
Strong rule-in
LR+ 5-10
Moderate rule-in
LR+ 2-5
Weak evidence
LR- < 0.1
Strong rule-out
LR- 0.1-0.2
Moderate rule-out
LR- 0.2-0.5
Weak evidence
"Sensitivity tells how many sick the test will catch.
Specificity tells how many well it will spare.
But only the likelihood ratio answers:
What does this result mean for THIS patient?"
Have you not heard the controversy of the screening
that found too much?

When does finding disease become causing harm?
The Dream of Early Detection
1970s - PRESENT
The logic was beautiful: find cancer early, treat it early, save lives.

Mammography could detect tumors too small to feel.

Women were told: "Annual mammograms save lives."

But what if some of those "cancers" would never have killed?
The Curse of Overdiagnosis
19-30%
of screen-detected breast cancers may be overdiagnosed
WHAT IS OVERDIAGNOSIS?
Finding a "cancer" that would never have caused symptoms or death during the person's lifetime.

The woman is diagnosed, treated with surgery, radiation, chemotherapy— for a disease that would never have harmed her.

Independent UK Panel on Breast Cancer Screening. Lancet. 2012;380:1778-1786

For Every 10,000 Women Screened
3-4
Lives saved
from breast cancer
~15
Overdiagnosed
(treated unnecessarily)
~500
False alarms
(anxiety, biopsies)
THE TRADE-OFF
To save 3-4 lives, you must accept that ~15 women will be treated for cancers that would never have harmed them, and ~500 women will experience false alarms.

Is this a good trade? The answer depends on values, not just numbers.
"And the test found what was hidden,
and called it disease,
and the woman was cut and burned and poisoned—
for a shadow that would never have darkened her days."

This is the problem of overdiagnosis.

One study may lie. One study may flatter.

But when you gather all the studies,
when you weigh their evidence together—

The truth becomes harder to hide.
Why Pool the Evidence?
1

More Precision

Combining studies gives narrower confidence intervals, reducing uncertainty

2

Detect Heterogeneity

Why do different studies give different answers? Setting? Population? Threshold?

3

Expose Publication Bias

Are negative studies being hidden? Funnel plots reveal asymmetry

4

Explore Thresholds

Build SROC curves to understand the sensitivity-specificity trade-off

The Bivariate Model
WHY IT'S SPECIAL
You cannot pool sensitivity and specificity separately.

They are correlated: when one goes up, the other tends to go down (the threshold effect).

The bivariate model accounts for this correlation, giving valid pooled estimates.

Reitsma JB et al. J Clin Epidemiol. 2005;58:982-990

The Summary ROC Curve

[Figure: ROC space. Each dot is one study; the summary curve traces the trade-off. x-axis: 1 - Specificity (false positive rate); y-axis: Sensitivity. The perfect test sits at the top-left corner; the diagonal chance line marks a useless test.]
READING THE SROC
Top-left corner = perfect test (100% sens, 100% spec)
Diagonal line = useless test (random guessing)
The curve = summary of all studies' performance
"One study may deceive. Many studies, weighed together,
begin to reveal the truth.
The SROC curve is the path of evidence—
showing what the test can truly do."
But what if the studies disagree?

One study says sensitivity is 95%.
Another says 60%.

Which truth do you believe?
Heterogeneity: The Disagreement
DEFINITION
Heterogeneity is the variation between studies that cannot be explained by chance alone.

High heterogeneity means the studies are measuring different things— or the test performs differently in different settings.
Why Studies Disagree
T

Threshold Differences

Different cutoffs for "positive" result (e.g., different HbA1c thresholds for diabetes)

P

Population Differences

Disease severity, age, comorbidities differ between studies

S

Setting Differences

Primary care vs. specialist clinic vs. emergency room

Q

Quality Differences

Risk of bias, verification bias, spectrum bias

Measuring the Disagreement
I² < 25%
Low heterogeneity
Studies agree
I² 25-75%
Moderate
Some disagreement
I² > 75%
High heterogeneity
Major disagreement
THE WARNING
When I² > 75%, the pooled estimate may be meaningless.

You cannot average apples and oranges. You must explain why studies differ before pooling them.
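IN CODE
A minimal sketch of how I² is computed from Cochran's Q (Higgins & Thompson, 2002). The study estimates and variances below are invented for illustration; in a DTA review they would typically be logit-transformed sensitivities or specificities:

```python
# I-squared from Cochran's Q, using inverse-variance weights.

estimates = [1.8, 2.3, 0.9, 2.9, 1.2]        # hypothetical logit-sensitivities
variances = [0.05, 0.08, 0.06, 0.10, 0.07]   # hypothetical within-study variances

weights = [1 / v for v in variances]          # inverse-variance weights
pooled  = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

# Cochran's Q: weighted squared deviations from the pooled estimate
Q  = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
df = len(estimates) - 1

i_squared = max(0.0, (Q - df) / Q) * 100      # Higgins & Thompson (2002)

print(f"Q = {Q:.1f} on {df} df, I-squared = {i_squared:.0f}%")   # high heterogeneity
```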
"When the studies disagree,
do not silence the dissent.
Ask: Why do they see differently?
The disagreement itself teaches."
Your DTA Toolkit
The essential measures and when to use them
Essential Metrics
1

Sensitivity & Specificity

How well the test performs on sick vs. healthy people

2

Likelihood Ratios (LR+, LR-)

How much a result changes the probability of disease

3

Diagnostic Odds Ratio (DOR)

Single measure of test discrimination (DOR = LR+ / LR-); a worked sketch follows this list

4

Area Under the SROC Curve (AUC)

Overall test performance across all thresholds (0.5 = useless, 1.0 = perfect)
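IN CODE
The DOR is the cross-product of the 2x2 table, and it equals LR+ divided by LR-. A short sketch with made-up counts confirms the identity:

```python
# Diagnostic odds ratio from a 2x2 table, and the identity DOR = LR+ / LR-.
# The counts are illustrative, not from any real study.

TP, FP, FN, TN = 90, 5, 10, 95

dor = (TP * TN) / (FP * FN)                   # cross-product of the table

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
lr_pos = sensitivity / (1 - specificity)
lr_neg = (1 - sensitivity) / specificity

print(f"DOR       = {dor:.0f}")               # 171
print(f"LR+ / LR- = {lr_pos / lr_neg:.0f}")   # same value: 171
```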

Tools of the Trade
mada
R package for
bivariate meta-analysis
metaDTA
Stata module
for DTA reviews
DTA Pro
Browser-based
open access tool
THE STANDARD REFERENCES
Reitsma et al. 2005 - Bivariate model
Rutter & Gatsonis 2001 - HSROC model
Cochrane Handbook Ch. 10 - DTA methods
Before You Trust a DTA Study

Was there a valid reference standard?

Gold standard test applied to all patients?

Were interpreters blinded?

Test readers unaware of diagnosis, and vice versa?

Was the spectrum appropriate?

Patients similar to your clinical population?

Was the threshold pre-specified?

Or was it chosen after the fact to make the accuracy look best?

"Armed with sensitivity, specificity, likelihood,
armed with the SROC and the measure of agreement,
you can see through the lie of the test—
and judge its truth for yourself."
Have you considered what happens when a test promises everything?

When a machine claims to see what no other machine can see,
and no one asks: "Show me the proof"?
The Theranos Mirage
REAL DATA: FDA INSPECTION FINDINGS
Theranos claimed: 200+ tests from a single finger prick

FDA found:
• Results varied by 146% between runs on the same sample
• Edison machines failed 87% of proficiency tests
• Zero peer-reviewed validation studies published
• Patients received HIV-positive results for samples that were negative

Sources: FDA Warning Letter 2016; Carreyrou J. Bad Blood. 2018; CMS Inspection Reports.

The Decision Tree
You are a hospital administrator. Theranos offers you a contract.

What Do You Choose?

Theranos Contract Offered
PATH A: Trust Marketing Claims
Sign the Contract
Misdiagnose thousands
Face lawsuits
Harm patients
PATH B: Demand Validation Data
Find No Peer-Reviewed Studies
Reject the Contract
Protect Your Patients
Avoid Scandal
THE REVELATION
Elizabeth Holmes never validated a single test externally.

A $9 billion valuation became a criminal fraud conviction.

Every hospital that demanded validation data before signing
was protected from the lie.

Every hospital that trusted the marketing
became complicit in harming patients.
THE LESSON
No peer-reviewed validation = No trust.
The absence of evidence is not a marketing problem.
It is a patient safety emergency.
When speed defeats accuracy,
who pays the price?

The test result comes in 15 minutes.
But what if the result is 15 minutes of false confidence?
The COVID Rapid Test Chaos
REAL DATA: MANUFACTURER VS. REALITY
Manufacturer claims: 97% sensitivity

Real-world performance (Cochrane 2022):
• Symptomatic individuals: 73% sensitivity (missed 27%)
• Asymptomatic individuals: 55% sensitivity (missed 45%)
• Early infection (days 0-3): ~50% sensitivity

Nearly half of infected asymptomatic people were told they were "clear."

Source: Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705

The Decision Tree
You are a school nurse during a COVID surge. A symptomatic teacher tests negative on rapid test.

What Do You Choose?

Symptomatic Teacher + Negative Rapid Test
PATH A: Trust the Negative Result
Send Teacher to Class
Outbreak infects 30 students
School closure
Three hospitalizations
PATH B: Remember Sensitivity Limits
Require PCR Confirmation
PCR positive confirmed
Teacher isolates
Outbreak prevented
THE REVELATION
A negative test does not mean no infection.

It means: "not detected."

The difference between these two phrases
is measured in lives.
THE LESSON
When sensitivity is only 55-73%, a negative result in a symptomatic person
is almost meaningless.

SnNout only works when sensitivity is HIGH.
Know your test's limits before trusting its verdict.
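IN CODE
A sketch of why SnNout fails here: the post-test probability after a negative result, at the review's two sensitivities versus a genuinely sensitive test. The 40% pre-test probability for a symptomatic person during a surge and the 99% specificity are assumptions for illustration, not Cochrane figures:

```python
# How reassuring is a negative result? Post-test probability after a negative
# test, for a symptomatic person with an assumed 40% pre-test probability.

def prob_after_negative(pre_test_prob, sensitivity, specificity=0.99):
    lr_neg = (1 - sensitivity) / specificity        # negative likelihood ratio
    pre_odds  = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr_neg
    return post_odds / (1 + post_odds)

pre = 0.40   # assumed pre-test probability during a surge

for sens in (0.55, 0.73, 0.98):   # asymptomatic-like, symptomatic-like, truly sensitive
    p = prob_after_negative(pre, sens)
    print(f"sensitivity {sens:.0%}: a negative result still leaves a {p:.0%} chance of infection")
# 55% -> ~23%, 73% -> ~15%, 98% -> ~1%
```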
Can a test that finds cancer
still cause harm?

What if the cancer it finds
would never have hurt you?
The Mammography Controversy
REAL DATA: THE SCREENING PARADOX
Test characteristics:
Sensitivity: ~85% | Specificity: ~90%

For 1,000 women screened annually for 10 years:
1 death prevented from breast cancer
5 women overtreated for cancers that would never have harmed them
100-500 false alarms leading to biopsies, anxiety, repeat imaging

Overdiagnosis rate: 19-30% of screen-detected cancers

Source: Independent UK Panel on Breast Cancer Screening. Lancet. 2012;380:1778-1786

The Decision Tree
A 45-year-old woman asks your advice on mammography screening.

What Do You Choose?

Patient Asks: "Should I Get Annual Mammograms?"
PATH A: Recommend Without Context
Small Tumor Found
Mastectomy performed
Tumor was indolent (DCIS)
Would never have harmed her
PATH B: Explain Trade-offs
Shared Decision-Making
Patient makes informed choice
Understands benefits AND harms
Autonomy preserved
THE REVELATION
Test accuracy is not the only metric.

A test can be accurate and still cause harm.

When overdiagnosis exceeds lives saved,
we must ask: Is finding always helping?
THE LESSON
The harm from false positives and overdiagnosis
can exceed the benefit from true positives.

Always weigh benefits against harms.
Screening is not always saving.
What if finding the disease
is worse than missing it?

What if the treatment causes more suffering
than the disease ever would?
The PSA Paradox
REAL DATA: THE THRESHOLD TRAP
PSA cutoff of 4.0 ng/mL:
• Sensitivity for detecting prostate cancer: 21%
• Detects many indolent cancers that would never harm

Lower cutoff to 2.5 ng/mL:
• Sensitivity rises to: 40%
• But overdiagnosis doubles

Treatment consequences:
• 20-30% of men experience incontinence after prostatectomy
• 30-70% experience erectile dysfunction

Source: US Preventive Services Task Force. JAMA. 2018;319(18):1901-1913

The Decision Tree
You are setting PSA screening policy for a health system. Three paths lie before you.

What Threshold Do You Choose?

PSA Screening Policy Decision
PATH A: Low Threshold (2.5)
Maximize sensitivity
Thousands of unnecessary
biopsies and treatments
PATH B: High Threshold (4.0)
Miss some cancers
But most missed are indolent
Fewer unnecessary treatments
PATH C: Abandon PSA Screening
Miss aggressive cancers
Some preventable deaths
No overtreatment harm
THE REVELATION
There is no "right" cutoff.

Every threshold trades sensitivity for specificity,
detection for overdiagnosis.

The choice is not medical. It is ethical.
It depends on what harms you are willing to accept.
THE LESSON
The threshold effect is not a technical problem.
It is a values problem.

Before choosing a cutoff, ask:
What is worse: missing disease or overtreating the healthy?
The same test. The same cutoff.
Different truths.

How can identical numbers
mean opposite things?
The TB Test in Two Populations
REAL DATA: SAME TEST, DIFFERENT MEANING
Tuberculin Skin Test (10mm cutoff)
Sensitivity: ~80% | Specificity: ~95%

In high-prevalence setting (TB prevalence 10%):
• Positive Predictive Value: ~64%
• A positive test usually means TB

In low-prevalence setting (TB prevalence 0.1%):
• Positive Predictive Value: ~2%
• A positive test is usually a false positive

Source: Pai M et al. Lancet Infect Dis. 2014;14(8):765-773
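IN CODE
Both predictive values follow directly from the stated sensitivity and specificity once the prevalence is fixed; a short sketch makes the dependence explicit:

```python
# PPV of the same test (sensitivity 80%, specificity 95%) in two populations.
# Only the prevalence changes; the test itself does not.

sensitivity, specificity = 0.80, 0.95

def ppv(prevalence):
    true_pos  = sensitivity * prevalence               # infected and test-positive
    false_pos = (1 - specificity) * (1 - prevalence)   # healthy but test-positive
    return true_pos / (true_pos + false_pos)

for label, prev in [("high-prevalence setting", 0.10),
                    ("low-prevalence setting ", 0.001)]:
    print(f"{label} (prevalence {prev:.1%}): PPV = {ppv(prev):.0%}")
# high-prevalence: PPV ~64%   low-prevalence: PPV ~2%
```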

The Decision Tree
You are a physician in London seeing an immigrant from a high-TB country. TST is positive.

What Do You Conclude?

Positive TST in Immigrant Patient
PATH A: Apply UK Population PPV
Assume "Probably False Positive"
Miss active TB
Patient infects family
Delays diagnosis for months
PATH B: Consider Patient's Pre-Test Probability
Recognize PPV Depends on Prevalence
Investigate further
Chest X-ray, sputum
Treat early if confirmed
THE REVELATION
Sensitivity and specificity are properties of the test.

PPV and NPV are properties of the population.

The same result means different things
in different people.
THE LESSON
Never interpret a test result without knowing the pre-test probability.

A positive test in a high-risk patient usually means disease.
The same positive in a low-risk patient means probably nothing.

Context is everything.
Five Stories, Five Lessons
1

Theranos: Demand Validation

No peer-reviewed data = no trust, regardless of marketing claims

2

COVID Rapid Tests: Know Sensitivity Limits

"Not detected" is not the same as "not infected"

3

Mammography: Weigh Benefits vs. Harms

Finding is not always helping; overdiagnosis causes real harm

4

PSA: The Threshold is a Values Choice

Every cutoff trades sensitivity for specificity; there is no "right" answer

5

TB Test: Context Determines Meaning

The same result means different things in different populations

References

Key Sources Cited in This Course

  1. Carreyrou J. Bad Blood: Secrets and Lies in a Silicon Valley Startup. Knopf, 2018.
  2. Dinnes J, et al. Rapid, point-of-care antigen tests for diagnosis of SARS-CoV-2 infection. Cochrane Database Syst Rev. 2022;7:CD013705.
  3. Independent UK Panel on Breast Cancer Screening. The benefits and harms of breast cancer screening. Lancet. 2012;380:1778-1786.
  4. Reitsma JB, et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol. 2005;58:982-990.
  5. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med. 2001;20:2865-2884.
  6. Deeks JJ, et al. The performance of tests of publication bias in systematic reviews of diagnostic test accuracy. J Clin Epidemiol. 2005;58:882-893.
  7. Macaskill P, et al. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy. Chapter 10. 2023.
  8. Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21:1539-1558.
  9. US Food and Drug Administration. Warning Letter to Theranos Inc. 2016.
  10. US Preventive Services Task Force. Screening for Prostate Cancer. JAMA. 2018;319(18):1901-1913.
  11. Pai M, et al. Tuberculosis. Lancet Infect Dis. 2014;14(8):765-773.
In the Theranos story, what was the fundamental problem with their blood tests?
They were too expensive
They gave wrong results that could harm patients
They were too slow
They required too much blood
A COVID rapid test has 55% sensitivity in asymptomatic people. What does this mean?
55% of positive results are correct
The test works 55% of the time
The test misses 45% of infected asymptomatic people
55% of negative results are wrong
What does "SnNout" mean?
A highly Sensitive test, when Negative, rules OUT disease
A highly Specific test, when Negative, rules OUT disease
A sensitive test should be used for screening
Sensitivity and specificity are negatively correlated
Why can't you pool sensitivity and specificity separately in meta-analysis?
They use different scales
They are correlated due to the threshold effect
They come from different populations
It's computationally too difficult
Course Complete
"Now you know the four outcomes,
the two virtues of a test,
the cruel trade-off of the threshold,
and the art of pooling evidence.

When the next test lies to you—
you will know how to see through it."

When the Test Lies — Now You Know.