Have you not heard the tale of the woman
who promised to change the world with a drop of blood,
who built a nine-billion-dollar company on a test that never worked?
Palo Alto, 2003
STANFORD UNIVERSITY
A nineteen-year-old dropped out of college with a vision: a device that could run hundreds of blood tests from a single drop, pricked painlessly from a finger.

No more needles. No more vials. No more waiting.

Investors believed. Walgreens believed. The Pentagon believed.

They poured in more than $700 million and valued her company at $9 billion.
Carreyrou J. Bad Blood: Secrets and Lies in a Silicon Valley Startup. 2018
But the tests lied.
$9 Billion
A valuation built on a test that didn't work
WHAT HAPPENED
The Theranos machines gave wrong results. Patients were told they had HIV when they didn't. Patients were told their blood was normal when they were dying. The company hid the truth for more than a decade.
The Victims
PHOENIX, ARIZONA
A pregnant woman was told her baby would be born with a chromosomal abnormality. She was preparing for the worst. She was grieving.

The test was wrong. The baby was healthy.

But how many women, receiving the same news, made different decisions?
THE QUESTION
How do we know when a test is telling the truth? How do we measure a test's honesty?
"And the test lied,
and the lie was dressed in certainty,
and no one questioned the numbers."

This is why we study Diagnostic Test Accuracy.

When a test gives a result,
there are only four possible truths.

Two are blessings. Two are curses.
The Tree of Outcomes

Every Test Result Has a Reality Behind It

Patient Tested
  Test Says: POSITIVE ("You have it")
    Has disease → TP (True Positive)
    No disease → FP (False Positive)
  Test Says: NEGATIVE ("You're clear")
    Has disease → FN (False Negative)
    No disease → TN (True Negative)
The Four Truths

True Positive (TP)

Sick person correctly identified.
The test told the truth.

False Positive (FP)

Healthy person wrongly alarmed.
The test lied.

False Negative (FN)

Sick person wrongly reassured.
The deadliest lie.

True Negative (TN)

Healthy person correctly cleared.
The test told the truth.

The Sacred Table

The 2x2 Confusion Matrix

                  Disease Present        Disease Absent
Test Positive     TP (True Positive)     FP (False Positive)
Test Negative     FN (False Negative)    TN (True Negative)
REMEMBER
Every diagnostic study must report these four numbers. Without them, you cannot judge a test.
"Two outcomes save. Two outcomes harm.
Know them by name.
TP, TN: the test spoke true.
FP, FN: the test lied."
A test has two virtues and two vices.

Sensitivity asks: Can it find the sick?

Specificity asks: Can it spare the healthy?
Sensitivity
"Of all the sick, how many did the test catch?"
THE FORMULA
Sensitivity = TP / (TP + FN)
True Positives divided by All Diseased
IN SIMPLE WORDS
If 100 people have cancer, and the test finds 90 of them, sensitivity = 90%.

High sensitivity = few false negatives = few missed cases.
Specificity
"Of all the healthy, how many did the test spare?"
THE FORMULA
Specificity = TN / (TN + FP)
True Negatives divided by All Non-Diseased
IN SIMPLE WORDS
If 100 people are healthy, and the test correctly says "negative" for 95 of them, specificity = 95%.

High specificity = few false positives = few false alarms.
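IN CODE
A minimal sketch of both formulas, reusing the worked numbers above (illustrative counts, not study data):

```python
# Sensitivity and specificity from the four cells of the 2x2 table.
# Counts reuse the worked examples above: 100 diseased people (90 caught),
# 100 healthy people (95 correctly cleared).

TP, FN = 90, 10   # the diseased: caught vs. missed
TN, FP = 95, 5    # the healthy: cleared vs. falsely alarmed

sensitivity = TP / (TP + FN)   # of all the sick, how many did the test catch?
specificity = TN / (TN + FP)   # of all the healthy, how many did it spare?

print(f"Sensitivity: {sensitivity:.0%}")   # 90%
print(f"Specificity: {specificity:.0%}")   # 95%
```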
The Cruel Trade-off
THE DILEMMA
A test cannot have both perfect sensitivity AND perfect specificity.

Lower the threshold to catch more sick people? You'll alarm more healthy people.

Raise the threshold to spare healthy people? You'll miss more sick people.

This is the threshold effect—the seesaw of diagnosis.
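IN CODE
A small sketch of the seesaw, using made-up biomarker values and hypothetical cutoffs (nothing here comes from a real study): lowering the cutoff raises sensitivity and lowers specificity, and raising it does the reverse.

```python
# The threshold seesaw on invented biomarker values (purely illustrative).
# Diseased patients tend to have higher values, but the groups overlap,
# so every cutoff trades sensitivity against specificity.

diseased = [4.2, 5.1, 5.8, 6.3, 7.0, 7.7, 8.4, 9.1]   # hypothetical values
healthy  = [2.1, 2.9, 3.4, 3.8, 4.5, 5.0, 5.6, 6.2]   # hypothetical values

def sens_spec(cutoff):
    tp = sum(x >= cutoff for x in diseased)   # sick people flagged positive
    tn = sum(x < cutoff for x in healthy)     # healthy people cleared
    return tp / len(diseased), tn / len(healthy)

for cutoff in (4.0, 5.5, 7.0):   # low, middle, high thresholds
    sens, spec = sens_spec(cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.0%}, specificity {spec:.0%}")
# cutoff 4.0: sensitivity 100%, specificity 50%
# cutoff 5.5: sensitivity  75%, specificity 75%
# cutoff 7.0: sensitivity  50%, specificity 100%
```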
The Memory Rules
Sn

SnNout: Sensitive tests rule OUT

A highly sensitive test, when negative, rules out disease. If it didn't find it, it's probably not there.

Sp

SpPin: Specific tests rule IN

A highly specific test, when positive, rules in disease. If it says you have it, you probably do.

REMEMBER
SnNout: Sensitive Negative rules OUT
SpPin: Specific Positive rules IN
"Sensitivity catches the sick.
Specificity spares the well.
But no test masters both perfectly—
This is the burden we must bear."
In the year 2020, when pestilence swept the earth,
the world needed a test that could find the infected quickly.

But what if the rapid test missed too many?
The Promise of Rapid Testing
15 min
Results Time
$5
Cost per Test
Home
Self-Administered
THE HOPE
Rapid antigen tests could be used at home, at work, before gatherings. They could stop outbreaks before they started.
The Reality
COCHRANE SYSTEMATIC REVIEW, 2022
When researchers pooled data from 155 studies on COVID-19 rapid antigen tests:

In people WITH symptoms:
Sensitivity: 73% (missed 27% of cases)

In people WITHOUT symptoms:
Sensitivity: 55% (missed 45% of cases)

Nearly half of infected asymptomatic people were told they were clear.
Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705
The False Negatives Spread
1

Thanksgiving Dinners

Families tested negative in the morning, gathered indoors, unknowingly infected grandparents

2

Workplace Outbreaks

Workers tested negative, came to work, infected colleagues in the break room

3

Hospital Transmission

Patients tested negative, admitted to wards, infected vulnerable patients

THE LESSON
A sensitivity of 55% means a 45% false-negative rate among the infected. False negatives are invisible. They feel safe. They spread disease.
"And the test said 'negative,'
and the family gathered,
and the grandfather embraced his grandchildren,
and by winter's end, he was gone."
Sensitivity and specificity describe the test.

But the patient asks a different question:

"I tested positive. What are my chances?"
The Disconnect
A DOCTOR'S DILEMMA
A test has 99% sensitivity and 95% specificity. Sounds excellent.

Your patient tests positive for a rare disease (prevalence 1 in 1000).

Question: What is the probability they actually have the disease?

Most doctors say 95%. The real answer? About 2%.
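IN CODE
The answer is just Bayes' theorem. A short sketch with the exact numbers from the dilemma above shows where "about 2%" comes from:

```python
# Positive predictive value for a rare disease:
# sensitivity 99%, specificity 95%, prevalence 1 in 1,000.

sensitivity = 0.99
specificity = 0.95
prevalence  = 1 / 1000

p_pos_if_disease    = sensitivity       # chance of a positive when disease is present
p_pos_if_no_disease = 1 - specificity   # chance of a false alarm when it is absent

# Bayes: P(disease | positive) =
#   P(+|D) * P(D)  /  [ P(+|D) * P(D)  +  P(+|no D) * P(no D) ]
ppv = (p_pos_if_disease * prevalence) / (
    p_pos_if_disease * prevalence + p_pos_if_no_disease * (1 - prevalence)
)

print(f"Probability of disease after a positive test: {ppv:.1%}")   # about 1.9%
```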
Likelihood Ratios
How much a test result changes the odds
POSITIVE LIKELIHOOD RATIO
LR+ = Sensitivity / (1 - Specificity)
How much more likely is a positive result in sick vs healthy?
NEGATIVE LIKELIHOOD RATIO
LR- = (1 - Sensitivity) / Specificity
How much less likely is a negative result in sick vs healthy? (Values below 1 argue against disease.)
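IN CODE
A minimal sketch of both ratios, plus the odds arithmetic that turns a pre-test probability into a post-test probability. The 90%/95% test and the 30% pre-test probability are assumed numbers for illustration only:

```python
# Likelihood ratios, and how a result moves pre-test odds to post-test odds.

sensitivity, specificity = 0.90, 0.95   # assumed test characteristics

lr_pos = sensitivity / (1 - specificity)   # LR+ = 18 (strong rule-in)
lr_neg = (1 - sensitivity) / specificity   # LR- ~ 0.11 (near strong rule-out)

def post_test_probability(pre_test_prob, lr):
    pre_odds  = pre_test_prob / (1 - pre_test_prob)   # probability -> odds
    post_odds = pre_odds * lr                         # odds updated by the LR
    return post_odds / (1 + post_odds)                # odds -> probability

pre = 0.30   # assumed pre-test probability for this particular patient
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
print(f"After a positive result: {post_test_probability(pre, lr_pos):.0%}")   # ~89%
print(f"After a negative result: {post_test_probability(pre, lr_neg):.0%}")   # ~4%
```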
What Good LRs Look Like
LR+ > 10
Strong rule-in
LR+ 5-10
Moderate rule-in
LR+ 2-5
Weak evidence
LR- < 0.1
Strong rule-out
LR- 0.1-0.2
Moderate rule-out
LR- 0.2-0.5
Weak evidence
"Sensitivity tells how many sick the test will catch.
Specificity tells how many well it will spare.
But only the likelihood ratio answers:
What does this result mean for THIS patient?"
Have you not heard the controversy of the screening
that found too much?

When does finding disease become causing harm?
The Dream of Early Detection
1970s - PRESENT
The logic was beautiful: find cancer early, treat it early, save lives.

Mammography could detect tumors too small to feel.

Women were told: "Annual mammograms save lives."

But what if some of those "cancers" would never have killed?
The Curse of Overdiagnosis
19-30%
of screen-detected breast cancers may be overdiagnosed
WHAT IS OVERDIAGNOSIS?
Finding a "cancer" that would never have caused symptoms or death during the person's lifetime.

The woman is diagnosed, treated with surgery, radiation, chemotherapy— for a disease that would never have harmed her.

Independent UK Panel on Breast Cancer Screening. Lancet. 2012;380:1778-1786

For Every 10,000 Women Screened
3-4
Lives saved
from breast cancer
~15
Overdiagnosed
(treated unnecessarily)
~500
False alarms
(anxiety, biopsies)
THE TRADE-OFF
To save 3-4 lives, you must accept that ~15 women will be treated for cancers that would never have harmed them, and ~500 women will experience false alarms.

Is this a good trade? The answer depends on values, not just numbers.
"And the test found what was hidden,
and called it disease,
and the woman was cut and burned and poisoned—
for a shadow that would never have darkened her days."

This is the problem of overdiagnosis.

One study may lie. One study may flatter.

But when you gather all the studies,
when you weigh their evidence together—

The truth becomes harder to hide.
Why Pool the Evidence?
1

More Precision

Combining studies gives narrower confidence intervals, reducing uncertainty

2

Detect Heterogeneity

Why do different studies give different answers? Setting? Population? Threshold?

3

Expose Publication Bias

Are negative studies being hidden? Funnel plots reveal asymmetry

4

Explore Thresholds

Build SROC curves to understand the sensitivity-specificity trade-off

The Bivariate Model
WHY IT'S SPECIAL
You cannot pool sensitivity and specificity separately.

They are correlated: when one goes up, the other tends to go down (the threshold effect).

The bivariate model accounts for this correlation, giving valid pooled estimates.

Reitsma JB et al. J Clin Epidemiol. 2005;58:982-990

The Summary ROC Curve

[Figure: ROC space. Each dot is one study; the summary curve traces the trade-off. x-axis: 1 - Specificity (false positive rate); y-axis: Sensitivity. The perfect test sits at the top-left corner; the diagonal chance line marks a useless test.]
READING THE SROC
Top-left corner = perfect test (100% sens, 100% spec)
Diagonal line = useless test (random guessing)
The curve = summary of all studies' performance
"One study may deceive. Many studies, weighed together,
begin to reveal the truth.
The SROC curve is the path of evidence—
showing what the test can truly do."
But what if the studies disagree?

One study says sensitivity is 95%.
Another says 60%.

Which truth do you believe?
Heterogeneity: The Disagreement
DEFINITION
Heterogeneity is the variation between studies that cannot be explained by chance alone.

High heterogeneity means the studies are measuring different things— or the test performs differently in different settings.
Why Studies Disagree
T

Threshold Differences

Different cutoffs for "positive" result (e.g., different HbA1c thresholds for diabetes)

P

Population Differences

Disease severity, age, comorbidities differ between studies

S

Setting Differences

Primary care vs. specialist clinic vs. emergency room

Q

Quality Differences

Risk of bias, verification bias, spectrum bias

Measuring the Disagreement
I² < 25%
Low heterogeneity
Studies agree
I² 25-75%
Moderate
Some disagreement
I² > 75%
High heterogeneity
Major disagreement
THE WARNING
When I² > 75%, the pooled estimate may be meaningless.

You cannot average apples and oranges. You must explain why studies differ before pooling them.
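IN CODE
A minimal sketch of how I² is computed from Cochran's Q (Higgins & Thompson, 2002). The study estimates and variances below are invented for illustration; in a DTA review they would typically be logit-transformed sensitivities or specificities:

```python
# I-squared from Cochran's Q, using inverse-variance weights.

estimates = [1.8, 2.3, 0.9, 2.9, 1.2]        # hypothetical logit-sensitivities
variances = [0.05, 0.08, 0.06, 0.10, 0.07]   # hypothetical within-study variances

weights = [1 / v for v in variances]          # inverse-variance weights
pooled  = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

# Cochran's Q: weighted squared deviations from the pooled estimate
Q  = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
df = len(estimates) - 1

i_squared = max(0.0, (Q - df) / Q) * 100      # Higgins & Thompson (2002)

print(f"Q = {Q:.1f} on {df} df, I-squared = {i_squared:.0f}%")   # high heterogeneity
```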
"When the studies disagree,
do not silence the dissent.
Ask: Why do they see differently?
The disagreement itself teaches."
Your DTA Toolkit
The essential measures and when to use them
Essential Metrics
1

Sensitivity & Specificity

How well the test performs on sick vs. healthy people

2

Likelihood Ratios (LR+, LR-)

How much a result changes the probability of disease

3

Diagnostic Odds Ratio (DOR)

Single measure of test discrimination (DOR = LR+ / LR-); a worked sketch follows this list

4

Area Under the SROC Curve (AUC)

Overall test performance across all thresholds (0.5 = useless, 1.0 = perfect)
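IN CODE
The DOR is the cross-product of the 2x2 table, and it equals LR+ divided by LR-. A short sketch with made-up counts confirms the identity:

```python
# Diagnostic odds ratio from a 2x2 table, and the identity DOR = LR+ / LR-.
# The counts are illustrative, not from any real study.

TP, FP, FN, TN = 90, 5, 10, 95

dor = (TP * TN) / (FP * FN)                   # cross-product of the table

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
lr_pos = sensitivity / (1 - specificity)
lr_neg = (1 - sensitivity) / specificity

print(f"DOR       = {dor:.0f}")               # 171
print(f"LR+ / LR- = {lr_pos / lr_neg:.0f}")   # same value: 171
```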

Tools of the Trade
mada
R package for
bivariate meta-analysis
metaDTA
Stata module
for DTA reviews
DTA Pro
Browser-based
open access tool
THE STANDARD REFERENCES
Reitsma et al. 2005 - Bivariate model
Rutter & Gatsonis 2001 - HSROC model
Cochrane Handbook Ch. 10 - DTA methods
Before You Trust a DTA Study

Was there a valid reference standard?

Gold standard test applied to all patients?

Were interpreters blinded?

Test readers unaware of diagnosis, and vice versa?

Was the spectrum appropriate?

Patients similar to your clinical population?

Was the threshold pre-specified?

Or was it chosen after the fact to make the accuracy look best?

"Armed with sensitivity, specificity, likelihood,
armed with the SROC and the measure of agreement,
you can see through the lie of the test—
and judge its truth for yourself."
Have you considered what happens when a test promises everything?

When a machine claims to see what no other machine can see,
and no one asks: "Show me the proof"?
The Theranos Mirage
REAL DATA: FDA INSPECTION FINDINGS
Theranos claimed: 200+ tests from a single finger prick

FDA found:
• Results varied by 146% between runs on the same sample
• Edison machines failed 87% of proficiency tests
• Zero peer-reviewed validation studies published
• Patients received HIV-positive results for samples that were negative

Sources: FDA Warning Letter 2016; Carreyrou J. Bad Blood. 2018; CMS Inspection Reports.

The Decision Tree
You are a hospital administrator. Theranos offers you a contract.

What Do You Choose?

Theranos Contract Offered
PATH A: Trust Marketing Claims
Sign the Contract
Misdiagnose thousands
Face lawsuits
Harm patients
PATH B: Demand Validation Data
Find No Peer-Reviewed Studies
Reject the Contract
Protect Your Patients
Avoid Scandal
THE REVELATION
Elizabeth Holmes never validated a single test externally.

A $9 billion valuation became a criminal fraud conviction.

Every hospital that demanded validation data before signing
was protected from the lie.

Every hospital that trusted the marketing
became complicit in harming patients.
THE LESSON
No peer-reviewed validation = No trust.
The absence of evidence is not a marketing problem.
It is a patient safety emergency.
When speed defeats accuracy,
who pays the price?

The test result comes in 15 minutes.
But what if the result is 15 minutes of false confidence?
The COVID Rapid Test Chaos
REAL DATA: MANUFACTURER VS. REALITY
Manufacturer claims: 97% sensitivity

Real-world performance (Cochrane 2022):
• Symptomatic individuals: 73% sensitivity (missed 27%)
• Asymptomatic individuals: 55% sensitivity (missed 45%)
• Early infection (days 0-3): ~50% sensitivity

Nearly half of infected asymptomatic people were told they were "clear."

Source: Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705

The Decision Tree
You are a school nurse during a COVID surge. A symptomatic teacher tests negative on rapid test.

What Do You Choose?

Symptomatic Teacher + Negative Rapid Test
PATH A: Trust the Negative Result
Send Teacher to Class
Outbreak infects 30 students
School closure
Three hospitalizations
PATH B: Remember Sensitivity Limits
Require PCR Confirmation
PCR positive confirmed
Teacher isolates
Outbreak prevented
THE REVELATION
A negative test does not mean no infection.

It means: "not detected."

The difference between these two phrases
is measured in lives.
THE LESSON
When sensitivity is only 55-73%, a negative result in a symptomatic person
is almost meaningless.

SnNout only works when sensitivity is HIGH.
Know your test's limits before trusting its verdict.
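IN CODE
A sketch of why SnNout fails here: the post-test probability after a negative result, at the review's two sensitivities versus a genuinely sensitive test. The 40% pre-test probability for a symptomatic person during a surge and the 99% specificity are assumptions for illustration, not Cochrane figures:

```python
# How reassuring is a negative result? Post-test probability after a negative
# test, for a symptomatic person with an assumed 40% pre-test probability.

def prob_after_negative(pre_test_prob, sensitivity, specificity=0.99):
    lr_neg = (1 - sensitivity) / specificity        # negative likelihood ratio
    pre_odds  = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr_neg
    return post_odds / (1 + post_odds)

pre = 0.40   # assumed pre-test probability during a surge

for sens in (0.55, 0.73, 0.98):   # asymptomatic-like, symptomatic-like, truly sensitive
    p = prob_after_negative(pre, sens)
    print(f"sensitivity {sens:.0%}: a negative result still leaves a {p:.0%} chance of infection")
# 55% -> ~23%, 73% -> ~15%, 98% -> ~1%
```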
Can a test that finds cancer
still cause harm?

What if the cancer it finds
would never have hurt you?
The Mammography Controversy
REAL DATA: THE SCREENING PARADOX
Test characteristics:
Sensitivity: ~85% | Specificity: ~90%

For 1,000 women screened annually for 10 years:
1 death prevented from breast cancer
5 women overtreated for cancers that would never have harmed them
100-500 false alarms leading to biopsies, anxiety, repeat imaging

Overdiagnosis rate: 19-30% of screen-detected cancers

Source: Independent UK Panel on Breast Cancer Screening. Lancet. 2012;380:1778-1786

The Decision Tree
A 45-year-old woman asks your advice on mammography screening.

What Do You Choose?

Patient Asks: "Should I Get Annual Mammograms?"
PATH A: Recommend Without Context
Small Tumor Found
Mastectomy performed
Tumor was indolent (DCIS)
Would never have harmed her
PATH B: Explain Trade-offs
Shared Decision-Making
Patient makes informed choice
Understands benefits AND harms
Autonomy preserved
THE REVELATION
Test accuracy is not the only metric.

A test can be accurate and still cause harm.

When overdiagnosis exceeds lives saved,
we must ask: Is finding always helping?
THE LESSON
The harm from false positives and overdiagnosis
can exceed the benefit from true positives.

Always weigh benefits against harms.
Screening is not always saving.
What if finding the disease
is worse than missing it?

What if the treatment causes more suffering
than the disease ever would?
The PSA Paradox
REAL DATA: THE THRESHOLD TRAP
PSA cutoff of 4.0 ng/mL:
• Sensitivity for detecting prostate cancer: 21%
• Detects many indolent cancers that would never harm

Lower cutoff to 2.5 ng/mL:
• Sensitivity rises to: 40%
• But overdiagnosis doubles

Treatment consequences:
• 20-30% of men experience incontinence after prostatectomy
• 30-70% experience erectile dysfunction

Source: US Preventive Services Task Force. JAMA. 2018;319(18):1901-1913

The Decision Tree
You are setting PSA screening policy for a health system. Three paths lie before you.

What Threshold Do You Choose?

PSA Screening Policy Decision
PATH A: Low Threshold (2.5)
Maximize sensitivity
Thousands of unnecessary
biopsies and treatments
PATH B: High Threshold (4.0)
Miss some cancers
But most missed are indolent
Fewer unnecessary treatments
PATH C: Abandon PSA Screening
Miss aggressive cancers
Some preventable deaths
No overtreatment harm
THE REVELATION
There is no "right" cutoff.

Every threshold trades sensitivity for specificity,
detection for overdiagnosis.

The choice is not medical. It is ethical.
It depends on what harms you are willing to accept.
THE LESSON
The threshold effect is not a technical problem.
It is a values problem.

Before choosing a cutoff, ask:
What is worse: missing disease or overtreating the healthy?
The same test. The same cutoff.
Different truths.

How can identical numbers
mean opposite things?
The TB Test in Two Populations
REAL DATA: SAME TEST, DIFFERENT MEANING
Tuberculin Skin Test (10mm cutoff)
Sensitivity: ~80% | Specificity: ~95%

In high-prevalence setting (TB prevalence 10%):
• Positive Predictive Value: ~64%
• A positive test usually means TB

In low-prevalence setting (TB prevalence 0.1%):
• Positive Predictive Value: ~2%
• A positive test is usually a false positive

Source: Pai M et al. Lancet Infect Dis. 2014;14(8):765-773
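IN CODE
Both predictive values follow directly from the stated sensitivity and specificity once the prevalence is fixed; a short sketch makes the dependence explicit:

```python
# PPV of the same test (sensitivity 80%, specificity 95%) in two populations.
# Only the prevalence changes; the test itself does not.

sensitivity, specificity = 0.80, 0.95

def ppv(prevalence):
    true_pos  = sensitivity * prevalence               # infected and test-positive
    false_pos = (1 - specificity) * (1 - prevalence)   # healthy but test-positive
    return true_pos / (true_pos + false_pos)

for label, prev in [("high-prevalence setting", 0.10),
                    ("low-prevalence setting ", 0.001)]:
    print(f"{label} (prevalence {prev:.1%}): PPV = {ppv(prev):.0%}")
# high-prevalence: PPV ~64%   low-prevalence: PPV ~2%
```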

The Decision Tree
You are a physician in London seeing an immigrant from a high-TB country. TST is positive.

What Do You Conclude?

Positive TST in Immigrant Patient
PATH A: Apply UK Population PPV
Assume "Probably False Positive"
Miss active TB
Patient infects family
Delays diagnosis for months
PATH B: Consider Patient's Pre-Test Probability
Recognize PPV Depends on Prevalence
Investigate further
Chest X-ray, sputum
Treat early if confirmed
THE REVELATION
Sensitivity and specificity are properties of the test.

PPV and NPV are properties of the population.

The same result means different things
in different people.
THE LESSON
Never interpret a test result without knowing the pre-test probability.

A positive test in a high-risk patient usually means disease.
The same positive in a low-risk patient means probably nothing.

Context is everything.
Five Stories, Five Lessons
1

Theranos: Demand Validation

No peer-reviewed data = no trust, regardless of marketing claims

2

COVID Rapid Tests: Know Sensitivity Limits

"Not detected" is not the same as "not infected"

3

Mammography: Weigh Benefits vs. Harms

Finding is not always helping; overdiagnosis causes real harm

4

PSA: The Threshold is a Values Choice

Every cutoff trades sensitivity for specificity; there is no "right" answer

5

TB Test: Context Determines Meaning

The same result means different things in different populations

References

Key Sources Cited in This Course

  1. Carreyrou J. Bad Blood: Secrets and Lies in a Silicon Valley Startup. Knopf, 2018.
  2. Dinnes J, et al. Rapid, point-of-care antigen tests for diagnosis of SARS-CoV-2 infection. Cochrane Database Syst Rev. 2022;7:CD013705.
  3. Independent UK Panel on Breast Cancer Screening. The benefits and harms of breast cancer screening. Lancet. 2012;380:1778-1786.
  4. Reitsma JB, et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol. 2005;58:982-990.
  5. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med. 2001;20:2865-2884.
  6. Deeks JJ, et al. The performance of tests of publication bias in systematic reviews of diagnostic test accuracy. J Clin Epidemiol. 2005;58:882-893.
  7. Macaskill P, et al. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy. Chapter 10. 2023.
  8. Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21:1539-1558.
  9. US Food and Drug Administration. Warning Letter to Theranos Inc. 2016.
  10. US Preventive Services Task Force. Screening for Prostate Cancer. JAMA. 2018;319(18):1901-1913.
  11. Pai M, et al. Tuberculosis. Lancet Infect Dis. 2014;14(8):765-773.
In the Theranos story, what was the fundamental problem with their blood tests?
They were too expensive
They gave wrong results that could harm patients
They were too slow
They required too much blood
A COVID rapid test has 55% sensitivity in asymptomatic people. What does this mean?
55% of positive results are correct
The test works 55% of the time
The test misses 45% of infected asymptomatic people
55% of negative results are wrong
What does "SnNout" mean?
A highly Sensitive test, when Negative, rules OUT disease
A highly Specific test, when Negative, rules OUT disease
A sensitive test should be used for screening
Sensitivity and specificity are negatively correlated
Why can't you pool sensitivity and specificity separately in meta-analysis?
They use different scales
They are correlated due to the threshold effect
They come from different populations
It's computationally too difficult
Course Complete
"Now you know the four outcomes,
the two virtues of a test,
the cruel trade-off of the threshold,
and the art of pooling evidence.

When the next test lies to you—
you will know how to see through it."

When the Test Lies — Now You Know.