who promised to change the world with a drop of blood,
who raised hundreds of millions on a test that never worked?
Investors believed. Walgreens believed. The Pentagon believed.
They valued her company at $9 billion.
But the tests gave wrong results. Patients were told they had HIV when they didn't. Patients were told their blood was normal when they were dying.
What Theranos Did vs. What Should Have Happened
and the lie was dressed in certainty,
and no one asked for the 2×2 table."
This is why we study Diagnostic Test Accuracy.
there are only four possible truths.
Two are blessings. Two are curses.
What happens when a systematic review trusts every study equally?
REAL DATA
Sensitivity analyses in DTA systematic reviews consistently demonstrate that excluding high risk-of-bias studies changes pooled estimates. In mammography screening, case-control designs with unblinded interpretation tend to inflate sensitivity. The general principle is well-documented: QUADAS-2 quality assessment can shift pooled sensitivity by 10-15 percentage points when biased studies are removed.
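To see the mechanism, here is a minimal sketch of such a sensitivity analysis in Python: pool logit-transformed sensitivities with and without the high risk-of-bias studies. All study counts below are invented, and real DTA pooling should use the bivariate model discussed later, not this simplified fixed-effect version.

```python
import math

# (TP, FN, high_risk_of_bias) per study -- all counts are invented
studies = [
    (90, 10, True),    # case-control, unblinded: inflated accuracy
    (88, 12, True),
    (70, 30, False),   # consecutive patients, blinded interpretation
    (65, 35, False),
    (72, 28, False),
]

def pooled_sensitivity(rows):
    """Fixed-effect inverse-variance pooling on the logit scale."""
    num = den = 0.0
    for tp, fn, _ in rows:
        sens = tp / (tp + fn)
        logit = math.log(sens / (1 - sens))
        weight = 1 / (1 / tp + 1 / fn)   # inverse of the logit's variance
        num += weight * logit
        den += weight
    return 1 / (1 + math.exp(-num / den))

print(f"All studies:           {pooled_sensitivity(studies):.1%}")
print(f"Low risk of bias only: {pooled_sensitivity([s for s in studies if not s[2]]):.1%}")
```

With these invented numbers the pooled sensitivity drops by about six points once the biased studies are excluded; the direction, not the exact size, is the lesson.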
Every Test Result Has a Reality Behind It
HIV Rapid Test Example (Real Data)
| | HIV+ | HIV- | Total |
|---|---|---|---|
| Test + | 98 | 3 | 101 |
| Test - | 2 | 895 | 897 |
| Total | 100 | 898 | 998 |
Sensitivity = 98/100 = 98.0%
Specificity = 895/898 = 99.7%
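The same arithmetic as a quick Python sketch, including the two patient-facing numbers the table also yields:

```python
# The HIV rapid test 2x2 table above, recomputed in code.
tp, fp = 98, 3       # Test + row
fn, tn = 2, 895      # Test - row

sensitivity = tp / (tp + fn)   # 98/100: how many of the sick are found
specificity = tn / (tn + fp)   # 895/898: how many of the healthy are spared
ppv = tp / (tp + fp)           # 98/101: P(disease | positive)
npv = tn / (tn + fn)           # 895/897: P(no disease | negative)

print(f"Sensitivity {sensitivity:.1%}, Specificity {specificity:.1%}")
print(f"PPV {ppv:.1%}, NPV {npv:.1%}")
```

Note that PPV looks reassuringly high here (about 97%) only because 1 in 10 people in this sample was infected; the base-rate lesson returns later.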
TP, TN: the test spoke true.
FP, FN: the test lied.
Know them by name, for they determine fate."
found clean,
and given to thousands—
while death swam within it?
But the test had a window period—weeks after infection when the virus was present but undetectable.
Blood was tested. Blood was "negative." Blood was transfused.
8,000-12,000 Americans were infected through transfusions before better tests closed the window.
Why False Negatives Are Deadly
[Timeline: eclipse period → seroconversion → most infections detected → window closed]
for the virus had not yet shown its face.
And the blood was shared,
and the infection spread to the innocent."
to protect their pregnancies,
that planted cancer in their daughters
twenty years before it bloomed?
It was adopted without proper clinical trial evidence. Doctors assumed it worked because it seemed reasonable.
Decades later, their daughters developed a rare cancer: clear cell adenocarcinoma of the vagina. A cancer so rare it was a diagnostic signal in itself.
An estimated 5-10 million pregnant women and their children were exposed. The harm crossed generations.
What Should Have Happened
The cluster itself was the diagnostic test:
Sensitivity to DES exposure: nearly 100%
If you have this cancer at this age, you were almost certainly exposed.
[Stat: recorded cases of clear cell adenocarcinoma in DES daughters worldwide]
and the daughters grew in shadow,
and twenty years later the cancer bloomed—
a diagnosis that indicted a generation of medicine."
Sensitivity: Can it find the sick?
Specificity: Can it spare the healthy?
Can you trust a sensitivity number from a laboratory when the test is used in the real world?
REAL DATA
The BinaxNOW COVID-19 rapid antigen test reported sensitivity of approximately 84-97% in symptomatic individuals in manufacturer studies. However, real-world evaluations found sensitivity as low as 35-64% in asymptomatic individuals, depending on viral load and timing. The Cochrane review of rapid antigen tests (Dinnes 2022) confirmed average sensitivity of 73% in symptomatic and only 55% in asymptomatic populations across over 100 study evaluations.
Worked Example: COVID PCR Test
Worked Example: Same COVID PCR Test
When to Use Which Test
Sensitivity finds the sick.
Specificity spares the well.
But no test masters both perfectly—
this is the burden we bear."
who saw 99% accurate
and believed a positive result meant 99% certainty?
This is the deadliest error in medicine.
A test is 99% sensitive and 99% specific.
A patient tests positive.
What is the probability they have the disease?
Most doctors say ~99%. The real answer is about 9%.
Testing 100,000 People (Prevalence 1/1000)
See How Prevalence Changes PPV
Same Test, Different Settings

| Prevalence | PPV after a positive test |
|---|---|
| 0.1% | ~9% |
| 10% | ~92% |
| 50% | 99% |
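A quick check of those numbers, holding the test fixed at 99% sensitivity and 99% specificity:

```python
# PPV for a 99% sensitive, 99% specific test at different prevalences.
sens, spec = 0.99, 0.99

for prev in (0.001, 0.10, 0.50):
    tp = sens * prev               # true positives per person tested
    fp = (1 - spec) * (1 - prev)   # false positives per person tested
    print(f"prevalence {prev:6.1%} -> PPV {tp / (tp + fp):.1%}")
```

At a prevalence of 1 in 1,000, the false positives from the 99,900 healthy people swamp the 99 true positives, and PPV collapses to about 9%.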
and the patient heard '99% certain,'
and both were deceived—
for they forgot to ask: How rare is this disease?"
that could find TB in two hours,
that was called revolutionary—
but missed the drug-resistant strains?
South Africa deployed it nationwide. The WHO endorsed it.
But in patients with low bacterial loads—often HIV co-infected—sensitivity dropped to 67%. One in three cases missed.
And for detecting rifampicin resistance, it missed 5% of resistant cases. Those patients received the wrong treatment. The resistant TB spread.
When GeneXpert Is Not Enough
[Chart: sensitivity by patient group: high bacterial load, low bacterial load, immune suppressed]
and the doctor believed the machine,
and the patient went home with TB in their lungs,
coughing resistance into the world."
that found cancers that would never kill,
and led to treatments that destroyed lives?
Doctors screened millions of men. Cancers were found. Prostates were removed.
But many of these "cancers" would never have caused symptoms. The surgery caused impotence and incontinence in men who would have died of old age, not cancer.
If 1,000 Men Age 55-69 Are Screened for 13 Years
[Stats, per 1,000 screened: prostate cancer deaths; men left impotent or incontinent; harms (biopsies, anxiety)]
and the surgeon cut,
and the man lived—impotent, incontinent—
from a cancer that would never have woken."
whose first troponin was normal,
who was sent home—
and died before morning?
A patient arrives one hour after chest pain begins. Troponin is tested: normal. "You're fine. Go home."
The heart was dying. The protein hadn't leaked yet.
Studies show 2-5% of MI patients sent home from ED die within 30 days.
The Two-Troponin Protocol
[Chart: troponin sensitivity at 0 hrs vs. at 3 hrs with serial testing]
for the heart had just begun to die.
And the patient was reassured,
and went home to finish dying."
Sensitivity and specificity describe the test.
But the patient asks:
"I tested positive. What are MY chances?"
What if the published sensitivity of a test is higher than the truth, and the likelihood ratios you calculate are therefore wrong?
REAL DATA
Rapid strep tests (RADT) showed pooled sensitivity of approximately 86% in published studies included in Cochrane reviews. However, FDA 510(k) regulatory submissions, which include unpublished manufacturer data, revealed sensitivity estimates of only 70-75%. Published studies with higher sensitivity were more likely to be submitted for publication—a classic case of publication bias inflating the apparent accuracy.
From Pre-Test to Post-Test Probability
Pre-Test Probability → Likelihood Ratio → Post-Test Probability
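A sketch of that odds arithmetic, using the ~86% pooled sensitivity of the rapid strep test above; the 95% specificity and the 20% pre-test probability are assumed values for illustration:

```python
def post_test_probability(pre_test_prob: float, lr: float) -> float:
    """Probability -> odds, multiply by the likelihood ratio, back to probability."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

sens, spec = 0.86, 0.95          # specificity is an assumed value
lr_pos = sens / (1 - spec)       # LR+ comes to 17.2: a positive strongly rules in
lr_neg = (1 - sens) / spec       # LR- comes to ~0.15: a negative fairly strongly rules out

pre = 0.20                       # assumed pre-test probability
print(f"Positive result: {post_test_probability(pre, lr_pos):.0%}")   # ~81%
print(f"Negative result: {post_test_probability(pre, lr_neg):.0%}")   # ~4%
```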
How Powerful Is This Test?
Sensitivity tells of the sick.
Specificity tells of the well.
But the likelihood ratio answers:
What does this result mean for THIS patient?"
the rapid test that said negative,
and the Plasmodium that kept multiplying?
Rapid Diagnostic Tests were meant to guide treatment in remote areas without microscopes or laboratories.
But when parasitemia is low—the RDT misses cases. And when P. falciparum deletes the HRP2 gene—the RDT sees nothing at all.
Child with Fever in Malaria-Endemic Area
[Chart: RDT sensitivity by parasite density: >200/μL, 100-200/μL, <100/μL]
and the child was sent home,
and the parasites multiplied in the dark,
and by morning the child could not wake."
the world needed a test that was fast.
But fast is not the same as accurate.
When a new generation of test arrives with higher sensitivity, does that automatically make it better?
REAL DATA
High-sensitivity troponin (hs-cTn) assays increased sensitivity for acute myocardial infarction from approximately 70% (conventional troponin at presentation) to over 95%. But specificity dropped from approximately 95% to around 80% because hs-cTn detects myocardial injury from many non-MI causes (heart failure, sepsis, renal disease, pulmonary embolism). The net clinical effect required HSROC modeling across multiple studies to understand the tradeoff.
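What that tradeoff means in patient counts, as a sketch: the assay figures come from the paragraph above, while the 20% MI prevalence in this hypothetical ED cohort is an assumption.

```python
# Missed MIs vs. false alarms per 1,000 ED chest-pain patients.
assays = {
    "conventional troponin":     (0.70, 0.95),   # (sensitivity, specificity)
    "high-sensitivity troponin": (0.95, 0.80),
}
n, prevalence = 1000, 0.20           # prevalence is an assumed figure
mi, no_mi = n * prevalence, n * (1 - prevalence)

for name, (sens, spec) in assays.items():
    missed = (1 - sens) * mi         # false negatives: sent home with an MI
    alarms = (1 - spec) * no_mi      # false positives: admitted and worked up
    print(f"{name}: {missed:.0f} missed MIs, {alarms:.0f} false alarms")
```

The high-sensitivity assay cuts missed MIs from 60 to 10 but quadruples the false alarms, which is exactly why the net effect needs HSROC modeling rather than a one-number comparison.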
COVID-19 Rapid Antigen Tests (Dinnes 2022 Cochrane Review)
| Population | Sensitivity | Missed |
|---|---|---|
| Symptomatic | 73% | 27% |
| Asymptomatic | 55% | 45% |
| First 7 days | 80% | 20% |
Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705
Thanksgiving 2020: What Happened
and the family embraced,
and by winter's end,
the grandfather was buried."
that found cancers that would never kill,
and led to treatments that caused more harm than the disease?
Can you trust a DTA meta-analysis done in a spreadsheet?
REAL DATA
DTA meta-analysis requires the bivariate model or HSROC—both need maximum likelihood estimation of correlated sensitivity and specificity on the logit scale. Manual spreadsheet work is error-prone: the Reinhart & Rogoff episode in economics (2010), where a simple Excel error in an influential paper helped shape global austerity policy, shows how easily this happens. In DTA specifically, applying logit transformations by hand and pooling sensitivity and specificity separately in Excel ignores the correlation between them, and can produce pooled estimates that differ meaningfully from validated bivariate models in software (R mada/reitsma, Stata metandi, SAS NLMIXED).
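A sketch of why the spreadsheet shortcut fails: across studies, logit-sensitivity and logit-specificity are typically negatively correlated (the threshold effect), and averaging the two columns separately throws that structure away. The study values below are invented:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

# (sensitivity, specificity) per study: looser thresholds buy
# sensitivity at the price of specificity -- invented values
studies = [(0.95, 0.70), (0.90, 0.80), (0.85, 0.88), (0.75, 0.93), (0.65, 0.97)]
xs = [logit(se) for se, _ in studies]
ys = [logit(sp) for _, sp in studies]

n = len(studies)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
print(f"corr(logit sens, logit spec) = {cov / (sx * sy):+.2f}")   # strongly negative

# The bivariate model (Reitsma 2005) estimates this joint distribution;
# in practice, use mada::reitsma() in R, metandi in Stata, or NLMIXED in SAS.
```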
If 10,000 Women Age 50-69 Are Screened for 10 Years
[Stats, per 10,000 screened: cancer deaths avoided; overdiagnosed cancers (treated unnecessarily); false alarms (anxiety, biopsies)]
Is this trade-off worth it?
and called it cancer,
and the woman was cut and burned—
for a shadow that would never have darkened her days."
that finds the plaques in the brain,
but cannot tell you
if the mind will fade?
But 30% of cognitively normal elderly have amyloid plaques. They may never develop dementia.
And 10-20% of people with dementia have no amyloid.
The test finds the plaques. But plaques are not the disease. We are testing for a surrogate, not the outcome.
What Are We Really Testing For?
and the doctor named it Alzheimer's,
and the patient lived in terror—
of a forgetting that might never come."
Some are biased.
Some are poorly designed.
Some should not be trusted.
How do we separate the wheat from the chaff?
What if most DTA studies do not even report enough information to judge their quality?
REAL DATA
Before the STARD initiative was published in 2003, a systematic assessment found that fewer than half of DTA studies reported whether index test interpretation was blinded, and reference standard descriptions were frequently inadequate. After STARD, reporting improved: multiple meta-epidemiological assessments found adherence to STARD items rose substantially, though many studies still fell short on key items such as flow diagrams and indeterminate result handling.
Four Domains of Risk of Bias
Patient Selection
Was a consecutive or random sample enrolled? Was a case-control design avoided?
Index Test
Was the test interpreted without knowledge of the reference standard? Was the threshold pre-specified?
Reference Standard
Is the reference standard likely to correctly classify the condition? Was it interpreted blind?
Flow and Timing
Was there appropriate interval between tests? Did all patients receive the same reference standard?
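One way to make the four domains operational is to record a judgment per domain and flag studies for the sensitivity analysis. A minimal sketch; the field names and study entries are illustrative, not part of the QUADAS-2 tool itself:

```python
from dataclasses import dataclass

DOMAINS = ("patient_selection", "index_test", "reference_standard", "flow_and_timing")

@dataclass
class Quadas2:
    study: str
    patient_selection: str    # each judgment: "low", "high", or "unclear"
    index_test: str
    reference_standard: str
    flow_and_timing: str

    def high_risk(self) -> bool:
        """Flag a study if any domain is judged at high risk of bias."""
        return any(getattr(self, d) == "high" for d in DOMAINS)

studies = [
    Quadas2("Smith 2015", "low", "low", "low", "low"),
    Quadas2("Jones 2018", "high", "unclear", "low", "low"),   # case-control design
]
for s in studies:
    verdict = "exclude in sensitivity analysis" if s.high_risk() else "keep"
    print(f"{s.study}: {verdict}")
```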
Should You Trust This Study?
Verification Bias
Only positive tests get the reference standard → inflates sensitivity
Spectrum Bias
Study population differs from clinical reality → results don't generalize
Incorporation Bias
Index test is part of reference standard → artificially high accuracy
Review Bias
Index test interpreted knowing reference result → inflates both metrics
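Verification bias, the first of these, is easy to demonstrate with arithmetic. Suppose true sensitivity is 80%, but only the test-positives plus 10% of test-negatives receive the reference standard; all numbers below are hypothetical:

```python
# Simulating verification bias: unverified false negatives vanish
# from the 2x2 table, so the observed sensitivity inflates.
diseased = 200
true_sens = 0.80
tp = true_sens * diseased           # 160 test-positives, all verified
fn = (1 - true_sens) * diseased     # 40 false negatives...
verified_fn = 0.10 * fn             # ...of whom only 10% are verified

observed_sens = tp / (tp + verified_fn)
print(f"true sensitivity:     {true_sens:.0%}")      # 80%
print(f"observed sensitivity: {observed_sens:.0%}")  # ~98%
```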
ask: How were they gathered?
A biased study speaks with confidence—
but its confidence is a lie."
One study may flatter.
But when you gather all the evidence—
the truth becomes harder to hide.
What happens when different studies use different thresholds for the same test, and you try to pool them?
REAL DATA
D-dimer testing for pulmonary embolism (PE) traditionally used a fixed cutoff of 500 µg/L. The ADJUST-PE trial (Righini et al., JAMA 2014) showed that an age-adjusted cutoff (age × 10 µg/L for patients over 50) increased the proportion of elderly patients with negative D-dimer results from ~6% to ~30%, with a 3-month VTE risk of only 0.3% in the age-adjusted negative group. A DTA meta-analysis of D-dimer studies must use the bivariate model because different thresholds create a sensitivity-specificity tradeoff visible on the SROC curve.
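As a sketch, the age-adjusted rule is a one-liner; note that at exactly age 50 the two rules coincide at 500 µg/L, so the boundary convention does not matter:

```python
def d_dimer_cutoff(age_years: int) -> float:
    """Age-adjusted D-dimer cutoff in ug/L, per the ADJUST-PE rule."""
    return age_years * 10.0 if age_years >= 50 else 500.0

for age in (40, 50, 65, 80):
    print(f"age {age}: cutoff {d_dimer_cutoff(age):.0f} ug/L")

# An 80-year-old with D-dimer 720 ug/L: positive at the fixed 500 ug/L
# cutoff, negative at the age-adjusted 800 ug/L cutoff.
```

Each threshold choice trades sensitivity against specificity, which is why studies at different cutoffs scatter along the SROC curve.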
You cannot pool them separately like treatment effects. You need the bivariate model.
Summary Receiver Operating Characteristic
What Does the Curve Tell You?
Many studies, weighed together,
trace the path of truth—
the SROC curve that reveals what the test can truly do."
One says sensitivity is 95%.
Another says 60%.
Which truth do you believe?
What if a test works well in the general population but fails in the patients who need it most?
REAL DATA
HRP2-based malaria rapid diagnostic tests (RDTs) achieve sensitivity of approximately 95% in the general population in endemic areas. However, in pregnant women, sensitivity can drop to as low as 56-76% due to placental sequestration of parasites—the parasites hide in the placenta, keeping peripheral blood parasitemia low and below the RDT detection threshold. A Cochrane review of malaria RDTs found substantial heterogeneity (I² often exceeding 80%) driven by population subgroups including pregnancy, children under 5, and HIV co-infection.
Why Studies Disagree
[Scale: heterogeneity (I²): low, studies agree; moderate, some variation; high, major disagreement]
do not silence the dissent.
Ask: Why do they see differently?
The disagreement itself teaches."
When an AI claims to diagnose better than doctors, should you trust the overall AUC?
REAL DATA
Deep learning models for skin cancer detection have reported AUC values as high as 0.91-0.94 in development datasets. However, external validation revealed alarming disparities: Daneshjou et al. (2022) found that commercial AI dermatology tools performed at near-chance levels on darker skin (Fitzpatrick types V-VI), with AUC as low as 0.50-0.57 — essentially random. Training datasets were heavily biased toward lighter skin tones, meaning the 2x2 table was never properly filled for all populations.
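The remedy is to never settle for a single overall number: stratify the evaluation. A sketch on synthetic data; the group labels, effect sizes, and model scores are all invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
group = rng.choice(["Fitzpatrick I-IV", "Fitzpatrick V-VI"], size=n, p=[0.8, 0.2])
y_true = rng.integers(0, 2, size=n)

# Simulate a model that is informative on lighter skin, near-chance on darker
signal = np.where(group == "Fitzpatrick I-IV", 1.5, 0.1)
y_score = y_true * signal + rng.normal(size=n)

print(f"overall AUC: {roc_auc_score(y_true, y_score):.2f}")
for g in ("Fitzpatrick I-IV", "Fitzpatrick V-VI"):
    mask = group == g
    print(f"{g}: AUC {roc_auc_score(y_true[mask], y_score[mask]):.2f}")
```

Because the majority group dominates, the overall AUC looks respectable while the minority-group AUC sits near 0.5, which is the pattern the checklist below is designed to catch.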
Was there a valid reference standard?
Gold standard applied to ALL patients?
Were interpreters blinded?
Test readers unaware of diagnosis?
Was the spectrum appropriate?
Patients similar to your population?
Was the threshold pre-specified?
Or chosen to maximize results?
The Clinical Override Decision Tree
When One Test Isn't Enough
armed with the SROC and the measure of agreement,
you can see through the lie of the test—
and judge its truth for yourself."
who received the wrong blood,
not because the test was wrong,
but because no one performed it?
Yet transfusion reactions still kill—not from test failure, but from human failure:
• Wrong blood drawn from wrong patient
• Labels switched in the lab
• Bedside check skipped in emergency
In the UK, 1 in 13,000 transfusions goes to the wrong patient. The test worked. The system failed.
Where Can Things Go Wrong?
if the wrong blood is drawn,
the wrong label is applied,
the wrong bag is hung."
DTA studies measure test accuracy. They do not measure system accuracy.
Key Sources
- Carreyrou J. Bad Blood. Knopf, 2018. [Theranos]
- CDC. MMWR. 1987;36(49):833-840. [HIV blood supply]
- Herbst AL et al. N Engl J Med. 1971;284:878-881. [DES]
- Moyer VA. Ann Intern Med. 2012;157:120-134. [PSA]
- Pope JH et al. N Engl J Med. 2000;342:1163-1170. [Troponin]
- Steingart KR et al. Cochrane Database Syst Rev. 2014;1:CD009593. [GeneXpert]
- Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705. [COVID RAT]
- Independent UK Panel on Breast Cancer Screening. Lancet. 2012;380:1778-1786. [Mammography]
- Jack CR et al. Lancet Neurol. 2018;17:760-773. [Amyloid]
- WHO. Malaria RDT Performance. 2022.
- Reitsma JB et al. J Clin Epidemiol. 2005;58:982-990. [Bivariate]
- Whiting PF et al. Ann Intern Med. 2011;155:529-536. [QUADAS-2]
- Bolton-Maggs PHB. Transfus Med. 2016;26:303-311.
the two virtues of a test,
the fallacy of the base rate,
the art of pooling evidence,
and the biases that hide the truth.
When the next test lies to you—
you will know."