When Tests Lie: A Course on Diagnostic Test Accuracy

==================== Module 1: The Fraud (Theranos) ====================

Have you not heard the story of the woman
who promised to change the world with a single drop of blood,
who raised hundreds of millions on a test that never worked?

Slide 1.2: The Theranos Story

Palo Alto, 2003

STANFORD UNIVERSITY

A 19-year-old drops out of college with a vision: a device that could run hundreds of blood tests on a single drop of blood from a painless finger prick.

No more needles. No more vials. No more waiting.

Investors believed. Walgreens believed. The Pentagon believed.

They valued her company at $9 billion.

Carreyrou J. Bad Blood: Secrets and Lies in a Silicon Valley Startup. 2018

Slide 1.3: The Fraud

But the test lied.

$9 Billion

Built on a test that didn't work

WHAT HAPPENED

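The Theranos machines gave wrong results. Patients were told they had HIV when they did not. Patients were told their blood was normal while they were dying. The company hid the truth for more than a decade.

Slide 1.4: The Victims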

The Victims

PHOENIX, ARIZONA

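A pregnant woman was told her baby would be born with a chromosomal abnormality. She prepared for the worst. She grieved.

The test was wrong. The baby was healthy.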

But how many women, receiving the same news, made different decisions?

THE QUESTION

How do we know when a test is telling the truth? How do we measure a test's honesty?

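Slide 1.5: Refrain

"When patients were told their blood was normal, and the test lied,
and the lie was spoken with certainty,
no one questioned the numbers."

This is why we study diagnostic test accuracy.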

==================== Module 2: The Four Outcomes ====================

When a test gives a result,
there are only four possible truths.

Two are blessings. Two are curses.

Slide 2.2: The Decision Tree

The Outcome Tree

Every Test Result Has a Reality Behind It

Patient Tested

↓

Test Says: POSITIVE

"You have it"

Test Says: NEGATIVE

"You're clear"

↓

Has disease
Test: Positive

No disease
Test: Positive

Has disease
Test: Negative

No disease
Test: Negative

Slide 2.3: The Four Quadrants

The Four Truths

True Positive (TP)

Sick person correctly identified.
The test told the truth.

False Positive (FP)

Healthy person wrongly alarmed.
The test lied.

False Negative (FN)

Sick person wrongly reassured.
The deadliest lie.

True Negative (TN)

Healthy person correctly cleared.
The test told the truth.

Slide 2.4: The 2x2 Table

The Sacred Table

The 2x2 Confusion Matrix

	Disease Present	Disease Absent
Test Positive	TP True Positive	FP False Positive
Test Negative	FN False Negative	TN True Negative

REMEMBER

Every diagnostic study must report these four numbers. Without them, you cannot judge a test.
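As a minimal sketch of how the four numbers arise, the code below tallies a 2x2 table from paired observations (test result, true disease status). The function name and the six-patient data set are hypothetical and serve only as illustration.

```python
from collections import Counter

def confusion_counts(test_positive, has_disease):
    """Tally TP, FP, FN, TN from paired booleans (test result, true status)."""
    if len(test_positive) != len(has_disease):
        raise ValueError("inputs must have the same length")
    tally = Counter(zip(test_positive, has_disease))
    return {
        "TP": tally[(True, True)],    # sick, test positive
        "FP": tally[(True, False)],   # healthy, test positive
        "FN": tally[(False, True)],   # sick, test negative
        "TN": tally[(False, False)],  # healthy, test negative
    }

# Hypothetical data for six patients (not from any study).
test_positive = [True, True, False, False, True, False]
has_disease = [True, False, True, False, True, False]
print(confusion_counts(test_positive, has_disease))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```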

Slide 2.5: Refrain

"Two outcomes save. Two outcomes harm.
Know them by name.
TP, TN: the test told the truth.
FP, FN: the test lied."

==================== Module 3: Sensitivity and Specificity ====================

A test has two virtues and two vices.

Sensitivity asks: Can it find the sick?

Specificity asks: Can it spare the healthy?

Sensitivity

"Of all the sick, how many did the test find?"

THE FORMULA

Sensitivity = TP / (TP + FN)

True Positives divided by All Diseased

IN SIMPLE WORDS

If 100 people have cancer and the test finds 90 of them, sensitivity = 90%.

High sensitivity = few false negatives = few missed cases.

Specificity

"Of all the healthy, how many did the test spare?"

THE FORMULA

Specificity = TN / (TN + FP)

True Negatives divided by All Non-Diseased

IN SIMPLE WORDS

If 100 people are healthy and 95 of them correctly test negative, specificity = 95%.

High specificity = few false positives = few false alarms.
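The two formulas translate directly into code. This sketch reproduces the worked numbers from the slides (90 of 100 sick found, 95 of 100 healthy cleared); the function names are illustrative choices.

```python
def sensitivity(tp, fn):
    """True positives divided by all diseased: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negatives divided by all non-diseased: TN / (TN + FP)."""
    return tn / (tn + fp)

print(sensitivity(tp=90, fn=10))  # 0.9
print(specificity(tn=95, fp=5))   # 0.95
```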

Slide 3.4: The Trade-off

The Cruel Trade-off

THE DILEMMA

In practice, no test has both perfect sensitivity AND perfect specificity.

Lower the threshold to catch more sick people? You'll alarm more healthy people.

Raise the threshold to spare healthy people? You'll miss more sick people.

This is the threshold effect: the seesaw of diagnosis.
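The seesaw can be seen by sweeping a cutoff over overlapping measurements. The sketch below assumes a continuous biomarker where higher values suggest disease and a result counts as positive at or above the threshold. All readings are invented for illustration.

```python
def sens_spec_at(threshold, sick_values, healthy_values):
    """Call a result positive when the value is >= threshold."""
    tp = sum(v >= threshold for v in sick_values)
    tn = sum(v < threshold for v in healthy_values)
    return tp / len(sick_values), tn / len(healthy_values)

# Hypothetical biomarker readings; the two groups overlap.
sick = [4.1, 5.0, 5.8, 6.5, 7.2, 8.0]
healthy = [2.0, 2.9, 3.5, 4.4, 5.2, 6.0]

for threshold in (3.0, 5.0, 7.0):
    sens, spec = sens_spec_at(threshold, sick, healthy)
    print(f"threshold {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With these readings, raising the threshold from 3.0 to 7.0 moves sensitivity down and specificity up, which is the trade-off described above.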

Slide 3.5: SnNout and SpPin

The Memory Rules

SnNout: Sensitive tests rule OUT

A highly sensitive test, when negative, rules out disease. If it didn't find it, it's probably not there.

SpPin: Specific tests rule IN

A highly specific test, when positive, rules in disease. If it says you have it, you probably do.

REMEMBER

SnNout: Sensitive Negative rules OUT
SpPin: Specific Positive rules IN

"Sensitivity catches the sick.
Specificity spares the well.
But no test masters both perfectly:
this is the burden we must bear."

==================== Module 4: The COVID Testing Scandal ====================

In 2020, as the plague swept the land,
the world needed a test to find the infected fast.

But what if the rapid test missed too many?

Slide 4.2: The Promise

The Promise of Rapid Testing

15 min

Results Time

Cost per Test

Home

Self-Administered

THE HOPE

Rapid antigen tests could be used at home, at work, before gatherings. They could stop outbreaks before they started.

Slide 4.3: The Reality

The Reality

COCHRANE SYSTEMATIC REVIEW, 2022

When researchers pooled data from 155 studies of COVID-19 rapid antigen tests:

In people WITH symptoms:
Sensitivity: 73% (missed 27% of cases)

In people WITHOUT symptoms:
Sensitivity: 55% (missed 45% of cases)

Nearly half of infected people without symptoms were told they were clear.

Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705

Slide 4.4: The Consequences

How False Negatives Spread

Thanksgiving Dinners

Families tested negative in the morning, gathered indoors, unknowingly infected grandparents

Workplace Outbreaks

Workers tested negative, came to work, infected colleagues in the break room

Hospital Transmission

Patients tested negative, admitted to wards, infected vulnerable patients

THE LESSON

A 55% sensitivity means a 45% false negative rate. False negatives are invisible. People with a false negative feel safe. They spread disease.
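The false negative rate is simply 100% minus sensitivity. As a small sketch, the code below turns the two pooled sensitivities quoted above into expected missed cases per 1,000 infected people. The cohort size of 1,000 is an arbitrary illustration, not a figure from the review.

```python
def expected_false_negatives(infected, sensitivity_pct):
    """Infected people the test is expected to miss, given sensitivity in percent."""
    return infected * (100 - sensitivity_pct) / 100

for group, sens_pct in [("symptomatic", 73), ("asymptomatic", 55)]:
    missed = expected_false_negatives(1000, sens_pct)
    print(f"{group}: about {missed:.0f} of 1,000 infected people test negative")
```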

"The test said 'negative,'
the family gathered,
the grandfather hugged his grandchildren,
and by winter's end, he was gone."

==================== Module 5: Likelihood Ratios ====================

Sensitivity and specificity describe the test.

But the patient asks a different question:

"I tested positive. What are my chances?"

Slide 5.2: The Problem with Sens/Spec

The Disconnect

A DOCTOR'S DILEMMA

A test has 99% sensitivity and 95% specificity. Sounds excellent.

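Your patient tests positive for a rare disease (prevalence 1 in 1,000).

Question: What is the probability that they actually have the disease?

Most doctors say 95%. The real answer? About 2%.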
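The 2% figure follows from Bayes' theorem applied to the numbers on this slide (99% sensitivity, 95% specificity, prevalence 1 in 1,000). A minimal sketch, with an illustrative function name:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence               # sick and test positive
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy and test positive
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(0.99, 0.95, 0.001)
print(f"PPV = {ppv:.1%}")  # PPV = 1.9%
```

Out of 100,000 people, roughly 99 true positives are swamped by about 4,995 false positives, which is why the answer is so low.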

Likelihood Ratios

How much a test result changes the odds

POSITIVE LIKELIHOOD RATIO

LR+ = Sensitivity / (1 - Specificity)

How much more likely is a positive result in sick vs healthy?

NEGATIVE LIKELIHOOD RATIO

LR- = (1 - Sensitivity) / Specificity

How much more likely is a negative result in sick vs healthy?
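Likelihood ratios act on odds, not probabilities: post-test odds = pre-test odds × LR. The sketch below computes both ratios from the two formulas above and applies LR+ to the rare-disease example (99% sensitivity, 95% specificity, pre-test probability 0.1%). Function names are illustrative.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, multiply by the LR, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

lr_pos, lr_neg = likelihood_ratios(0.99, 0.95)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.3f}")  # LR+ = 19.8, LR- = 0.011
print(f"{post_test_probability(0.001, lr_pos):.1%}")  # 1.9%
```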

What Good LRs Look Like

LR+ > 10

Strong rule-in

LR+ 5-10

Moderate rule-in

LR+ 2-5

Weak evidence

LR- < 0.1

Strong rule-out

LR- 0.1-0.2

Moderate rule-out

LR- 0.2-0.5

Weak evidence
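The bands above can be written as a lookup. This is only a sketch of the slide's rule of thumb: the handling of the shared endpoints (5, 10, 0.1, 0.2) and the label for values outside the listed bands are assumptions made here, not part of the slide.

```python
def interpret_lr_positive(lr_pos):
    """Map LR+ to the slide's rule-in bands."""
    if lr_pos > 10:
        return "strong rule-in"
    if lr_pos >= 5:
        return "moderate rule-in"
    if lr_pos >= 2:
        return "weak evidence"
    return "little or no evidence"  # assumption: band not named on the slide

def interpret_lr_negative(lr_neg):
    """Map LR- to the slide's rule-out bands."""
    if lr_neg < 0.1:
        return "strong rule-out"
    if lr_neg <= 0.2:
        return "moderate rule-out"
    if lr_neg <= 0.5:
        return "weak evidence"
    return "little or no evidence"  # assumption: band not named on the slide

print(interpret_lr_positive(19.8))  # strong rule-in
print(interpret_lr_negative(0.3))   # weak evidence
```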

"Sensitivity tells how many sick the test will catch.
Specificity tells how many well it will spare.
But only the likelihood ratio answers:
What does this result mean for this patient?"

==================== Module 6: The Mammography Controversy ====================

Have you not heard of the controversy over the screening
that found too much?

When does finding disease become causing harm?

Slide 6.2: The Promise of Screening

The Dream of Early Detection

1970s - PRESENT

The logic was beautiful: find cancer early, treat it early, save lives.

Mammography could detect tumors too small to feel.

Women were told: "Annual mammograms save lives."

But what if some of those "cancers" would never have killed?

The Curse of Overdiagnosis

19-30%

of screen-detected breast cancers may be overdiagnosed

WHAT IS OVERDIAGNOSIS?

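Finding a "cancer" that would never have caused symptoms or death in the person's lifetime.

The woman is diagnosed and treated with surgery, radiation, and chemotherapy, all for a disease that would never have harmed her.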

Independent UK Panel on Breast Cancer Screening. Lancet. 2012;380:1778-1786

Slide 6.4: The Numbers

For Every 10,000 Women Screened

3-4

Lives saved
from breast cancer

~15

Overdiagnosed
(treated unnecessarily)

~500

False alarms
(anxiety, biopsies)

THE TRADE-OFF

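To save 3-4 lives, you must accept that ~15 women will be treated for cancers that would never have harmed them, and ~500 women will experience false alarms.

Is this a good trade? The answer depends on values, not just numbers.

"The test found what was hidden
and called it disease.
The woman was cut, burned, and poisoned
for a shadow that would never have darkened her days."

This is the problem of overdiagnosis.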

==================== Module 7: Meta-Analysis of DTA ====================

One study may lie. One study may flatter.

But when you gather all the studies,
when you weigh their evidence,

The truth becomes harder to hide.

Why Pool the Evidence?

More Precision

Combining studies gives narrower confidence intervals, reducing uncertainty

Detect Heterogeneity

Why do different studies give different answers? Setting? Population? Threshold?

Expose Publication Bias

Are negative studies hidden? Funnel plots reveal asymmetry

Explore Thresholds

Build SROC curves to understand the sensitivity-specificity trade-off

Slide 7.3: The Bivariate Model

The Bivariate Model

WHY IT'S SPECIAL

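You cannot pool sensitivity and specificity separately.

They are correlated: when one goes up, the other tends to go down (the threshold effect).

The bivariate model accounts for this correlation and gives valid summary estimates.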

Reitsma JB et al. J Clin Epidemiol. 2005;58:982-990
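As a sketch of what "accounts for this correlation" means, the between-study level of the bivariate model is usually written on the logit scale. The symbols below (Se_i and Sp_i for the true sensitivity and specificity in study i, μ_A, μ_B, and the σ terms) are notation chosen here for illustration.

```latex
\begin{pmatrix} \operatorname{logit}(\mathrm{Se}_i) \\ \operatorname{logit}(\mathrm{Sp}_i) \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},\;
\begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix}
\right)
```

The summary sensitivity and specificity are the inverse logits of μ_A and μ_B. The covariance σ_AB carries the correlation between the two, and it is typically negative when studies differ mainly in threshold.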

The Summary ROC Curve

ROC Space

[Figure: SROC plot. x-axis: 1 - Specificity (FPR); y-axis: Sensitivity. Marked: perfect test; useless test (chance line).]

Each point = one study
The curve shows the trade-off
Higher = better test

Reading the SROC

Top-left corner = perfect test (100% sens, 100% spec)
Diagonal line = useless test (random guessing)
The curve = a summary of performance across all studies
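Each study enters ROC space as a single point computed from its 2x2 table. The sketch below shows that step only and does not fit the curve. The three studies and their counts are hypothetical.

```python
def roc_point(tp, fp, fn, tn):
    """Return (1 - specificity, sensitivity) for one study's 2x2 table."""
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical 2x2 counts for three studies (illustration only).
studies = {
    "Study A": dict(tp=45, fp=20, fn=5, tn=80),
    "Study B": dict(tp=40, fp=10, fn=10, tn=90),
    "Study C": dict(tp=30, fp=2, fn=20, tn=98),
}

for name, counts in studies.items():
    fpr, sens = roc_point(**counts)
    side = "above" if sens > fpr else "on or below"
    print(f"{name}: FPR {fpr:.2f}, sensitivity {sens:.2f} ({side} the chance line)")
```

In this invented example the three points trace the threshold effect: as the false positive rate falls from 0.20 to 0.02, sensitivity falls from 0.90 to 0.60.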

"One study may deceive. Many studies, taken together,
begin to reveal the truth.
The SROC curve is the path of evidence,
showing what the test can truly do."

But what if the studies disagree?

One study reports a sensitivity of 95%.
Another says 60%.

Which truth do you believe?

Heterogeneity: The Disagreement

DEFINITION

Heterogeneity is variation between studies that cannot be explained by chance alone.

High heterogeneity means the studies are measuring different things, or the test performs differently in different settings.

Why Studies Disagree

Threshold Differences

Different cutoffs for a "positive" result (e.g., different HbA1c thresholds for diabetes)

Population Differences

Disease severity, age, comorbidities differ between studies

Setting Differences

Primary care vs. specialist clinic vs. emergency room

Quality Differences

Risk of bias, verification bias, spectrum bias

Measuring the Disagreement

I² < 25%

Low heterogeneity
Studies agree

I² 25-75%

Moderate
Some disagreement

I² > 75%

High heterogeneity
Major disagreement

THE WARNING

When I² > 75%, the pooled estimate may be meaningless.

You cannot average apples and oranges. You must explain why studies differ before pooling them.
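I² is derived from Cochran's Q: I² = max(0, (Q - df) / Q) × 100, where df is the number of studies minus one. The sketch below applies it to sensitivity on the logit scale with inverse-variance weights. This is a univariate simplification that ignores the link between sensitivity and specificity, and the three studies are hypothetical.

```python
import math

def logit_sensitivity(tp, fn):
    """Logit of sensitivity and its approximate variance (1/TP + 1/FN)."""
    return math.log(tp / fn), 1 / tp + 1 / fn

def i_squared(estimates, variances):
    """I-squared (in %) from study estimates and their variances."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100

# Hypothetical studies as (TP, FN): sensitivities of 95%, 60%, and 80%.
studies = [(95, 5), (60, 40), (80, 20)]
estimates, variances = zip(*(logit_sensitivity(tp, fn) for tp, fn in studies))
print(f"I-squared = {i_squared(estimates, variances):.0f}%")
```

For these invented studies the value lands well above 75%, the band the slide labels as major disagreement.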

"When studies disagree,
do not silence the dissent.
Ask: Why do they see differently?
The disagreement itself is telling."

==================== Module 9: The DTA Toolkit ====================

Your DTA Toolkit

Essential measures and when to use them

Slide 9.2: The Measures

Essential Metrics

Sensitivity & Specificity

How well the test performs on sick vs. healthy people

Likelihood Ratios (LR+, LR-)

How much a result changes the probability of disease

Diagnostic Odds Ratio (DOR)

Single measure of test discrimination (DOR = LR+ / LR-)

Area Under the SROC Curve (AUC)

Overall test performance across all thresholds (0.5 = useless, 1.0 = perfect)
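The diagnostic odds ratio can be computed either from the 2x2 counts, as (TP × TN) / (FP × FN), or as LR+ / LR-; the two routes agree. A minimal sketch with illustrative counts (90 of 100 sick detected, 95 of 100 healthy cleared):

```python
def diagnostic_odds_ratio(tp, fp, fn, tn):
    """DOR from the 2x2 table: (TP * TN) / (FP * FN)."""
    return (tp * tn) / (fp * fn)

tp, fp, fn, tn = 90, 5, 10, 95
sens = tp / (tp + fn)
spec = tn / (tn + fp)
lr_pos = sens / (1 - spec)
lr_neg = (1 - sens) / spec

print(diagnostic_odds_ratio(tp, fp, fn, tn))  # 171.0
print(round(lr_pos / lr_neg, 6))              # 171.0
```

If any cell is zero the ratio is undefined or infinite; a continuity correction (such as adding 0.5 to each cell) is a common workaround, though it is not shown here.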

Slide 9.3: Tools and Software

Tools of the Trade

mada

R package for
bivariate meta-analysis

metaDTA

Stata module
for DTA reviews

DTA Pro

Browser-based
open-access tool

Standard References

Reitsma et al. 2005 - Bivariate model
Rutter & Gatsonis 2001 - HSROC model
Cochrane Handbook Ch. 10 - DTA methods

Slide 9.4: The Checklist

Before You Trust a DTA Study

✓

Was there a valid reference standard?

Gold standard test applied to all patients?

✓

Were the interpreters blinded?

Test readers unaware of diagnosis, and vice versa?

✓

Was the patient spectrum appropriate?

Patients similar to your clinical population?

✓

Was the threshold pre-specified?

Or was it chosen to maximize the results?

"Armed with sensitivity, specificity, likelihood,
equipped with the SROC and measures of consistency,
you can see through a test's lies
and judge its truth for yourself."

==================== Module 10: Real-World Decision Stories ====================

========== Story 1: The Theranos Mirage ==========

Have you considered what happens when a test promises everything?

When a machine claims to see what no other machine can see,
and no one asks: "Show me the evidence"?

Slide 10.2: Real Data Box

The Theranos Mirage

REAL DATA: FDA INSPECTION FINDINGS

Theranos claimed: 200+ tests from a single finger prick

FDA found:
• Results varied by 146% between runs on the same sample
• Edison machines failed 87% of proficiency tests
• Zero published peer-reviewed validation studies
• Patients whose samples were negative received HIV-positive results

Sources: FDA Warning Letter 2016; Carreyrou J. Bad Blood. 2018; CMS Inspection Reports.

The Decision Tree

You are a hospital administrator. Theranos offers you a contract.

What do you choose?

Theranos Contract Offered

↓

PATH A: Trust Marketing Claims

↓

Sign the Contract

↓

Misdiagnose thousands
Face lawsuits
Harm patients

PATH B: Demand Validation Data

↓

No Peer-Reviewed Studies Found

↓

Reject the contract
Protect your patients
Avoid Scandal

Slide 10.4: The Revelation

THE REVELATION

Elizabeth Holmes never validated a single test externally.

A $9 billion valuation became a criminal fraud conviction.

Every hospital that demanded validation data before signing
was protected from the lie.

Every hospital that trusted the marketing
became complicit in harming patients.

THE LESSON

No peer-reviewed validation = No trust.
A lack of evidence is not a marketing problem.
It is a patient safety emergency.

========== Story 2: The COVID Rapid Test Chaos ==========

When speed defeats accuracy,
who pays the price?

The test result comes in 15 minutes.
But what if the result is 15 minutes of false confidence?

Slide 10.6: Real Data Box

The COVID Rapid Test Chaos

REAL DATA: MANUFACTURER CLAIMS VS. REAL-WORLD PERFORMANCE

Manufacturer claims: 97% sensitivity

Real-world performance (Cochrane 2022):
• Symptomatic individuals: 73% sensitivity (missed 27%)
• Asymptomatic individuals: 55% sensitivity (missed 45%)
• Early infection (days 0-3): ~50% sensitivity

Nearly half of infected people without symptoms were told they were "clear."

Source: Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705

The Decision Tree

You are a school nurse during the COVID pandemic. A symptomatic teacher tests negative on a rapid test.

What do you choose?

Symptomatic Teacher + Negative Rapid Test

↓

PATH A: Trust the Negative Result

↓

Send Teacher to Class

↓

Outbreak infects 30 students
School closure
Three hospitalizations

PATH B: Remember Sensitivity Limits

↓

Require PCR Confirmation

↓

PCR positive confirmed
Teacher isolates
Outbreak prevented

Slide 10.8: The Revelation

THE REVELATION

A negative test does not mean no infection.

It means: "not detected."

The difference between those two phrases
is measured in lives.

THE LESSON

When sensitivity is 73%, a negative result in a symptomatic person
still misses about one infection in four. It cannot rule out disease.

SnNout only works when sensitivity is HIGH.
Know your test's limits before trusting its verdict.

========== Story 3: The Mammography Controversy ==========

Can a test that finds cancer
still cause harm?

What if the cancer it finds
would never have hurt you?

Slide 10.10: Real Data Box

The Mammography Controversy

REAL DATA: THE SCREENING PARADOX

Test characteristics:
Sensitivity: ~85% | Specificity: ~90%

For every 1,000 women screened annually for 10 years:
• 1 death prevented from breast cancer
• 5 women overtreated for cancers that would never have harmed them
• 100-500 false alarms leading to biopsies, anxiety, repeat imaging

Overdiagnosis rate: 19-30% of screen-detected cancers

Source: Independent UK Panel on Breast Cancer Screening. Lancet. 2012;380:1778-1786

The Decision Tree

A 45-year-old woman asks your advice on mammography screening.

What do you choose?

Patient Asks: "Should I Get Annual Mammograms?"

↓

PATH A: Recommend Without Context

↓

Small Tumor Found

↓

Mastectomy performed
Tumor was indolent (DCIS)
Would never have harmed her

PATH B: Explain Trade-offs

↓

Shared Decision-Making

↓

Patient makes informed choice
Understands benefits and harms
Autonomy preserved

Slide 10.12: The Revelation

THE REVELATION

Test accuracy is not the only measure.

A test can be accurate and still cause harm.

When overdiagnosis exceeds lives saved,
we must ask: Is finding always helping?

THE LESSON

The harms from false positives and overdiagnosis
can outweigh the benefits of true positives.

Always weigh benefits against harms.
Screening does not always save lives.

========== Story 4: The PSA Paradox ==========

What if finding the disease
is worse than missing it?

What if the treatment causes more suffering
than the disease ever would?

Slide 10.14: Real Data Box

The PSA Paradox

REAL DATA: THE THRESHOLD TRAP

PSA cutoff of 4.0 ng/mL:
• Sensitivity for high-grade cancer: 21%
• Detects many indolent cancers that would never harm

Lower cutoff to 2.5 ng/mL:
• Sensitivity rises to: 40%
• But overdiagnosis doubles

Treatment consequences:
• 20-30% of men experience incontinence after prostatectomy
• 30-70% experience erectile dysfunction

Source: US Preventive Services Task Force. JAMA. 2018;319(18):1901-1913

决策树

You are setting PSA screening policy for a health system. Three paths lie before you.

Which threshold do you choose?

PSA Screening Policy Decision

↓

PATH A: Low Threshold (2.5)

↓

Maximize sensitivity
Thousands of unnecessary
biopsies and treatments

PATH B: High Threshold (4.0)

↓

Miss some cancers
But most missed are indolent
Fewer unnecessary treatments

PATH C: Abandon PSA Screening

↓

Miss aggressive cancers
Some preventable deaths
No overtreatment harm

Slide 10.16: The Revelation

THE REVELATION

There is no "right" cutoff.

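Every threshold trades sensitivity for specificity,
detection for overdiagnosis.

The choice is not medical. It is ethical.
It depends on which harms you are willing to accept.

THE LESSON

The threshold effect is not a technical problem.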
It is a values problem.

Before choosing a cutoff, ask:
What is worse: missing disease or overtreating the healthy?

========== Story 5: The TB Test in Two Populations ==========

Same test. Same cutoff.
Different truths.

How can identical numbers
mean opposite things?

Slide 10.18: Real Data Box

The TB Test in Two Populations

REAL DATA: SAME TEST, DIFFERENT MEANING

Tuberculin Skin Test (10mm cutoff)
Sensitivity: ~80% | Specificity: ~95%

In high-prevalence setting (TB prevalence 10%):
• Positive Predictive Value: about 64%
• A positive test more often than not means TB

In low-prevalence setting (TB prevalence 0.1%):
• Positive Predictive Value: about 1.6%
• A positive test is almost always a false positive

Source: Pai M et al. Lancet Infect Dis. 2014;14(8):765-773
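Predictive values follow from sensitivity, specificity, and prevalence by Bayes' theorem. The sketch below recomputes PPV and NPV for the stated test characteristics (sensitivity 80%, specificity 95%) at the two prevalences on this slide. Function names are illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

for label, prevalence in [("high prevalence (10%)", 0.10), ("low prevalence (0.1%)", 0.001)]:
    ppv, npv = predictive_values(0.80, 0.95, prevalence)
    print(f"{label}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With these inputs the PPV works out to about 64% at 10% prevalence and about 1.6% at 0.1% prevalence: same test, same cutoff, very different meaning.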

The Decision Tree

You are a doctor in London seeing an immigrant from a country with a high TB burden. The TST is positive.

What do you conclude?

Positive TST in Immigrant Patient

↓

PATH A: Apply UK Population PPV

↓

Assume "Probably False Positive"

↓

Miss active TB
Patient infects family
Diagnosis delayed by months

PATH B: Consider Patient's Pre-Test Probability

↓

Recognize PPV Depends on Prevalence

↓

Investigate further
Chest X-ray, sputum
Treat early if confirmed

Slide 10.20: The Revelation

THE REVELATION

Sensitivity and specificity are properties of the test.

PPV and NPV are properties of the population.

The same result means different things
in different people.

THE LESSON

Never interpret a test result without knowing the pre-test probability.

A positive test in a high-risk patient likely means disease.
The same positive in a low-risk patient probably means nothing.

Context is everything.

Slide 10.21: Module Summary

Five Stories, Five Lessons

Theranos: Demand Validation

No peer-reviewed data = no trust, regardless of marketing claims

COVID Rapid Tests: Know Sensitivity Limits

"Not detected" is not the same as "not infected"

Mammography: Weigh Benefits vs. Harms

Finding is not always helping; overdiagnosis causes real harm

PSA: The Threshold is a Values Choice

Every cutoff trades sensitivity for specificity; there is no "right" answer

TB Test: Context Determines Meaning

The same result means different things in different populations

==================== Module 11: References and Quiz ====================

References

Key sources cited in this course

Carreyrou J. Bad Blood: Secrets and Lies in a Silicon Valley Startup. Knopf, 2018.
Dinnes J, et al. Rapid, point-of-care antigen tests for diagnosis of SARS-CoV-2 infection. Cochrane Database Syst Rev. 2022;7:CD013705.
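Independent UK Panel on Breast Cancer Screening. The benefits and harms of breast cancer screening: an independent review. Lancet. 2012;380:1778-1786.
Reitsma JB, et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol. 2005;58:982-990.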
Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med. 2001;20:2865-2884.
Deeks JJ, et al. The performance of tests of publication bias in systematic reviews of diagnostic test accuracy. J Clin Epidemiol. 2005;58:882-893.
Macaskill P, et al. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy. Chapter 10. 2023.
Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21:1539-1558.
US Food and Drug Administration. Warning Letter to Theranos Inc. 2016.
US Preventive Services Task Force. Screening for Prostate Cancer. JAMA. 2018;319(18):1901-1913.
Pai M, et al. Tuberculosis. Lancet Infect Dis. 2014;14(8):765-773.

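In the Theranos story, what was the fundamental problem with their blood tests?

They were too expensive

They gave wrong results that could harm patients

They were too slow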

They required too much blood

A COVID rapid test has 55% sensitivity in asymptomatic people. What does this mean?

55% of positive results are correct

The test works 55% of the time

The test misses 45% of infected asymptomatic people

55% of negative results are wrong

What does "SnNout" mean?

A highly Sensitive test, when Negative, rules OUT disease

A highly Specific test, when Negative, rules OUT disease

A sensitive test should be used for screening

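Sensitivity and specificity are inversely related

Why can't sensitivity and specificity be pooled separately in a meta-analysis?

They use different scales

They are correlated because of the threshold effect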

They come from different populations

It's computationally too difficult

✔

Course Complete

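"Now you know the four outcomes,
the two virtues of a test,
the cruel trade-off of the threshold,
and the art of pooling evidence.

When the next test lies to you,
you will know how to see through it."

When tests lie, now you know.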
