When Tests Lie: The Ultimate DTA Course (V4)

Have you not heard the story of the woman
who promised to change the world with a single drop of blood,
who raised hundreds of millions on a test that never worked?

Palo Alto, 2003

STANFORD UNIVERSITY

A nineteen-year-old dropped out with a vision: hundreds of blood tests from a single drop of blood.

Investors believed. Walgreens believed. The Pentagon believed.

They valued her company at $9 billion.

But the tests gave wrong results. Patients were told they had HIV when they did not. Others were told they were healthy when they were dying.

Carreyrou J. Bad Blood. 2018

The Deception Decision Tree

What Theranos Did vs. What Should Happen

New Diagnostic Test

↓

SHOULD DO

Validate Against Gold Standard

↓

Publish TP/FP/FN/TN

↓

FDA Approval

THERANOS DID

Skip Validation

↓

Hide Failures

↓

Harm Patients

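"When patients were told their blood was normal,
and the test lied,
and the lie was delivered with certainty,
and no one asked for the 2×2 table."

This is why we study diagnostic test accuracy.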

When a test speaks,
there are only four possible truths.

Two are blessings. Two are curses.

What happens when a systematic review trusts every study equally?

REAL DATA

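Sensitivity analyses in DTA systematic reviews consistently show that excluding studies at high risk of bias changes the pooled estimates. In mammography screening, case-control designs with unblinded interpretation tend to inflate sensitivity. The general principle is well documented: applying QUADAS-2 quality assessment and removing biased studies can shift pooled sensitivity by 10-15 percentage points.

QUADAS-2 Mammography Audit

A review team pools 15 mammography DTA studies. Five of them are at high risk of bias because of case-control designs and unblinded interpretation.

PATH A: Pool All Studies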

Include all 15 studies regardless of quality

↓

Biased 2x2 tables inflate TP counts, producing a pooled sensitivity of 87%

OUTCOME: Overconfidence in screening accuracy

PATH B: Apply Quality Assessment

Exclude high risk-of-bias studies using QUADAS-2

↓

Remaining 10 low-RoB studies yield sensitivity of approximately 75%

OUTCOME: Honest numbers guide honest decisions

THE REVELATION

The four outcomes (TP, FP, FN, TN) are only as trustworthy as the studies that generated them. A biased study contaminates the entire 2x2 table.

The Outcome Tree

Every Test Result Has a Reality Behind It

Patient Tested

↓

What Is the Truth?

Has Disease

D+

↓

TP: Test +

FN: Test -

No Disease

D-

↓

FP: Test +

TN: Test -

The Sacred 2×2 Table

HIV Rapid Test Example (Real Data)

	HIV+	HIV-	Total
Test +	98	3	101
Test -	2	895	897
Total	100	898	998

All Truths Derive from This Table

Sensitivity = 98/100 = 98%
Specificity = 895/898 = 99.7%
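The two formulas above, along with the predictive values that the same table supports, can be sketched in a few lines of Python. The class name `TwoByTwo` and its method names are illustrative assumptions; only the four counts come from the HIV rapid test table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a 2x2 diagnostic accuracy table (hypothetical helper)."""
    tp: int  # diseased, test positive
    fp: int  # healthy, test positive
    fn: int  # diseased, test negative
    tn: int  # healthy, test negative

    def sensitivity(self) -> float:
        # Of all the diseased, what fraction tested positive?
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Of all the healthy, what fraction tested negative?
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        # Of all positive results, what fraction were truly diseased?
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        # Of all negative results, what fraction were truly healthy?
        return self.tn / (self.tn + self.fn)


# Counts taken from the HIV rapid test table above.
hiv_table = TwoByTwo(tp=98, fp=3, fn=2, tn=895)
print(f"Sensitivity: {hiv_table.sensitivity():.1%}")
print(f"Specificity: {hiv_table.specificity():.1%}")
print(f"PPV: {hiv_table.ppv():.1%}")
print(f"NPV: {hiv_table.npv():.1%}")
```

Note that the predictive values computed this way hold only at the prevalence of this particular sample (100 of 998).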

"Two outcomes save. Two outcomes harm.
TP, TN: the test spoke true.
FP, FN: the test lied.
Know them by name, for they determine fate."

Have you not heard of the blood that was tested,
found clean,
and given to thousands,
while death swam within it?

The Blood Supply Crisis, 1985

UNITED STATES

When HIV testing began, doctors celebrated: they could now screen the blood supply.

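But the test had a window period: for weeks after infection, the virus was present but undetectable.

The blood was tested. The blood was "negative." The blood was transfused.

8,000-12,000 Americans were infected through transfusion before better tests closed the window.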

CDC. MMWR. 1987;36(49):833-840

The Window Period Decision Tree

Why False Negatives Are Deadly

Person Recently Infected

↓

Time Since Infection?

< 2 weeks

Test NEGATIVE: Virus present!

↓

Blood Donated: Others infected

> 4 weeks

Test POSITIVE: Correctly detected

↓

Blood Discarded: Supply safe

Sensitivity Changes Over Time

0%

Day 1-7
Eclipse period

~50%

Day 14
Seroconversion

~95%

Day 21
Most detected

99.9%

Day 45+
Window closed

THE LESSON

Sensitivity is not fixed. It depends on when you test. A "99% sensitive" test may be 0% sensitive in early infection.
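To make the timeline concrete, the sketch below converts each sensitivity figure into an expected number of missed infections. The cohort of 1,000 infected donors and the names `TIMELINE` and `expected_missed` are illustrative assumptions; only the four sensitivity values are taken from the timeline above.

```python
# Approximate sensitivities read off the timeline above.
TIMELINE = {
    "Day 1-7 (eclipse period)": 0.0,
    "Day 14 (seroconversion)": 0.50,
    "Day 21 (most detected)": 0.95,
    "Day 45+ (window closed)": 0.999,
}


def expected_missed(infected_donors: int, sensitivity: float) -> float:
    """Expected false negatives among donors who are all truly infected."""
    return infected_donors * (1 - sensitivity)


# Hypothetical cohort size, chosen only to make the arithmetic easy to read.
COHORT = 1000
for stage, sensitivity in TIMELINE.items():
    missed = expected_missed(COHORT, sensitivity)
    print(f"{stage}: about {missed:.0f} of {COHORT} infected donors test negative")
```

The same test, applied to the same infected people, misses all of them in the first week and almost none after the window closes.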

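"The test said 'clean,'
because the virus had not yet shown its face.
The blood was shared,
and the infection spread to the innocent."

Have you not heard of the pill given to mothers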
to protect their pregnancies,
that planted cancer in their daughters
twenty years before it bloomed?

The DES Tragedy, 1938-1971

UNITED STATES & EUROPE

Diethylstilbestrol (DES) was given to millions of pregnant women to prevent miscarriage.

No proper clinical trial was ever conducted. Doctors assumed it worked because it seemed reasonable.

Decades later, their daughters developed a rare cancer: clear cell adenocarcinoma of the vagina. A cancer so rare it was a diagnostic signal in itself.

5-10 million women were exposed before the harm came to light.

Herbst AL et al. N Engl J Med. 1971;284:878-881

The Validation Decision Tree

What Should Have Happened

New Medical Intervention

↓

Was It Properly Tested?

YES

Randomized Trial

↓

Long-term Follow-up

↓

Know True Effects: Benefits and harms

NO (DES)

Assumption Only

↓

Widespread Use

↓

Hidden Harm: Discovered too late

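The Diagnostic Signal

When Rarity Becomes Evidence

Clear cell adenocarcinoma of the vagina is so rare in young women that 7 cases in one hospital triggered an investigation.

The cluster itself was the diagnostic signal:
Sensitivity to DES exposure: nearly 100%
If you had this cancer at this age, you had almost certainly been exposed.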

1:1000

Risk of clear cell
cancer in DES daughters

5-10M

Women exposed
worldwide

"The mothers took the pill in hope,
the daughters grew up in its shadow,
and twenty years later the cancer bloomed,
a diagnosis that indicted a generation of medicine."

A test has two virtues and two vices.

Sensitivity: Can it find the sick?

Specificity: Can it protect the healthy?

When a test is used in the real world, can you trust the sensitivity figure from the lab?

REAL DATA

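The BinaxNOW COVID-19 rapid antigen test reported sensitivity of approximately 84-97% in symptomatic individuals in manufacturer studies. However, real-world evaluations found sensitivity as low as 35-64% in asymptomatic individuals, depending on viral load and timing. The Cochrane review of rapid antigen tests (Dinnes 2022) confirmed, across more than 100 study evaluations, an average sensitivity of 73% in symptomatic people and only 55% in asymptomatic people.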

The COVID Rapid Test Paradox: 2020-2021

A university plans to screen asymptomatic students weekly before allowing campus access. They read the manufacturer's claim of high sensitivity.

PATH A: Trust Lab Sensitivity

Rely on manufacturer's high sensitivity figure

↓

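Asymptomatic carriers with low viral loads test negative, attend class, and spread the virus

OUTCOME: False sense of safety; campus outbreaks

PATH B: Demand Real-World Data

Seek studies in the actual target population (asymptomatic students)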

↓

Discover sensitivity is roughly 55% in asymptomatic people; add serial testing and other safeguards

OUTCOME: Layered safety catches more cases

THE REVELATION

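Sensitivity is not a fixed property of a test. It varies with population, disease stage, and setting. Always ask: sensitivity in whom?

Sensitivity: The Hunter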

THE FORMULA

Sensitivity = TP / (TP + FN)

"Of all the sick, how many did we catch?"

Worked Example: COVID PCR Test

Given: 200 infected patients tested

TP = 196 (correctly positive), FN = 4 (missed)

Sensitivity = 196 / (196 + 4) = 196/200 = 98%

Interpretation: Test catches 98 of every 100 infected people

Specificity: The Guardian

THE FORMULA

Specificity = TN / (TN + FP)

"Of all the healthy, how many did we spare?"

Worked Example: Same COVID PCR Test

Given: 1000 uninfected people tested

TN = 999 (correctly negative), FP = 1 (false alarm)

Specificity = 999 / (999 + 1) = 999/1000 = 99.9%

Interpretation: Test correctly clears 999 of every 1000 healthy people

The Mnemonic

When to Use Which Test

What Do You Need?

RULE OUT disease

Use HIGH SENSITIVITY

↓

SnNout: Sensitive Negative = OUT

RULE IN disease

Use HIGH SPECIFICITY

↓

SpPin: Specific Positive = IN

"Sensitivity catches the sick.
Specificity spares the well.
But no test masters both perfectly:
this is the burden we bear."

Have you not seen the doctor
who saw 99% accurate
and believed a positive result meant 99% certainty?

This is the deadliest error in medicine.

The Base Rate Fallacy

THE PUZZLE

A disease affects 1 in 1000 people.
The test has 99% sensitivity and 99% specificity.
A patient tests positive.

What is the probability that they have the disease?

Most doctors say ~99%. The true answer is about 9%.

The Math Revealed

Testing 100,000 People (Prevalence 1/1000)

Step 1: 100 have disease, 99,900 healthy

Step 2: Of 100 sick: 99 test positive (TP), 1 negative (FN)

Step 3: Of 99,900 healthy: 999 test positive (FP), 98,901 negative (TN)

Step 4: Total positives = 99 + 999 = 1,098

PPV = TP / All Positives = 99 / 1,098 = 9%

91% of positive results are false positives!
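The four steps above can be reproduced as natural frequencies. This is a minimal sketch; the function name `natural_frequencies` and the dictionary keys are assumptions made for illustration, while the inputs are the numbers from the puzzle.

```python
def natural_frequencies(population: int, prevalence: float,
                        sensitivity: float, specificity: float) -> dict:
    """Split a tested population into the four 2x2 cells and report PPV."""
    diseased = population * prevalence      # Step 1: who is sick
    healthy = population - diseased
    tp = diseased * sensitivity             # Step 2: sick people who test positive
    fn = diseased - tp
    tn = healthy * specificity              # Step 3: healthy people who test negative
    fp = healthy - tn
    ppv = tp / (tp + fp)                    # Step 4: share of positives that are real
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn, "PPV": ppv}


result = natural_frequencies(100_000, 1 / 1000, 0.99, 0.99)
for cell in ("TP", "FN", "FP", "TN"):
    print(f"{cell}: {result[cell]:,.0f}")
print(f"PPV: {result['PPV']:.0%}")
```

Counting people rather than multiplying probabilities is a deliberate choice here: it mirrors the step-by-step table and tends to be easier to explain to patients.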

Base Rate Calculator

See How Prevalence Changes PPV

Prevalence: 0.1%

Sensitivity: 99%

Specificity: 99%

Positive Predictive Value (PPV): 9%

91% of positive results are false alarms

The Prevalence Decision Tree

Same Test, Different Settings

Test: 99% Sens, 99% Spec

↓

Where Is Testing Done?

General Pop
0.1%

PPV = 9% (91% false +)

High-Risk
10%

PPV = 92% (8% false +)

Confirmatory
50%

PPV = 99% (1% false +)
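The three branches of this tree follow from Bayes' theorem with prevalence as the only moving part. A minimal sketch, assuming a hypothetical helper named `ppv` and the setting labels used in the tree:

```python
def ppv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Positive predictive value from Bayes' theorem."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Prevalence values from the decision tree above.
SETTINGS = {"General population": 0.001, "High-risk": 0.10, "Confirmatory": 0.50}

for setting, prevalence in SETTINGS.items():
    value = ppv(prevalence, sensitivity=0.99, specificity=0.99)
    print(f"{setting}: prevalence {prevalence:.1%}, PPV {value:.0%}")
```

The test never changes in this loop; only the population it is applied to does.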

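"The doctor said '99% accurate,'
the patient heard '99% certain,'
and both were deceived,
because they forgot to ask: how rare is the disease?"

Have you not heard of the machine called revolutionary,
that could find TB in two hours,
but missed drug-resistant strains?

The GeneXpert Story, South Africa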

CAPE TOWN, 2010

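For a century, TB diagnosis required culturing bacteria for weeks. Then came GeneXpert: results in 2 hours.

South Africa deployed it nationwide. The WHO endorsed it.

But in patients with low bacterial loads, often HIV co-infected, sensitivity dropped to 67%. One in three cases missed.

For detecting rifampicin resistance, it missed 5% of resistant cases. These patients received the wrong treatment. Drug-resistant TB spread.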

Steingart KR et al. Cochrane Database Syst Rev. 2014;1:CD009593

TB Diagnosis Decision Tree

When GeneXpert Isn't Enough

Suspected TB Patient

↓

GeneXpert Test

↓

Positive

↓

Rifampicin?

Sensitive: Standard Tx

Resistant: MDR-TB Tx

Negative

↓

HIV+ or High Suspicion?

Yes: Culture needed

No: Likely negative

Sensitivity by Patient Type

98%

Smear-positive
(high bacterial load)

67%

Smear-negative
(low bacterial load)

61%

HIV co-infected
(immune suppressed)

THE LESSON

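A test's sensitivity in clinical trials may not match its sensitivity in your patients. Know your population.

"The machine said 'negative,'
the doctor believed the machine,
the patient went home with tuberculosis,
and coughed resistance into the world."

Have you not heard of the test for men
that found cancers that would never kill,
and led to treatments that destroyed lives?

The PSA Screening Tragedy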

UNITED STATES, 1990s-2010s

PSA (Prostate-Specific Antigen) could detect prostate cancer early.

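Doctors screened millions of men. Cancers were found. Prostates were removed.

But many of these "cancers" would never have caused symptoms. Surgery caused impotence and incontinence in men who would have died of old age, not cancer.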

Moyer VA. Ann Intern Med. 2012;157:120-134

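The PSA Screening Dilemma: 2012

A 60-year-old man asks his doctor about PSA screening. At the 4.0 ng/mL cutoff, PSA has a sensitivity of only about 21% for high-grade cancer, yet it detects many indolent cancers.

PATH A: Screen All Men

Routine PSA screening for all men over 50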

↓

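Per 1,000 men screened over 13 years: 1-2 deaths prevented, but more than 100 false positives and 30-40 men left impotent or incontinent by treatment of indolent cancers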

OUTCOME: Net harm exceeds benefit at population level

PATH B: Shared Decision-Making

Discuss harms and benefits; individualize according to risk factors, life expectancy, and patient values

↓

High-risk men can choose screening; low-risk men can decline; active surveillance replaces immediate surgery for low-grade findings

OUTCOME: Fewer unnecessary treatments; patient autonomy preserved

THE REVELATION

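A test with a high detection rate can do more harm than good when it finds conditions that did not need finding. Overdiagnosis is the hidden cost of high sensitivity for indolent disease.

The Numbers of Harm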

1

Life saved from
prostate cancer
per 1000 screened

30-40

Men made impotent
or incontinent
per 1000 screened

100+

False positives
(biopsies, anxiety)
per 1000 screened

THE REVERSAL

In 2012, the US Preventive Services Task Force recommended against routine PSA screening. The test found too much that did not need to be found.

Patient Decision Aid: PSA Screening

If 1,000 men aged 55-69 are screened for 13 years

Deaths from prostate cancer prevented

1-2 men

Men who will have false positive requiring biopsy

100-120 men

Men diagnosed with a cancer that would never have harmed them

20-50 men

Men left impotent or incontinent from treatment

30-40 men

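Is this trade-off acceptable to you?

"The test found a shadow,
and the surgeon cut,
and the man lived on, impotent and incontinent,
from a cancer that would never have awakened."

Have you not heard of the man with chest pain
whose first troponin was normal,
who was sent home,
and died before morning?

The Troponin Timing Problem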

EMERGENCY DEPARTMENTS WORLDWIDE

Troponin is the gold standard for diagnosing a heart attack. But it takes 3-6 hours to rise after myocardial injury.

A patient arrives one hour after chest pain begins. Troponin is tested: normal. "You're fine. Go home."

The heart was dying. The protein had not yet leaked.

Studies show 2-5% of MI patients sent home from ED die within 30 days.

Pope JH et al. N Engl J Med. 2000;342:1163-1170

Serial Testing Decision Tree

The Two-Troponin Protocol

Chest Pain Patient

↓

First Troponin

↓

Elevated

↓

Treat as MI

Normal

↓

When Did Pain Start?

<6 hrs

Wait 3 hrs: Repeat troponin

>6 hrs

Low risk: Consider d/c

High-Sensitivity Troponin

~70%

Conventional troponin
sensitivity at 0 hrs

~95%

hs-Troponin
sensitivity at 0 hrs

99%

hs-Troponin
at 3 hrs serial

THE TRADE-OFF

High-sensitivity troponin catches more heart attacks early. But it also has more false positives—elevated in kidney disease, heart failure, sepsis, and marathon runners.

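"The test said 'normal,'
because the heart had only begun to die.
The patient was reassured,
and went home to finish dying."

Sensitivity describes the test.
Specificity describes the test.

But the patient asks:
"I tested positive. What are MY chances?"

What if published test sensitivity is higher than the truth, and the likelihood ratios you calculate are wrong as a result?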

REAL DATA

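Rapid strep tests (RADT) showed pooled sensitivity of approximately 86% in the published studies included in a Cochrane review. However, FDA 510(k) regulatory submissions (which include unpublished manufacturer data) show sensitivity estimates of only 70-75%. Studies with higher sensitivity were more likely to be submitted for publication: a classic case of publication bias inflating apparent accuracy.

The Rapid Strep Test Publication Gap

A clinician calculates LR+ from published data (sensitivity 86%, specificity 95%) to decide whether to treat a child's sore throat. But the true sensitivity may be only 70%.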

PATH A: Trust Published Meta-Analysis

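Use LR+ from published data (86/5 = 17.2)

↓

The inflated sensitivity makes LR- look better than it is, so negative results are trusted too much; children with strep are sent home without antibiotics

OUTCOME: Missed strep leads to rheumatic fever risk

PATH B: Seek Regulatory Data

Use LR+ from FDA submissions (70/5 = 14), and note that LR- is worse (0.32 vs 0.15)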

↓

Recognize a negative RADT cannot confidently exclude strep; back up with throat culture when clinical suspicion is high

OUTCOME: Appropriate caution protects children

THE REVELATION

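Likelihood ratios are only as true as the sensitivity and specificity that produced them. Publication bias inflates accuracy, making LR+ too optimistic and LR- too reassuring. Always ask: were unpublished studies missed?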

Likelihood Ratios

POSITIVE LIKELIHOOD RATIO

LR+ = Sensitivity / (1 - Specificity)

How much more likely is a + result in sick vs healthy?

NEGATIVE LIKELIHOOD RATIO

LR- = (1 - Sensitivity) / Specificity

How much more likely is a - result in sick vs healthy?
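Both formulas can be checked against the rapid strep numbers used earlier (sensitivity 86% in published data versus 70% in regulatory data, specificity 95% in both). This is a minimal sketch; the function name `likelihood_ratios` and the variable names are illustrative assumptions.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) for a test with the given accuracy."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


published = likelihood_ratios(sensitivity=0.86, specificity=0.95)
regulatory = likelihood_ratios(sensitivity=0.70, specificity=0.95)

print(f"Published data:  LR+ = {published[0]:.1f}, LR- = {published[1]:.2f}")
print(f"Regulatory data: LR+ = {regulatory[0]:.1f}, LR- = {regulatory[1]:.2f}")
```

A drop in sensitivity moves LR- much further from zero in relative terms than it moves LR+, which is why an inflated sensitivity mostly harms the interpretation of negative results.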

Fagan Nomogram

From Pre-Test to Post-Test Probability

[Figure: Fagan nomogram with three vertical axes: pre-test probability (left), likelihood ratio (center), and post-test probability (right).]

Draw a line from pre-test through LR to find post-test probability
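The line drawn on the nomogram is a graphical shortcut for a three-step calculation: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back. A minimal sketch follows; the function name and the 20% pre-test probability are assumptions chosen for illustration.

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Apply a likelihood ratio to a pre-test probability via odds."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical patient with a 20% pre-test probability of disease.
after_positive = post_test_probability(0.20, likelihood_ratio=10)
after_negative = post_test_probability(0.20, likelihood_ratio=0.1)

print(f"After a positive result (LR+ = 10): {after_positive:.0%}")
print(f"After a negative result (LR- = 0.1): {after_negative:.1%}")
```

The formula breaks down at a pre-test probability of exactly 100%, where the odds are undefined; a real calculator would need to guard against that input.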

Interpreting Likelihood Ratios

How Powerful Is This Test?

LR+ Value?

LR+ > 10: Strong rule-in

5-10: Moderate

2-5: Weak

1-2: Useless

LR- Value?

< 0.1: Strong rule-out

0.1-0.2: Moderate

0.2-0.5: Weak

0.5-1: Useless

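"Sensitivity speaks of the sick.
Specificity speaks of the healthy.
But the likelihood ratio answers:
what does this result mean for this patient?"

Have you not seen the feverish child in the village,
whose rapid test said negative,
while the Plasmodium kept multiplying?

The Malaria RDT Problem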

SUB-SAHARAN AFRICA

Malaria kills 600,000 people yearly, mostly children under 5.

Rapid Diagnostic Tests were meant to guide treatment in remote areas without microscopes or laboratories.

But when parasitemia is low, the RDT misses cases. And when P. falciparum deletes its HRP2 gene, the RDT sees nothing at all.

WHO. Malaria RDT Performance. 2022

Clinical Decision Tree

Child with Fever in Malaria-Endemic Area

Febrile Child

↓

Perform RDT

↓

RDT Positive

↓

Treat for Malaria

RDT Negative

↓

Clinical Suspicion?

High

Treat Anyway or Microscopy

Low

Look for Other Cause

Sensitivity Varies by Parasitemia

95%

High parasitemia
(>200/μL)

75%

Low parasitemia
(100-200/μL)

50%

Very low
(<100/μL)

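The Clinical Lesson

A negative RDT does not rule out malaria in endemic areas. Clinical judgment must override the test when suspicion is high.

"The test said 'negative,'
the child was sent home,
the parasites multiplied within. Night fell,
and by morning the child would not wake."

In the year of the plague,
the world needed a test that was fast.

But fast is not the same as accurate.

When a new generation of assay has higher sensitivity, does that automatically make it better?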

REAL DATA

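High-sensitivity troponin (hs-cTn) assays raised sensitivity for acute myocardial infarction from approximately 70% (conventional troponin at presentation) to more than 95%. But specificity dropped from approximately 95% to around 80% because hs-cTn detects myocardial injury from many non-MI causes (heart failure, sepsis, kidney disease, pulmonary embolism). Understanding the net clinical effect requires HSROC modeling of the trade-off across multiple studies.

The Troponin Generation Shift: 2010s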

An emergency department adopts hs-troponin. More patients now test positive, but many do not have acute MI.

PATH A: Adopt Based on Sensitivity Alone

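Celebrate the jump in MI detection from 70% to more than 95%

↓

More false positives lead to unnecessary catheterizations, hospital admissions, and patient anxiety over troponin elevations that are not caused by MI

OUTCOME: Overdiagnosis and wasted resources

PATH B: Model the Trade-off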

Use serial measurements (0h/1h or 0h/3h protocols) and clinical context to maintain specificity

↓

Rapid rule-out algorithms safely discharge low-risk patients; sensitivity remains high while managing the false positive rate

OUTCOME: Faster, safer triage of chest pain

THE REVELATION

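Sensitivity and specificity trade off against each other. A new-generation test that raises sensitivity usually lowers specificity. The HSROC curve is the tool that reveals whether the net trade-off helps or harms patients.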
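The size of this trade-off depends on how common MI is among the patients tested. The sketch below counts missed MIs and false positives per 1,000 chest pain patients for the two assay generations, using the approximate accuracy figures quoted in this section. The 10% prevalence and all names in the code are assumptions made purely for illustration.

```python
def outcomes_per_1000(prevalence: float, sensitivity: float, specificity: float) -> dict:
    """Expected missed cases and false positives among 1,000 tested patients."""
    diseased = 1000 * prevalence
    healthy = 1000 - diseased
    return {
        "missed_mi": diseased * (1 - sensitivity),
        "false_positives": healthy * (1 - specificity),
    }


# Hypothetical MI prevalence among chest pain patients; not taken from the source.
ASSUMED_PREVALENCE = 0.10

conventional = outcomes_per_1000(ASSUMED_PREVALENCE, sensitivity=0.70, specificity=0.95)
high_sensitivity = outcomes_per_1000(ASSUMED_PREVALENCE, sensitivity=0.95, specificity=0.80)

for name, result in (("Conventional", conventional), ("hs-cTn", high_sensitivity)):
    print(f"{name}: {result['missed_mi']:.0f} missed MIs, "
          f"{result['false_positives']:.0f} false positives per 1,000")
```

Under this assumed prevalence the newer assay misses far fewer MIs at presentation but generates several times as many false positives, which is the work that serial measurement protocols are meant to absorb.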

The Cochrane Verdict

COVID-19 Rapid Antigen Tests (Dinnes 2022 Cochrane Review)

Population	Sensitivity	Missed
Symptomatic	73%	27%
Asymptomatic	55%	45%
First 7 days	80%	20%

Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705

The False Security Decision Tree

Thanksgiving 2020: What Happened

Family Member Tests Negative

↓

Truly Negative?

Not infected

True Negative: Safe to gather

Infected but asymptomatic (45% of such infections are missed)

FALSE Negative: Infectious!

↓

Gathers with family: Grandparents infected

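"The test said 'negative,'
the family embraced,
and by winter's end,
grandfather was buried."

Have you not heard of the screening
that found cancers that would never kill,
and led to treatments that caused more harm than the disease?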

Can you trust a DTA meta-analysis done in a spreadsheet?

REAL DATA

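DTA meta-analysis requires a bivariate model or HSROC, both of which need maximum likelihood estimation of correlated sensitivity and specificity on the logit scale. Manual Excel calculations often introduce errors: the spreadsheet error found in Reinhart and Rogoff's 2010 economics paper showed how a simple mistake can influence global policy debates. In DTA, applying logit transforms by hand in Excel and pooling sensitivity and specificity separately ignores the correlation between them, and can produce pooled estimates that differ markedly from validated bivariate models in software (R mada/reitsma, Stata metandi, SAS NLMIXED).

The QUADAS Excel Error

A research team needs to pool sensitivity and specificity for a DTA systematic review. They have 12 studies. One team member builds an Excel model; another uses R's mada package.

PATH A: Use a Spreadsheet

Pool sensitivity and specificity separately in Excel using simple averages or fixed-effect formulas

↓

Ignores the correlation between sensitivity and specificity; logit transformation errors compound; pooled sensitivity ends up off by about 12 percentage points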

OUTCOME: Wrong numbers published; clinical guidelines misled

PATH B: Use Validated Software

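Use R (mada/reitsma), Stata (metandi), or SAS (NLMIXED) with a bivariate model

↓

A proper bivariate GLMM accounts for the sensitivity-specificity trade-off, produces valid confidence regions, and handles between-study heterogeneity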

OUTCOME: Reproducible, auditable, correct results

THE REVELATION

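DTA meta-analysis is not simple pooling. The bivariate nature of the data (paired sensitivity and specificity) requires specialized statistical software. Spreadsheet errors are not merely inconvenient; they can change clinical practice.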
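The sketch below is not a bivariate model and should not be used for pooling. It only illustrates the quantity that separate pooling throws away: the correlation between logit sensitivity and logit specificity across studies. The five studies are invented for illustration (they mimic a rising threshold), and every name in the code is an assumption.

```python
import math

# Hypothetical (TP, FN, FP, TN) counts for five studies of one test; not real data.
STUDIES = [
    (45, 5, 30, 70),
    (40, 10, 20, 80),
    (35, 15, 12, 88),
    (30, 20, 8, 92),
    (25, 25, 4, 96),
]


def logit(p: float) -> float:
    """Log-odds transform used by bivariate DTA models."""
    return math.log(p / (1 - p))


def pearson(xs: list, ys: list) -> float:
    """Plain Pearson correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


logit_sens = [logit(tp / (tp + fn)) for tp, fn, fp, tn in STUDIES]
logit_spec = [logit(tn / (tn + fp)) for tp, fn, fp, tn in STUDIES]
correlation = pearson(logit_sens, logit_spec)

print(f"Correlation between logit(sens) and logit(spec): {correlation:.2f}")
```

A strongly negative correlation like the one in this invented data set is the signature of threshold variation. Averaging each column on its own yields a pair of numbers that no single study, and possibly no real threshold, ever achieved; fitting the actual bivariate model is best left to the validated packages named above.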

The Overdiagnosis Problem

3-4

Lives saved
per 10,000 screened

50-130

Overdiagnosed
(treated unnecessarily)

~500

False alarms
(anxiety, biopsies)

THE QUESTION

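To save 3-4 lives, an estimated 50-130 women receive surgery, radiotherapy, or chemotherapy for a cancer that would never have harmed them.

Is this trade-off worth it?

Patient Decision Aid: Mammography

If 10,000 women aged 50-69 are screened for 10 years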

Deaths from breast cancer prevented

3-4 women

Women called back for false alarms

~500 women

Unnecessary biopsies

~200 women

Women treated for a cancer that would never have harmed them

~15 women

Is screening right for you?

The Screening Cascade Decision Tree

10,000 Women Screened Over 10 Years

10,000 Women

↓

~1,000 Recalled: Abnormal

↓

~500 False
Alarm

~500 Biopsy
~50 cancer

~9,000 Cleared

Of ~50 Cancers Found

~35 Would Kill: 3-4 saved

~15 Would Never Kill: Overdiagnosed

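"The test found a shadow,
and called it cancer,
and the woman was cut and burned
for a shadow that would never have darkened her days."

Have you not heard of the scan
that finds plaques in the brain,
but cannot tell you
whether the mind will fade?

The Amyloid Paradox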

ALZHEIMER'S RESEARCH, 2010s-2020s

PET scans can now detect amyloid plaques—the hallmark of Alzheimer's.

But 30% of cognitively normal elderly have amyloid plaques. They may never develop dementia.

And 10-20% of people with dementia have no amyloid.

The test finds plaques. But plaques are not the disease. We are testing for a surrogate, not the outcome.

Jack CR et al. Lancet Neurol. 2018;17:760-773

Surrogate vs. Outcome Decision Tree

What Are We Really Testing?

Diagnostic Test

↓

What Does It Detect?

Outcome itself

Direct Diagnosis: e.g., biopsy for cancer

↓

High clinical value

Surrogate marker

Indirect Signal: e.g., amyloid for dementia

↓

Validated link?

Yes: Use cautiously

No: Limited value

"The scan found plaques,
the doctor named it Alzheimer's,
and the patient lived in dread
of a forgetting that might never come."

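Not all studies are equal.

Some are biased.
Some are poorly designed.
Some should not be trusted.

How do we separate the wheat from the chaff?

What if most DTA studies do not even report enough information to judge their quality?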

REAL DATA

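Before the STARD initiative was published in 2003, a systematic assessment found that fewer than half of DTA studies reported whether index test interpretation was blinded, and descriptions of the reference standard were often inadequate. After STARD, reporting improved: several meta-epidemiological evaluations found that adherence to STARD items rose substantially, although many studies still fall short on key items such as flow diagrams and the handling of indeterminate results.

The STARD Revolution: 2003

A team completes a DTA study of a new point-of-care test. They are eager to publish quickly. They have the 2x2 data but have not documented blinding, patient flow, or indeterminate results.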

PATH A: Publish Quickly

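Submit without a STARD flow diagram or complete methods reporting

↓

Readers cannot assess blinding, patient spectrum, or verification. QUADAS-2 assessment rates every domain "unclear." The study may be excluded from future systematic reviews or, worse, included with undue weight.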

OUTCOME: Waste of research; uninterpretable results

PATH B: Follow STARD Guidelines

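Complete the STARD checklist, create a patient flow diagram, report indeterminate results, and describe blinding

↓

Reviewers can fully assess quality. QUADAS-2 domains can be judged. The study contributes meaningfully to systematic reviews and clinical guidelines.

OUTCOME: Credible evidence that advances care

THE REVELATION

If a study does not report its methods, its quality cannot be assessed. STARD ensures that DTA studies are reported completely enough for QUADAS-2 to judge. Incomplete reporting is not neutral; it hides bias.

QUADAS-2: The Quality Checklist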

Four Domains of Risk of Bias

1

Patient Selection

Was a consecutive or random sample enrolled? Was a case-control design avoided?

2

Index Test

Was the index test interpreted without knowledge of the reference standard? Was the threshold pre-specified?

3

Reference Standard

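Is the reference standard likely to classify the condition correctly? Was it interpreted without knowledge of the index test result?

4

Flow and Timing

Was there an appropriate interval between tests? Did all patients receive the same reference standard?

QUADAS-2 Decision Tree

Should You Trust This Study?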

DTA Study

↓

Check All 4 Domains

All Low Risk

High Quality: Trust results

Some Unclear

Moderate: Use with caution

Any High Risk

Low Quality: Results may be biased
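The branching rule in this tree can be written as a small function. This is a simplified sketch of the tree as drawn here, not an official QUADAS-2 scoring algorithm (the tool itself does not prescribe a summary score). The function name, the rating strings, and the example study are all assumptions.

```python
VALID_RATINGS = {"low", "unclear", "high"}


def overall_judgement(domain_ratings: dict) -> str:
    """Collapse four risk-of-bias domain ratings into the tree's three outcomes."""
    ratings = set(domain_ratings.values())
    unknown = ratings - VALID_RATINGS
    if unknown:
        raise ValueError(f"Unknown rating(s): {sorted(unknown)}")
    if "high" in ratings:
        return "Low Quality: results may be biased"
    if "unclear" in ratings:
        return "Moderate: use with caution"
    return "High Quality: trust results"


# Hypothetical study: a case-control design puts patient selection at high risk.
example_study = {
    "patient_selection": "high",
    "index_test": "low",
    "reference_standard": "low",
    "flow_and_timing": "unclear",
}
print(overall_judgement(example_study))
```

Checking for "high" before "unclear" matters: a single high-risk domain outweighs any number of unclear ones in the tree above.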

Common Biases in DTA Studies

!

Verification Bias

Only positive tests get the reference standard → inflates sensitivity

!

Spectrum Bias

Study population differs from clinical reality → results do not generalize

!

Incorporation Bias

Index test is part of reference standard → artificially high accuracy

!

Review Bias

Index test interpreted knowing reference result → inflates both metrics

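"Before you trust the numbers,
ask: How were they gathered?
A biased study speaks with confidence,
but its confidence is a lie."

One study may deceive.
One study may flatter.

But when you gather all the evidence,
the truth becomes harder to hide.

What happens when different studies use different thresholds for the same test and you try to pool them?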

REAL DATA

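D-dimer testing for pulmonary embolism (PE) traditionally used a fixed cutoff of 500 µg/L. The ADJUST-PE trial (Righini et al., JAMA 2014) showed that an age-adjusted cutoff (age × 10 µg/L for patients over 50) increased the proportion of elderly patients with a negative D-dimer result from ~6% to ~30%, with a 3-month VTE risk of only 0.3% in the age-adjusted negative group. DTA meta-analyses of D-dimer studies must use bivariate models because different thresholds create a sensitivity-specificity trade-off visible on the SROC curve.

The D-dimer Threshold Dilemma: ADJUST-PE 2014

An elderly patient (age 75) presents to the emergency department with possible PE. D-dimer is 620 µg/L. With the fixed cutoff, this is positive. With the age-adjusted cutoff (750 µg/L), it is negative.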
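The two cutoff rules in this scenario can be compared directly. A minimal sketch, assuming hypothetical function names and treating a value at or above the cutoff as positive; it illustrates the arithmetic only and is not clinical decision software.

```python
FIXED_CUTOFF = 500  # µg/L


def d_dimer_cutoff(age_years: int) -> int:
    """Age-adjusted cutoff in µg/L: age × 10 for patients over 50, else 500."""
    return age_years * 10 if age_years > 50 else FIXED_CUTOFF


def is_positive(d_dimer: float, age_years: int, age_adjusted: bool = True) -> bool:
    """Classify a D-dimer result under the fixed or the age-adjusted rule."""
    cutoff = d_dimer_cutoff(age_years) if age_adjusted else FIXED_CUTOFF
    return d_dimer >= cutoff


# The 75-year-old patient from the scenario above, with a D-dimer of 620 µg/L.
print("Fixed cutoff positive:", is_positive(620, 75, age_adjusted=False))
print("Age-adjusted cutoff positive:", is_positive(620, 75, age_adjusted=True))
```

The same blood sample lands on opposite sides of the line depending on the rule, which is exactly the threshold variation a meta-analysis has to model.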

PATH A: Use Fixed Cutoff (500 µg/L)

Apply one threshold to all patients regardless of age

↓

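Elderly patients almost always exceed 500 µg/L. Specificity falls below 10% in patients over 80. Nearly every elderly patient undergoes CT pulmonary angiography, with its contrast dye, radiation, and incidental findings.

OUTCOME: D-dimer becomes useless in the elderly

PATH B: Use Bivariate Model with Threshold Covariate

Apply the age-adjusted cutoff; model threshold variation in the meta-analysis

↓

The SROC curve shows that age-adjusted thresholds move along the curve, trading a small amount of sensitivity for a large gain in specificity. The proportion of elderly patients who safely avoid CT imaging rises to about 30%.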

OUTCOME: Fewer unnecessary scans; no missed PEs

THE REVELATION

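Threshold variation is why DTA meta-analysis needs a bivariate model. Different studies use different cutoffs, trading sensitivity against specificity. The SROC curve is the plot of that trade-off.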

Why DTA Meta-Analysis Is Different

THE PROBLEM

Sensitivity and specificity are correlated. When one goes up, the other tends to go down.

You cannot pool them separately the way you would treatment effects. You need a bivariate model.

The SROC Curve

Summary Receiver Operating Characteristic

[Figure: SROC plot. Y-axis: Sensitivity. X-axis: 1 - Specificity (False Positive Rate). Markers: individual studies and the summary estimate.]

Reading the SROC

What Does the Curve Tell You?

SROC Curve Position

↓

Top-Left Corner

Excellent Test: High sens + spec

Near Diagonal

Useless Test: No better than chance

Points Scattered

High Heterogeneity: Investigate sources

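"One study may deceive.
Many studies, weighed together,
trace the path of truth:
the SROC curve that reveals what the test truly does."

But what if the studies disagree?

One says sensitivity is 95%.
Another says 60%.

Which truth do you believe?

What if a test works well in the general population but fails in the patients who need it most?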

REAL DATA

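HRP2-based malaria rapid diagnostic tests (RDTs) achieve sensitivity of approximately 95% in the general population in endemic areas. However, in pregnant women, sensitivity can drop to as low as 56-76% because of placental sequestration of parasites: the parasites hide in the placenta, keeping peripheral blood parasitemia low and below the RDT detection threshold. Cochrane reviews of malaria RDTs found substantial heterogeneity (I² often exceeding 80%) driven by population subgroups such as pregnancy, children under 5, and HIV co-infection.

Malaria RDTs in Pregnancy

A meta-analysis pools 25 malaria RDT studies and reports a pooled sensitivity of 93%. A clinician in an antenatal clinic uses it to reassure pregnant women with a negative RDT.

PATH A: Trust the Overall Pooled Estimate

Apply the 93% sensitivity from the general-population meta-analysis

↓

In pregnant women, true sensitivity may be as low as 56-76%. A large share of infected pregnant women are falsely reassured. Untreated malaria in pregnancy causes severe maternal anemia, low birth weight, and stillbirth.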

OUTCOME: Preventable maternal and neonatal deaths

PATH B: Investigate Heterogeneity by Subgroup

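Run a subgroup meta-analysis for pregnant women; explore I² and the sources of variation

↓

Discover that pregnancy is a major source of heterogeneity. Recommend microscopy confirmation for all RDT-negative pregnant women in endemic areas.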

OUTCOME: Targeted protocols save mothers and babies

THE REVELATION

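Heterogeneity is not just statistical noise. It often signals that the test performs differently in different populations. Ignoring I² and pooling everything together can be fatal for vulnerable subgroups.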

Sources of Heterogeneity

Why Studies Disagree

Same Test, Different Results?

Threshold: Different cutoffs

Population: Severity, age

Setting: Primary vs specialist

Quality: Bias, blinding

Measuring Disagreement: I²

I² < 25%

Low
Studies agree

I² 25-75%

Moderate
Some variation

I² > 75%

High
Major disagreement

THE WARNING

When I² > 75%, the pooled estimate may be meaningless. Explain the disagreement before averaging.
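For readers who want to see where the I² number comes from, the sketch below computes Cochran's Q and I² for sensitivity alone on the logit scale, using inverse-variance weights. This univariate calculation is a simplification (it ignores specificity and threshold effects), the study counts are invented, and the names are assumptions.

```python
import math

# Hypothetical (TP, FN) counts among diseased participants in six studies; not real data.
STUDIES = [(95, 5), (90, 10), (60, 40), (85, 15), (70, 30), (98, 2)]


def i_squared(studies: list) -> float:
    """I² for logit sensitivity, from Cochran's Q with inverse-variance weights."""
    effects, weights = [], []
    for tp, fn in studies:
        effects.append(math.log(tp / fn))        # logit of sensitivity
        weights.append(1 / (1 / tp + 1 / fn))    # inverse of its approximate variance
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    degrees_of_freedom = len(studies) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - degrees_of_freedom) / q)


print(f"I² = {i_squared(STUDIES):.0%}")
```

Studies with a zero cell (no false negatives, for example) would need a continuity correction before the logit can be taken; that step is left out to keep the sketch short.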

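"When studies disagree,
do not silence the dissent.
Ask: Why do they see differently?
The disagreement itself speaks volumes."

Your DTA Toolkit

The Essential Measures and When to Use Them

When an AI claims to diagnose better than doctors, should you trust the overall AUC?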

REAL DATA

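Deep learning models for skin cancer detection have reported AUC values as high as 0.91-0.94 in development datasets. However, external validation revealed alarming disparities: Daneshjou et al. (2022, Science Advances) found that AI dermatology tools performed at near-chance levels on dark skin (Fitzpatrick types V-VI), with AUC as low as 0.50-0.57: essentially random. Training datasets were heavily skewed toward light skin tones, which means the 2x2 table was never properly filled in for all populations.

The AI Dermatology Promise: 2020s

A hospital considers deploying an AI skin cancer screening tool in a dermatology clinic serving a diverse urban population. The manufacturer reports an AUC of 0.94.

PATH A: Deploy Based on Overall AUC

Trust the headline AUC of 0.94 and deploy to all patients

↓

Melanomas on dark skin are missed at a higher rate. The overall sensitivity figure hides a dangerous gap. The patients with the highest mortality from late diagnosis are the ones the AI fails most.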

OUTCOME: Health disparity amplified by technology

PATH B: Demand Fairness-Stratified Evaluation

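Require sensitivity and specificity broken down by skin tone (Fitzpatrick scale), age, and lesion location

↓

Discover the performance gap. Require retraining on diverse datasets or restrict use to validated populations. Pair the AI with dermatologist oversight for under-represented groups.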

OUTCOME: Equitable deployment; no one left behind

THE REVELATION

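A single AUC number can hide dangerous disparities. Emerging AI-based diagnostic tools must be evaluated as rigorously as any diagnostic test: stratified by population, externally validated, and held to STARD and QUADAS-2 standards.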
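A stratified evaluation amounts to building one 2x2 table per subgroup instead of one table overall. The sketch below shows how a reassuring overall sensitivity can coexist with a poor one in a small subgroup. All counts are invented for illustration and are not taken from any study; the names are assumptions.

```python
# Hypothetical (TP, FN, FP, TN) counts per skin-tone group; not real data.
SUBGROUPS = {
    "Fitzpatrick I-II": (90, 10, 20, 180),
    "Fitzpatrick III-IV": (80, 20, 25, 175),
    "Fitzpatrick V-VI": (11, 9, 10, 30),
}


def accuracy(tp: int, fn: int, fp: int, tn: int) -> tuple:
    """Return (sensitivity, specificity) for one 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


def stratified_accuracy(subgroups: dict) -> dict:
    """Sensitivity and specificity for each subgroup separately."""
    return {name: accuracy(*counts) for name, counts in subgroups.items()}


# Overall table: add the four cells across subgroups, then compute once.
overall_counts = [sum(cells) for cells in zip(*SUBGROUPS.values())]
overall = accuracy(*overall_counts)
by_group = stratified_accuracy(SUBGROUPS)

print(f"Overall: sensitivity {overall[0]:.0%}, specificity {overall[1]:.0%}")
for name, (sens, spec) in by_group.items():
    print(f"{name}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Because the smallest subgroup contributes the fewest cases, its poor performance barely moves the overall figure, which is why the overall figure cannot be trusted to reveal it.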

The Checklist

✓

Was there a valid reference standard?

Gold standard applied to ALL patients?

✓

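Were the interpreters blinded?

Test readers unaware of diagnosis?

✓

Is the spectrum appropriate?

Patients similar to your population?

✓

Was the threshold pre-specified?

Or chosen to maximize results?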

When Results Don't Match Suspicion

The Clinical Override Decision Tree

Test Negative, High Suspicion

↓

What Is the LR-?

LR- < 0.1

Strong rule-out: Accept negative

LR- 0.1-0.5

Repeat test: Or different test

LR- > 0.5

Trust judgment: Test is weak
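The three branches of the override tree reduce to two comparisons. A minimal sketch, assuming a hypothetical function name and using the rapid strep LR- values from earlier in the course as inputs; it encodes the tree as drawn, not a clinical guideline.

```python
def interpret_negative(lr_negative: float) -> str:
    """Suggest how to handle a negative result when clinical suspicion is high."""
    if lr_negative <= 0:
        raise ValueError("A likelihood ratio must be positive")
    if lr_negative < 0.1:
        return "Strong rule-out: accept negative"
    if lr_negative <= 0.5:
        return "Repeat test or use a different test"
    return "Trust judgment: test is weak"


# LR- for the rapid strep test: 0.15 from published data, 0.32 from regulatory data.
for source, lr_negative in (("Published", 0.15), ("Regulatory", 0.32)):
    print(f"{source} LR- = {lr_negative}: {interpret_negative(lr_negative)}")
```

Both strep estimates fall in the middle branch, which matches the earlier advice to back up a negative RADT with a throat culture when suspicion is high.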

Sequential Testing Decision Tree

When One Test Isn't Enough

Initial Screening Test

↓

Positive

↓

Confirmatory Test: High specificity

↓

Positive: Diagnose

Negative: False alarm

Negative

↓

Likely negative: If high sens screen

"Armed with sensitivity, specificity, likelihood,
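equipped with the SROC and measures of consistency,
you can see through a test's lies
and judge its truth for yourself."

Have you not heard of the transfusion patient
who received the wrong blood,
not because the test was wrong,
but because no one performed it?

The Test Never Performed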

HOSPITALS WORLDWIDE

ABO blood typing is nearly 100% accurate when performed.

Yet transfusion reactions still kill, not because the test fails, but because of human failure:

• Wrong blood drawn from wrong patient
• Labels switched in the laboratory
• Bedside check skipped in emergency

In the UK, 1 in 13,000 transfusions goes to the wrong patient. The test works. The system fails.

Bolton-Maggs PHB. Transfus Med. 2016;26:303-311

Test vs. System Decision Tree

Where Can Things Go Wrong?

Diagnostic Process

↓

Error Source?

Test itself

Analytical Error: Sens/Spec issue

↓

Better test needed

Pre-analytical

Wrong sample: ID error

↓

System fix needed

Post-analytical

Wrong action: Reporting error

↓

Process fix needed

"The perfect test means nothing
if the wrong blood is drawn,
the wrong label is applied,
the wrong bag is hung."

DTA studies measure test accuracy. They do not measure system accuracy.

References

Key Sources

Carreyrou J. Bad Blood. Knopf, 2018. [Theranos]
CDC. MMWR. 1987;36(49):833-840. [HIV blood supply]
Herbst AL et al. N Engl J Med. 1971;284:878-881. [DES]
Moyer VA. Ann Intern Med. 2012;157:120-134. [PSA]
Pope JH et al. N Engl J Med. 2000;342:1163-1170. [Troponin]
Steingart KR et al. Cochrane Database Syst Rev. 2014;1:CD009593. [GeneXpert]
Dinnes J et al. Cochrane Database Syst Rev. 2022;7:CD013705. [COVID RAT]
UK Panel. Lancet. 2012;380:1778-1786. [Mammography]
Jack CR et al. Lancet Neurol. 2018;17:760-773. [Amyloid]
WHO. Malaria RDT Performance. 2022.
Reitsma JB et al. J Clin Epidemiol. 2005;58:982-990. [Bivariate]
Whiting PF et al. Ann Intern Med. 2011;155:529-536. [QUADAS-2]
Bolton-Maggs PHB. Transfus Med. 2016;26:303-311.

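A test has 99% sensitivity and 99% specificity. Disease prevalence is 1/1000. A patient tests positive. What is the probability that they have the disease?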

99%

90%

About 9%

50%

What does "SnNout" mean?

A highly Sensitive test, when Negative, rules OUT disease

A highly Specific test, when Negative, rules OUT disease

Sensitivity should be used for screening

Specificity should be above 90%

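Why was the blood supply contaminated with HIV despite testing?

The tests had low specificity

Tests had a window period with zero sensitivity in early infection

The tests were not performed correctly

The tests were too expensive

Which QUADAS-2 domain assesses whether the test was interpreted without knowledge of the diagnosis?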

Patient Selection

Index Test

Reference Standard

Flow and Timing

✔

Course Complete

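"Now you know the four outcomes,
the two virtues of a test,
the fallacy of the base rate,
the art of pooling evidence,
and the biases that hide the truth.

When the next test lies to you,
you will know."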