Evidence Reversals: A Meta-Analysis Course

Not every signal is true.

Module 0: Introduction

🎯 Learning Objectives

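Define meta-analysis and explain its role in evidence synthesis
Identify when studies should not be pooled
Describe the evidence hierarchy and where systematic reviews sit
Recognize that meta-analysis can mislead when done poorly
Review the seven principles of this course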

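This course exists because

medicine was wrong.

Not once. Not rarely. Repeatedly. In ways that killed patients who trusted that the evidence was reliable.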

What is Meta-Analysis?

A statistical method for combining the results of multiple independent studies that address the same question.

1976

Term coined by Gene Glass

~50,000

Published per year

#1

Evidence hierarchy*

*When well conducted. Quality of conduct matters more than study design alone — as GRADE recognizes.

Why Pool Studies?

1

Increase Statistical Power

Individual studies may be too small to detect effects.

2

Improve Precision

Narrower confidence intervals around effect estimates.

3

Resolve Disagreement

When studies conflict, pooling can clarify the signal.

4

Explore Heterogeneity

Identify why effects differ across populations or settings.

But meta-analysis can also

MISLEAD

When done poorly, it amplifies bias rather than truth.

When NOT to Pool

1

Studies measure fundamentally different things (apples and oranges)

2

Extreme heterogeneity that cannot be explained

3

One study dominates all others (megastudy problem)

4

Studies have a high risk of bias that cannot be adjusted for

Pooling is a privilege, not a right.

The decision to combine must be defended.

The Evidence Hierarchy

Systematic Reviews & Meta-Analyses of RCTs

Randomized Controlled Trials

Cohort Studies

Case-Control Studies

Case Series / Expert Opinion

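Position in the hierarchy depends on methodological quality, not on study type alone

This course teaches through

evidence reversals.

Each module opens with a story of how medicine went wrong. Then we learn the methods that could have prevented the harm.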

The Seven Principles

These phrases will return throughout your journey:

1. "Not every signal is true."

2. "Methods protect patients from our confidence."

3. "What was hidden in plain sight?"

4. "A number without provenance is not a number."

5. "Heterogeneity is a message, not noise."

6. "Absence of evidence is not evidence of absence."

7. "Certainty must be earned, not assumed."

Module 0 Quiz

1. Why should studies sometimes not be pooled in a meta-analysis?

A. Pooling is always better than single studies

B. When heterogeneity is extreme or studies measure different things

C. Pooling is always appropriate for RCTs

D. Statistical methods handle any situation

2. Where do systematic reviews of RCTs sit in the evidence hierarchy?

A. At the top

B. Same level as individual RCTs

C. Below cohort studies

D. Same as expert opinion

Begin the journey.

Module 1: The Question

Not every signal is true.

This is not a story about error.

This is a story about certainty.

Module 1: The Question

🎯 Learning Objectives

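Formulate a focused PICO question for a systematic review
Distinguish surrogate outcomes from patient-important outcomes
Explain why biological plausibility alone is insufficient evidence
Describe the CAST trial and its impact on evidence-based medicine
Apply the principle: "Not every bright sign is a guide"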

~9,000

excess deaths per year

From a treatment everyone believed worked.

This is a story about how we believed, and how we were wrong.

The Observation

Patients with frequent PVCs after MI had 2-5x higher mortality.

400,000+

MI survivors/year

~40%

with significant PVCs

160,000

at elevated risk

A massive clinical need. A clear target.

The Response

Antiarrhythmic drugs were developed, FDA approved,
and prescribed to ~200,000 patients per year.

There are no villains in this story.

Everyone acted on the best available evidence.

The Logic That Convinced Everyone

PREMISE 1

PVCs after MI predict sudden cardiac death

↓

PREMISE 2

Antiarrhythmic drugs suppress PVCs

↓

PREMISE 3

Suppressing PVCs should prevent sudden death

↓

CONCLUSION

Antiarrhythmics save lives in post-MI patients

The chain was logical. The conclusion felt inevitable.

CAST: The Cardiac Arrhythmia Suppression Trial

Finally, someone asked: "Does suppressing PVCs actually save lives?"

Design

Randomized, double-blind, placebo-controlled

Population

Post-MI patients with asymptomatic PVCs

Intervention

Encainide, flecainide, or moricizine vs placebo

Run-in

Only patients with ≥80% PVC suppression randomized

Primary endpoint

Death or cardiac arrest with resuscitation

Sample size

1,498 patients (encainide/flecainide arms)

The Results: April 1989

The Data Safety Monitoring Board stopped the trial early.

Outcome	Drug (n=755)	Placebo (n=743)
Arrhythmic deaths	33	9
All cardiac deaths	43	16
Total deaths	56	22
Death rate	7.4%	3.0%

Relative Risk of Death: 2.5

95% CI: 1.6 - 4.5 | p < 0.001

Drugs that suppressed the arrhythmia perfectly increased mortality by 150%.
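The headline figures can be re-derived from the death counts in the table. The sketch below is a minimal illustration, assuming a simple Wald interval on the log scale; the function name `risk_ratio` is hypothetical, and because the slide does not say how its interval was computed, this interval need not match the quoted 1.6 - 4.5 exactly.

```python
import math

def risk_ratio(events_a, n_a, events_b, n_b, z=1.96):
    """Risk ratio of group A vs group B with a Wald CI built on the log scale."""
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    rr = risk_a / risk_b
    # Standard error of log(RR) for two independent binomial samples.
    se_log_rr = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper, risk_a - risk_b

# Total deaths from the CAST table: 56 of 755 on drug, 22 of 743 on placebo.
rr, lower, upper, risk_difference = risk_ratio(56, 755, 22, 743)
number_needed_to_harm = 1 / risk_difference
print(f"RR = {rr:.2f} ({lower:.2f} to {upper:.2f}), NNH = {number_needed_to_harm:.0f}")
```

The absolute risk difference is worth reporting next to the ratio: it turns "2.5 times the risk" into roughly one additional death for every 22 patients treated over the trial period.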

The Human Cost

Before CAST, ~200,000 Americans per year received these drugs.

~9,000

excess deaths per year - possibly more

Vietnam War: ~6,000 US deaths/year • These drugs: ~9,000+ deaths/year

For every number, a name we will never know.

Look again.

The Logic, Revisited

PREMISE 1

PVCs after MI predict sudden cardiac death

↓

PREMISE 2

Antiarrhythmic drugs suppress PVCs

← THE LEAP

↓

PREMISE 3

Suppressing PVCs should prevent sudden death

↓

CONCLUSION

Antiarrhythmics save lives in post-MI patients

The assumption that suppressing the marker would fix the outcome was never tested.

What Went Wrong: The Surrogate Trap

1

PVCs were a marker of damaged tissue, not a cause of death

2

The drugs had proarrhythmic effects - triggering deadlier rhythms

3

The surrogate improved while the outcome worsened: a dissociated surrogate

The surrogate did not lie. We asked the wrong question.

The PICO Framework

Every answerable clinical question has four components:

P - POPULATION

Who are the patients? What are their characteristics?

I - INTERVENTION

What treatment or exposure is being evaluated?

C - COMPARATOR

What is the alternative? Placebo? Standard care?

O - OUTCOME

What matters to patients? Hard endpoints vs surrogates.

CAST PICO

Post-MI patients with PVCs | Antiarrhythmics | Placebo | Mortality

🔍

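Investigation Exercise: The Evidence Before CAST

You are a cardiologist in 1988. A patient has survived an MI but has frequent PVCs. The observational literature is clear...

Study	PVC Population	Mortality Risk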
Lown (1977)	High-grade PVCs	2.4x higher
Bigger (1984)	>10 PVCs/hour	3.1x higher
Mukharji (1984)	Complex PVCs	4.8x higher

The signal is clear. The mechanism is plausible. Would you prescribe antiarrhythmics?

Before: Observational Logic

PVCs → Higher mortality

Drugs suppress PVCs

∴ Drugs should reduce mortality

After: CAST RCT (1989)

Death rate on drug: 7.4%

Death rate on placebo: 3.0%

RR = 2.5 (150% increase in deaths)

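The surrogate improved. The patients died. This is why we ask: "What outcome matters?"

Lessons for Evidence Synthesis

1

Biological plausibility is not evidence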

A logical mechanism doesn't guarantee the expected effect.

2

Surrogate endpoints can mislead

Improving a biomarker doesn't prove improvement in outcomes.

3

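Randomized trials provide the strongest causal evidence

Observational data alone can rarely prove causation

4

Consensus is not evidence

200,000 prescriptions, FDA approval, and guidelines were all wrong.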

This is why we do meta-analysis: to see past apparent truths.

Story: DES-II, The Surrogate Tragedy

What if the question you ask decides who lives and who dies?

REAL DATA

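In 1989, cardiologists knew that encainide and flecainide could suppress PVCs. The surrogate endpoint looked perfect: the drugs suppressed PVCs by 80%+ compared with placebo. But CAST randomized 1,498 patients, and the trial was stopped early: 56 deaths in the drug group vs 22 in placebo. Mortality increased 2.5-fold. An estimated ~9,000 excess American deaths per year were attributed to these drugs.

The Cardiologist's Choice: 1987

Your post-MI patient has frequent PVCs. You have a drug that suppresses them completely. What do you do?

PATH A: Treat the Surrogate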

Prescribe encainide — PVCs vanish, the ECG looks clean

↓

The biomarker improves. You feel confident. The patient dies.

OUTCOME: An estimated 50,000+ excess deaths across the US during years of use

PATH B: Demand a Mortality Trial

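Insist: "Show me improved survival, not just a cleaner ECG"

↓

The trial reveals harm. The drugs are withdrawn. Lives are saved.

OUTCOME: The right PICO question prevents disaster

THE REVELATION

The question was never "Can we suppress PVCs?" It was "Does PVC suppression save lives?" The surrogate endpoint answered the wrong question. A proper PICO demands mortality as the outcome from the start.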

What appears certain may be wrong.

What everyone believes may be false.

Methods exist so that patients do not pay the price for our confidence.

That is why you are here.

Module 1 Quiz

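1. What was the fundamental error in the antiarrhythmic logic?

A. The trials were not randomized

B. Treating a surrogate (PVCs) was assumed to improve outcomes

C. The sample size was too small

D. FDA approval was rushed

2. In PICO, what does the "O" stand for, and why does it matter?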

A. Observation - what researchers see

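B. Objective - the study goal

C. Outcome - what matters to patients

D. Organization - the study structure

Not every signal is true.

Methods protect patients from our confidence.

What was hidden in plain sight?

This is a story about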

observational evidence.

Module 2: The Protocol

🎯 Learning Objectives

Explain why protocol pre-registration prevents bias
Identify key elements of a PROSPERO registration
Distinguish healthy user bias from true treatment effects
Describe why observational studies overestimated HRT benefits
Apply the principle: "Methods protect patients from our confidence"

30+

observational studies

All showing hormone replacement therapy protected postmenopausal women from heart disease.

The evidence seemed overwhelming. The conclusion seemed certain.

The Nurses' Health Study

122,000 nurses followed for decades. HRT users had 40-50% lower cardiovascular mortality.

RR 0.56

Cardiovascular mortality

122,000

Women followed

20+ years

Follow-up

Landmark study. Impeccable methodology. Wrong conclusion.

The Hidden Biases

1

Healthy User Bias: Women who chose HRT were healthier, wealthier, better educated

2

Compliance Bias: Women who took HRT consistently also took better care of themselves

3

Prescriber Bias: Doctors gave HRT to healthier women with fewer risk factors

The treatment did not protect them. They were already protected.

WHI: The Women's Health Initiative

The largest randomized trial of HRT ever conducted.

Design

Randomized, double-blind, placebo-controlled

Population

Postmenopausal women aged 50-79

Intervention

Estrogen + Progestin vs Placebo

Sample size

16,608 women

Primary endpoint

Coronary heart disease

Planned duration

8.5 years

The Results: July 2002

Trial stopped early after 5.2 years. Harm exceeded benefits.

Outcome	Hazard Ratio	Direction
Coronary heart disease	1.29	HARM
Stroke	1.41	HARM
Breast cancer	1.26	HARM
Pulmonary embolism	2.13	HARM

Complete Reversal

Overturned 30 years of observational evidence

The Lesson

PRE-SPECIFY

A protocol written before the search begins prevents fishing, prevents bias, prevents hindsight distortion.

Story: The Hormone Timing Hypothesis

What if a treatment works, but only for some?

REAL DATA

WHI showed HRT increased cardiovascular events overall. But later analyses revealed a critical pattern: women who started HRT within 10 years of menopause had REDUCED cardiovascular risk. Women starting 20+ years after menopause had INCREASED risk. The overall null/harm result hid a timing effect.

The Analyst's Dilemma

You are analyzing WHI subgroups. The overall result shows harm. Do you dig deeper?

PATH A: Report Overall Only

Conclude HRT is harmful for all postmenopausal women

↓

Simple message. Guidelines recommend against HRT universally.

OUTCOME: Deny potential benefit to younger menopausal women

PATH B: Pre-Specify Timing Subgroups

Analyze by years since menopause (biologically plausible)

↓

Discover a "window of opportunity" for safe HRT initiation.

OUTCOME: Enable personalized recommendations

THE REVELATION

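Subgroup analysis is dangerous when it is a fishing expedition. It is essential when biology predicts effect modification. The timing hypothesis was biologically plausible and should have been pre-specified.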

PROSPERO Registration

1

Register Before You Search

PROSPERO: International prospective register of systematic reviews

2

Lock In Your Decisions

PICO, search strategy, outcomes, analysis plan - all pre-specified

3

Document Amendments

Changes are allowed, but they must be transparent and justified

4

Prevent Duplication

Check whether your review already exists before you start

Module 2 Quiz

1. Why did the Nurses' Health Study show a benefit of HRT while WHI did not?

A. Nurses' Health had too few patients

B. Healthy user bias in observational studies

C. Nurses' Health had shorter follow-up

D. Different hormone formulations were used

2. What is the primary purpose of PROSPERO registration?

A. To register clinical trials

B. To speed up review completion

C. To pre-specify methods and prevent bias

D. To obtain review funding

Pre-specification is not

It is protection.

Against our own tendency to find what we expect.

Methods protect patients from our confidence.

What was hidden in plain sight?

Module 3: The Search

What was hidden in plain sight?

This is a story about

what they didn't publish.

Module 3: The Search

🎯 Learning Objectives

Develop a comprehensive search strategy using PRESS guidelines
Search multiple databases including grey literature sources
Identify trial registries and regulatory databases (ClinicalTrials.gov, FDA)
Explain how the rosiglitazone case exposed hidden cardiovascular harms
Apply the principle: "What was hidden in plain sight?"

$3.2B

annual sales at peak

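Avandia (rosiglitazone) was one of the world's best-selling diabetes drugs.

The published trials looked reassuring. The unpublished evidence told a different story.

The Published Evidence (Before 2007)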

Published trials showed rosiglitazone effectively lowered HbA1c. Cardiovascular outcomes were rarely reported.

1999

FDA approval

6M+

Patients treated

~0.7%

HbA1c reduction

The surrogate looked good. But what about actual cardiovascular events?

Nissen's Discovery: May 2007

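Dr. Steven Nissen obtained unpublished trial data from GSK's own website.

A legal settlement had required GSK to post its clinical trial results online. Nissen and Wolski analyzed 42 trials, many of which had never been published in a journal.

The data were technically public.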

No one had systematically searched for it.

The Meta-Analysis Results

Outcome	Odds Ratio	95% CI
Myocardial Infarction	1.43	1.03 - 1.98
CV Death	1.64	0.98 - 2.74

43% Increased Risk of Heart Attack

p = 0.03 for myocardial infarction

Published in NEJM. The FDA called an emergency advisory committee meeting.

The FDA Advisory Committee: July 2007

22-1

Voted: CV risk exists

20-3

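Continue marketing with warnings

The committee was divided. Some wanted the drug withdrawn. Some called the meta-analysis flawed.

But the signal could not be ignored.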

The Aftermath

1

Black box warning added for heart failure risk (2007)

2

Severe restrictions on prescribing in the US (2010)

3

Withdrawn entirely from the European market (2010)

4

FDA now requires cardiovascular outcome trials for all diabetes drugs

What a Comprehensive Search Requires

PUBLISHED

PubMed, Embase, CENTRAL, Web of Science

GREY LITERATURE

Conference abstracts, dissertations, regulatory docs

TRIAL REGISTRIES

ClinicalTrials.gov, WHO ICTRP, EU CTR

REGULATORY

FDA, EMA, Health Canada submissions

COMPANY DATA

GSK, Pfizer, Roche clinical trial registries

HAND SEARCH

Reference lists, contact authors, experts

PRESS Checklist

Peer Review of Electronic Search Strategies

1

Translation of the Research Question

Does the search reflect the PICO elements?

2

Boolean and Proximity Operators

Are AND, OR, and NOT used correctly?

3

Subject Headings

Are MeSH/Emtree terms appropriate and exploded?

4

Text Words

Synonyms, spelling variants, truncation?

PRESS Checklist (continued)

5

Spelling, Syntax, Line Numbers

Are there errors that would cause the search to fail?

6

Limits and Filters

Are date, language, and study-design limits appropriate?

Peer-reviewed searches substantially improve retrieval of key studies.

PRESS guideline: McGowan et al., 2016

Database Translation

The same search must be adapted for each database:

PubMed

"diabetes mellitus, type 2"[MeSH] OR "type 2 diabetes"[tiab]

Embase

'non insulin dependent diabetes mellitus'/exp OR 'type 2 diabetes':ti,ab

Subject headings, field tags, and operators differ between databases.

Story: The Tamiflu Transparency Campaign

What happens when you search, but find nothing?

REAL DATA

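Governments stockpiled $9 billion of oseltamivir (Tamiflu) to treat pandemic influenza. The Cochrane Collaboration tried to review the evidence. Of 77 clinical trials, full reports existed for only 20. Roche refused to share the data for 5 years. When the BMJ and Cochrane finally obtained over 160,000 pages of clinical study reports, they found: Tamiflu reduced symptoms by less than 1 day, with no evidence it prevented hospitalizations or complications.

The Reviewer's Dilemma: 2009

You are updating the Cochrane review of Tamiflu. The published trials look positive. But 57 trials have no full report. What do you do?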

PATH A: Analyze What's Published

Use the 20 available trials. Conclude Tamiflu is effective.

↓

Your review supports continued stockpiling. $9 billion is spent on weak evidence.

OUTCOME: Billions wasted, true efficacy unknown

PATH B: Demand the Full Data

Refuse to publish until all trial data is accessible

↓

5-year campaign. 160,000+ pages finally obtained. Truth emerges.

OUTCOME: Evidence policy changed; EMA now publishes all trial reports

THE REVELATION

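A search is only as good as what it can find. When grey literature is hidden behind company walls, even the most comprehensive PubMed search will miss the truth. The Tamiflu saga changed global policy: the EMA now publishes clinical study reports for all drugs.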

If Nissen had searched only PubMed,

the signal would have remained hidden.

Comprehensive search is survival.

What was hidden in plain sight?

Module 3 Quiz

1. What type of evidence source revealed the rosiglitazone cardiovascular signal?

A. Published journal articles

B. Cochrane Library

C. Company clinical trial registry

D. FDA approval documents

2. What does PRESS stand for?

A. Publication Review of Evidence Search Standards

B. Peer Review of Electronic Search Strategies

C. Protocol for Reporting Evidence Synthesis Studies

D. Primary Research Evidence Search System

What was hidden in plain sight?

Module 4: Screening

A number without provenance is not a number.

This is a story about

what they chose to report.

Module 4: Screening

🎯 Learning Objectives

Apply PRISMA flow diagram to document study selection
Implement dual-reviewer screening with conflict resolution
Identify selective outcome reporting and data manipulation
Calculate inter-rater reliability (Cohen's kappa)
Apply the principle: "A number without provenance is not a number"

88,000

heart attacks attributed to Vioxx

A blockbuster drug. A hidden signal. A preventable catastrophe.

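Between 1999 and 2004, millions of people took this painkiller. Some never came home.

The Rise of Vioxx

Rofecoxib (Vioxx) was a COX-2 selective NSAID, marketed as safer for the stomach than traditional painkillers.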

1999

FDA approval

$2.5B

Peak annual sales

80M+

Patients prescribed

The VIGOR Trial (2000)

Vioxx Gastrointestinal Outcomes Research

Design

Randomized, double-blind

Comparison

Vioxx vs Naproxen

Population

Rheumatoid arthritis

Sample

8,076 patients

Primary Outcome

GI events

Published

NEJM, November 2000

What VIGOR Published

GI Outcome	Vioxx	Naproxen
Confirmed GI events	2.1 per 100 pt-yrs	4.5 per 100 pt-yrs
Reduction	54% fewer GI events

Headline: Vioxx is safer for the stomach!

This is what doctors were told. This is what patients believed.

What VIGOR Buried

CV Outcome	Vioxx	Naproxen
Myocardial Infarction	20 events	4 events
Relative Risk	5x higher in Vioxx group

5-fold Increase in Heart Attacks

Mentioned only briefly, attributed to naproxen being "cardioprotective"

Selective Reporting

1

Data cutoff manipulation: 3 additional heart attacks occurred after the cutoff used in publication

2

Spin: The CV signal was explained as naproxen being cardioprotective (without evidence)

3

Outcome switching: CV events were pre-specified but not emphasized

4

Internal knowledge: Merck emails showed the company was aware of the signal

The APPROVe Trial (2004)

A trial of colorectal polyp prevention, stopped early for safety.

RR 1.92

CV events vs placebo

Sept 2004

Vioxx withdrawn

Four years after VIGOR showed a 5x risk. Four years too late.

Story: The Vioxx Decision Tree

What happens when a signal is hidden in the noise?

REAL DATA

Vioxx (rofecoxib) was approved in 1999. By 2004, estimates suggest 88,000-140,000 excess heart attacks and 30,000-40,000 deaths. Merck's own VIGOR trial showed 5x cardiovascular risk in 2000, but it was dismissed as a "naproxen cardioprotective effect."

The Fork in the Road

You are an FDA reviewer in 2001. The VIGOR data show a 5x risk of heart attack with Vioxx compared with naproxen.

PATH A: Accept the Explanation

Believe Merck's hypothesis: naproxen is cardioprotective

↓

No additional safety studies required. Drug stays on market at full speed.

OUTCOME: More than 40,000 deaths over 4 years

PATH B: Demand Evidence

Require a dedicated CV safety trial before continued marketing

↓

Delay or restrict marketing until cardiovascular safety is established.

OUTCOME: Signal detected early, lives saved

THE REVELATION

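The signal was there in 2000. A wrong explanation delayed action by 4 years. An alternative hypothesis, accepted without evidence, cost tens of thousands of lives.

The PRISMA Flow Diagram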

Every step of screening must be documented and transparent.

Identification

Records from databases + other sources

↓

Screening

Title/abstract review (duplicates removed)

↓

Eligibility

Full-text assessment (with exclusion reasons)

↓

Included

Studies in synthesis

Dual Screening: Why Two Reviewers?

1

Reduces Selection Bias

One reviewer might unconsciously favor certain studies

2

Catches Errors

Fatigue, misreading, and mistakes are inevitable

3

Forces Explicit Criteria

Disagreements reveal ambiguity in inclusion rules

Typical agreement: κ = 0.6-0.8

Disagreements resolved by discussion or third reviewer

Calibration: The Pilot Phase

Before screening thousands of records, reviewers should calibrate on a sample of 50-100 records.

1

Screen the same set independently

2

Compare decisions and discuss disagreements

3

Refine inclusion criteria until κ > 0.7

4

Document the calibration process and any rule changes
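The κ thresholds used during calibration come from Cohen's kappa, which corrects raw agreement for the agreement two reviewers would reach by chance. A minimal sketch for the two-reviewer include/exclude case follows; the function name and the pilot counts are hypothetical and chosen only for illustration.

```python
def cohens_kappa(both_include, a_only, b_only, both_exclude):
    """Cohen's kappa for two reviewers making include/exclude decisions."""
    total = both_include + a_only + b_only + both_exclude
    observed = (both_include + both_exclude) / total
    # Marginal include rates for each reviewer.
    a_include = (both_include + a_only) / total
    b_include = (both_include + b_only) / total
    # Agreement expected if the two reviewers decided independently.
    expected = a_include * b_include + (1 - a_include) * (1 - b_include)
    return (observed - expected) / (1 - expected)

# Hypothetical pilot of 100 records: 20 included by both, 70 excluded by both,
# and 5 included by only one reviewer in each direction.
kappa = cohens_kappa(both_include=20, a_only=5, b_only=5, both_exclude=70)
print(f"kappa = {kappa:.2f}")
```

In this made-up pilot the reviewers agree on 90% of records, yet kappa is lower than that because most records are easy excludes that inflate raw agreement.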

PRISMA 2020 Updates

New in 2020

Separate reporting of database vs register searches

New in 2020

Use of automation tools must be reported

New in 2020

Citation searching documented separately

New in 2020

Reasons for exclusion at full-text mandatory

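PRISMA 2020 substantially revised the checklist, expanding the reporting of synthesis methods, certainty assessment, and protocol registration.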

If Vioxx's cardiovascular data had been screened by independent reviewers,

if all pre-specified outcomes had been required to be reported,

88,000 heart attacks might have been prevented.

A number without provenance is not a number.

Module 4 Quiz

1. In the VIGOR trial, what was the relative risk of MI in the Vioxx group compared with naproxen?

A. 1.5x higher

B. 2x higher

C. 5x higher

D. 10x higher

2. Why is dual screening (two independent reviewers) important?

A. It makes screening faster

B. It reduces selection bias and catches errors

C. It reduces the number of studies to review

D. It allows reviewers to skip full-text review

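A number without provenance is not a number.

Module 5: Extraction

A number without provenance is not a number.

This is a story about

numbers that never existed.

Module 5: Extraction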

🎯 Learning Objectives

Design a standardized data extraction form with provenance fields
Calculate effect sizes from various reported statistics (OR, RR, HR, SMD)
Implement dual-extraction with discrepancy resolution
Identify red flags for data fabrication and misconduct
Explain how the DECREASE fraud affected clinical guidelines

~10,000

possible excess deaths in Europe

from guidelines built on fabricated clinical trial data.

The DECREASE trials shaped perioperative care worldwide. The data were invented.

Don Poldermans: A Star Researcher

Professor at Erasmus Medical Center, Rotterdam. Author of over 500 papers. Lead author of ESC guidelines on perioperative cardiac care.

500+

Publications

DECREASE

Trial series I-VI

ESC

Guideline chair

A seemingly impeccable source. Until someone looked at the data.

The DECREASE Trials: The Claims

Trial	Finding	Impact
DECREASE-I (1999)	90% reduction in cardiac death	Changed guidelines
DECREASE-IV (2009)	Beta-blockers safe in low-risk	Expanded recommendations

Effect sizes were implausibly large.

90% reduction? Almost nothing in medicine works that well.

The Investigation: 2011

1

Erasmus MC investigated after whistleblower complaints

2

Fabricated patient data: Patients who didn't exist or weren't enrolled

3

No informed consent: Many "participants" never consented

4

Poldermans dismissed: From Erasmus MC in 2011

A Cascade of Harm

When DECREASE was removed from the meta-analysis...

Benefit → Harm

Direction reversed

27% ↑

Stroke risk increase

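The POISE trial (2008) had already shown harm. It was dismissed because it conflicted with DECREASE.

Why Wasn't It Caught?

1

Trust in authority: Poldermans was a guideline author reviewing his own evidence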

2

No data verification: No one requested individual patient data

3

Publication prestige: Published in top journals, assumed valid

4

Implausible effects accepted: 90% reductions should raise suspicion

Data Extraction: Defense Against Fraud

1

Dual Extraction

Two extractors independently - catches transcription errors and forces scrutiny

2

Record Provenance

Table, page, paragraph - every number traceable to source

3

Verify Against Registry

ClinicalTrials.gov results vs publication: discrepancies are red flags

4

Request IPD

Individual patient data reveals what aggregate summaries hide

Effect Size Calculation

During extraction, you compute effect sizes from the reported data:

BINARY OUTCOMES

Odds Ratio, Risk Ratio, Risk Difference from 2x2 tables

CONTINUOUS OUTCOMES

Mean Difference, Standardized Mean Difference from means and standard deviations

Always extract from the most reliable source.

Prefer: ITT results > per-protocol > subgroups

Red Flags During Extraction

!

Implausible effect sizes: 80-90% reductions should prompt scrutiny

!

Baseline imbalances: Groups that are matched "too perfectly"

!

Round numbers: "Exactly 50" or "exactly 100" patients per arm

!

Registry discrepancies: Published N vs registered N

Researcher

Effect Size Conversions

Studies report results in different metrics. To pool them, you often need to convert:

From	To	Formula
SMD (d)	log-OR	log-OR = d × π / √3
log-OR	SMD (d)	d = log-OR × √3 / π
Correlation (r)	Fisher z	z = 0.5 × ln((1+r)/(1−r))
OR	RR	RR = OR / (1 − P₀ + P₀ × OR)
OR	NNT	NNT = 1 / (P₀ − OR×P₀ / (1−P₀+OR×P₀))

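P₀ = baseline risk in the control group. These formulas hold under approximating assumptions; see Borenstein et al. (Chapter 7) for the exact derivations.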
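The conversions in the table translate directly into small helper functions. The sketch below mirrors the formulas row by row; the function names are hypothetical, and the same approximating assumptions apply (for example, the d to log-OR conversion assumes an underlying logistic distribution).

```python
import math

def smd_to_log_or(d):
    """SMD (d) to log odds ratio: log-OR = d * pi / sqrt(3)."""
    return d * math.pi / math.sqrt(3)

def log_or_to_smd(log_or):
    """Log odds ratio to SMD (d): d = log-OR * sqrt(3) / pi."""
    return log_or * math.sqrt(3) / math.pi

def fisher_z(r):
    """Correlation r to Fisher z."""
    return 0.5 * math.log((1 + r) / (1 - r))

def or_to_rr(odds_ratio, p0):
    """Odds ratio to risk ratio, given baseline (control-group) risk p0."""
    return odds_ratio / (1 - p0 + p0 * odds_ratio)

def or_to_nnt(odds_ratio, p0):
    """Odds ratio to number needed to treat, given baseline risk p0."""
    treated_risk = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)
    return 1 / (p0 - treated_risk)

# Illustrative inputs only: OR = 0.5 at a 20% baseline risk.
print(or_to_rr(0.5, 0.2), or_to_nnt(0.5, 0.2))
```

The last two functions make the dependence on P₀ explicit: the same odds ratio implies a different risk ratio and a different NNT at every baseline risk, which is why P₀ should be extracted and recorded alongside the effect estimate.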

Researcher

Time-to-Event (Survival) Data

Many trials report time-to-event outcomes using hazard ratios (HR). Pooling HRs in meta-analysis requires special handling:

1

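The log(HR) + SE Method

Extract log(HR) and its SE from the trial. If the SE is not reported, derive it from the CI: SE = (ln(upper) − ln(lower)) / (2 × 1.96). Pool using the standard inverse-variance method.

2

When the HR Is Not Reported

Methods exist to reconstruct IPD from Kaplan-Meier curves (Guyot et al., 2012) or to estimate the HR from p-values and event counts (Parmar et al., 1998). Always prefer a directly reported adjusted HR when one is available.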

HR < 1 favors treatment; HR > 1 favors control. Do not convert HRs to ORs or RRs—they measure fundamentally different quantities.
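The CI-to-SE step and the inverse-variance pooling it feeds can be sketched in a few lines. As a worked input, the example uses the ACCORD all-cause mortality hazard ratio quoted later in this course, 1.22 (1.01–1.46); the function names are hypothetical, and the sketch assumes the reported interval is a symmetric 95% CI on the log scale.

```python
import math

def log_hr_and_se(hr, ci_lower, ci_upper, z=1.96):
    """Return log(HR) and its SE, with the SE derived from a reported CI."""
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    return math.log(hr), se

def inverse_variance_pool(log_effects, std_errors):
    """Fixed-effect inverse-variance pooled log effect and its SE."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))

log_hr, se = log_hr_and_se(1.22, 1.01, 1.46)
print(f"log(HR) = {log_hr:.3f}, SE = {se:.3f}")
```

A quick sanity check after extraction is to rebuild the interval as exp(log(HR) ± 1.96 × SE) and compare it with the published one; a clear mismatch suggests a transcription error or an interval that is not a 95% CI.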

Story: The Boldt Colloid Scandal

What if the data you extracted were never real?

REAL DATA

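Joachim Boldt was among the most prolific researchers in anesthesia fluid management. More than 180 of his publications were retracted, one of the largest retraction cases in medical history. His fabricated data suggested that hydroxyethyl starch (HES) was safe. Meta-analyses that included his studies concluded that HES was harmless. When Boldt's studies were removed, the pooled effect reversed: HES increased kidney injury by 59% (RR 1.59, 95% CI 1.26-2.00) and mortality by ~9% (RR 1.09). An estimated thousands of patients received a harmful fluid based on fabricated evidence.

The Extractor's Vigilance: 2010

You are extracting data for a fluid resuscitation meta-analysis. Boldt's studies dominate the literature (90+ papers). A whistleblower has raised concerns. What do you do?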

PATH A: Extract as Published

Trust peer-reviewed publications. Extract Boldt's data like any other.

↓

Your meta-analysis shows HES is safe. Guidelines recommend it.

OUTCOME: Thousands receive a nephrotoxic fluid

PATH B: Verify Provenance

Cross-check ethics approvals, request source data, and run a sensitivity analysis excluding the suspect studies

↓

Discover missing ethics approvals. Flag studies. Re-analyze without them.

OUTCOME: True signal emerges — HES causes harm

THE REVELATION

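Provenance is not bureaucracy. It is the difference between evidence and fiction. Every extracted number must trace back to an ethics-approved study with verifiable patient data. Without provenance, a number with no owner can become a weapon.

Every number in a meta-analysis

must trace back to a verifiable source.

A number without provenance is not a number.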

Fraudulent data can kill as surely as fraudulent drugs.

Module 5 Quiz

1. What happened when the DECREASE trial data were removed from the beta-blocker meta-analysis?

A. The benefit became even larger

B. No change in conclusions

C. The direction reversed to show potential harm

D. The results became inconclusive

2. Why should dual extraction be standard practice?

A. It catches transcription errors and forces scrutiny

B. It makes extraction faster

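C. It helps find more studies

D. It reduces the amount of work needed

A number without provenance is not a number.

Module 6: Bias

Methods protect patients from our confidence.

This is a story about

the biases we cannot see.

Module 6: Bias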

🎯 Learning Objectives

Apply Risk of Bias 2.0 (RoB 2) to randomized trials
Apply ROBINS-I to non-randomized studies
Assess all five RoB 2 domains (randomization, deviations, missing data, measurement, selection)
Distinguish confounding by indication from true treatment effects
Explain how BART revealed hidden harms of aprotinin

20+

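years on the market

Aprotinin was the gold standard for reducing surgical bleeding.

Then someone ran a randomized controlled trial. It was not.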

The Hidden Bias: Confounding by Indication

1

Sicker patients got aprotinin: Surgeons used it in complex, high-risk cases

2

Survivors bias: Dead patients can't report complications

3

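Publication bias: Negative studies went unpublished

Observational studies could not separate the drug's effect from the patients' baseline risk.

BART: The Randomized Truth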

Blood Conservation Using Antifibrinolytics in a Randomized Trial

Outcome	Aprotinin	Alternatives
30-day mortality	6.0%	3.9%
Relative Risk	1.53 (53% increased death)

Trial Stopped Early for Harm

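Withdrawn from the market in November 2007

🔍

Investigation: Assessing Bias

You are reviewing the observational studies. Apply risk-of-bias thinking: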

Question	Observational	BART (RCT)
Random allocation?	❌ Surgeon choice	✓ Yes
Baseline comparable?	❌ Sicker got drug	✓ Balanced
Blinding?	❌ Open label	✓ Double-blind

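Confounding by indication: Surgeons gave aprotinin to the sickest patients. Observational studies attributed survival to the drug when they were in fact measuring survivor bias.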

Risk of Bias 2.0: The Five Domains

D1

Randomization Process

D2

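Deviations from Intended Interventions

D3

Missing Outcome Data

D4

Measurement of the Outcome

D5

Selection of the Reported Result

ROBINS-I: For Non-Randomized Studies

When RCTs are unavailable, use ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions)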

1

Confounding

Baseline differences between groups

2

Selection of Participants

Exclusions related to intervention

3

Classification of Interventions

Misclassification of exposure status

4

Deviations from Intended Interventions

Co-interventions, contamination

5

Missing Data

Differential loss to follow-up

6

Measurement of Outcomes

Ascertainment bias

7

Selection of Reported Result

Selective reporting

Ratings: Low / Moderate / Serious / Critical / No information

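Story: The Aprotinin BART Trial

What happens when 64 studies agree, and all of them are wrong?

REAL DATA

Aprotinin was used in cardiac surgery to reduce bleeding for 20 years. 64 small randomized trials suggested it was safe and effective. Meta-analyses confirmed the benefit. Then the BART trial (2008) randomized 2,331 patients: aprotinin vs. tranexamic acid vs. aminocaproic acid. Result: aprotinin increased mortality by 53% (RR 1.53, 95% CI 1.06-2.22). The trial was stopped early for harm. Bayer withdrew aprotinin from the market within months.

The Surgeon's Evidence: 2006

You are a cardiac surgeon choosing an antifibrinolytic. 64 small trials favor aprotinin, but none was powered to detect mortality. A large RCT (BART) is enrolling. Do you wait?

PATH A: Trust the Meta-Analysis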

64 trials can't all be wrong. Continue prescribing aprotinin.

↓

The small trials measured bleeding, not death. None was adequately powered for mortality. The meta-analysis pooled underpowered surrogate outcomes.

OUTCOME: Excess deaths in cardiac surgery patients

PATH B: Assess Risk of Bias First

Rate all 64 trials with RoB. Note that they are small, use surrogate outcomes, and have high attrition. Wait for an adequately powered randomized controlled trial.

↓

BART reveals the truth. Switch to safer alternatives.

OUTCOME: Lives saved by demanding adequately powered evidence

THE REVELATION

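Quantity of evidence does not equal quality. Sixty-four underpowered trials measuring the wrong outcome do not outweigh one adequately powered trial measuring mortality. Risk-of-bias assessment is not a formality: it is the barrier between patients and misleading conclusions drawn from small, surrogate-driven evidence.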

Sixty-four small trials measured bleeding, not death.

One adequately powered trial revealed 53% increased mortality.

Quantity of evidence is no substitute for quality and power.

Module 6 Quiz

1. Why did 64 small trials miss aprotinin's harm?

A. Underpowered for mortality; used surrogate outcomes

B. Confounding by indication

C. Outcome measured incorrectly

D. Follow-up too short

Methods protect patients from our confidence.

Module 7: Synthesis

Heterogeneity is a message, not noise.

The Magnesium Controversy: 1991-1995

When pooling leads us astray.

Module 7: Synthesis

🎯 Learning Objectives

Calculate pooled effect sizes using fixed-effect and random-effects models
Choose between DerSimonian-Laird and HKSJ estimators appropriately
Interpret forest plots including weights, confidence intervals, and diamonds
Explain why small-study effects can mislead meta-analyses
Apply the principle: "Heterogeneity is a message, not noise"

The Year: 1991

"You stand at the crossroads of hope and evidence..."

Heart disease kills more people worldwide than any other cause. In 1991, a new hope emerges: Could something as simple and cheap as intravenous magnesium save lives after myocardial infarction?

The biological rationale was sound:

Magnesium stabilizes cardiac membranes, prevents arrhythmias, and vasodilates coronary arteries.

LIMIT-2: The Landmark Trial

Leicester Intravenous Magnesium Intervention Trial, 1992

2,316

Patients enrolled

24%

Mortality reduction

p = 0.04

Statistically significant

A cheap, safe intervention that could save 250,000 lives per year globally.

The medical community was electrified.

The Meta-Analysis: 1993

Researchers pooled seven randomized trials of IV magnesium in MI:

Trial	Year	N	Odds Ratio
Morton 1984	1984	40	0.10
Rasmussen 1986	1986	273	0.35
Smith 1986	1986	400	0.48
Abraham 1987	1987	94	0.87
Shechter 1990	1990	103	0.27
Ceremuzynski 1989	1989	48	0.22
LIMIT-2	1992	2,316	0.74

🔍

Investigation Exercise: The Meta-Analyst's Dilemma

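You are a Cochrane reviewer in 1993. You have been asked to synthesize the evidence on magnesium for myocardial infarction. The data from seven trials are in front of you.

Do you see the pattern in this forest plot?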

Pooled OR = 0.44 (95% CI: 0.27–0.71)

55% mortality reduction! Publish in the Lancet?

But wait... did you notice anything about the trial sizes?

Warning Signs

What should have given us pause?

1

Small sample sizes: Six of seven trials had <500 patients

2

Extreme effects: OR of 0.10 (90% reduction) is implausible for any drug

3

All positive: Where are the negative trials? The file-drawer problem...

4

Funnel asymmetry: Small trials showed much larger effects than larger ones

🔍

The Funnel Plot Test

Before pooling, we must check for publication bias. Let's examine the funnel plot.
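A common numeric companion to the funnel plot is Egger's regression: each study's standardized effect (effect / SE) is regressed on its precision (1 / SE), and an intercept far from zero suggests small-study asymmetry. The sketch below returns only the point estimates, with no standard error or p-value. The log odds ratios and standard errors are hypothetical, chosen to mimic three small "positive" trials and one large null trial; they are not the magnesium data.

```python
def egger_regression(log_effects, std_errors):
    """Ordinary least squares of (effect / SE) on (1 / SE).

    Returns (intercept, slope). The intercept is the asymmetry measure.
    """
    precision = [1 / se for se in std_errors]
    standardized = [y / se for y, se in zip(log_effects, std_errors)]
    n = len(precision)
    mean_x = sum(precision) / n
    mean_z = sum(standardized) / n
    sxx = sum((x - mean_x) ** 2 for x in precision)
    sxz = sum((x - mean_x) * (z - mean_z)
              for x, z in zip(precision, standardized))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return intercept, slope

# Hypothetical inputs: three small trials with large benefits, one large null trial.
log_odds_ratios = [-1.2, -1.0, -0.8, 0.05]
standard_errors = [0.6, 0.5, 0.5, 0.05]
intercept, slope = egger_regression(log_odds_ratios, standard_errors)
print(f"Egger intercept = {intercept:.2f}")
```

With so few studies the test has little power, so a near-zero intercept is not reassurance; it is one input alongside visual inspection of the funnel plot and a search for unpublished trials.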

The Year: 1995 (ISIS-4 Reports)

"And then the truth came out..."

The Fourth International Study of Infarct Survival (ISIS-4) enrolled 58,050 patients across 1,086 hospitals in 31 countries.

58,050

Patients

2,216

Deaths in Mg group

2,103

Deaths in placebo

OR = 1.06 (95% CI: 1.00–1.12)

No benefit. If anything, a trend toward harm.

📊

Before and After: The Full Picture

Watch what happens when we add the large trial to our forest plot...

BEFORE ISIS-4

7 small trials (N = 3,274)

OR = 0.44

Strong benefit signal

AFTER ISIS-4

8 trials (N = 61,324)

OR = 1.02

No effect

Why Did Small Trials Mislead?

1

Publication Bias

Small negative trials were never published—they sat in file drawers

2

Small-Study Effects

Smaller trials tend to show larger effects due to methodological weaknesses

3

Random High Bias

By chance, some small trials reached extreme results, and those were the ones that got published

4

Random-Effects Amplification

Random-effects models give more weight to small trials, amplifying bias

Fixed vs. Random Effects

Which model should you choose?

FIXED EFFECT MODEL

Assumes one true effect. Weights studies by inverse variance (precision). Large trials dominate.

Magnesium result: OR = 0.96 (p = 0.52)

RANDOM EFFECTS MODEL

Assumes distribution of effects. Gives more weight to small trials. Wider confidence intervals.

Magnesium result: OR = 0.59 (p = 0.01)

⚠️ Model choice determined the conclusion!

Random effects do not fix bias.
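Why the two models can disagree is easiest to see in the weights. The sketch below pools the same studies under a fixed-effect model and under a random-effects model with the DerSimonian-Laird τ² estimate. The inputs are hypothetical (three small trials with large benefits, one large null trial) and are not the magnesium trial data; the function name is also an assumption.

```python
def pool_fixed_and_random(log_effects, std_errors):
    """Inverse-variance pooling under fixed-effect and DerSimonian-Laird models."""
    w = [1 / se ** 2 for se in std_errors]
    fixed = sum(wi * yi for wi, yi in zip(w, log_effects)) / sum(w)

    # Cochran's Q and the DerSimonian-Laird between-study variance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_effects))
    df = len(log_effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights add tau2 to every study's variance.
    w_star = [1 / (se ** 2 + tau2) for se in std_errors]
    random = sum(wi * yi for wi, yi in zip(w_star, log_effects)) / sum(w_star)

    return {
        "fixed": fixed,
        "random": random,
        "tau2": tau2,
        "fixed_weights": [wi / sum(w) for wi in w],
        "random_weights": [wi / sum(w_star) for wi in w_star],
    }

# Hypothetical log odds ratios and standard errors; the last study is the large one.
result = pool_fixed_and_random([-1.2, -1.0, -0.8, 0.05], [0.6, 0.5, 0.5, 0.05])
print(f"fixed = {result['fixed']:.2f}, random = {result['random']:.2f}")
print(f"large-trial weight: fixed {result['fixed_weights'][-1]:.0%}, "
      f"random {result['random_weights'][-1]:.0%}")
```

In this made-up example the large trial carries almost all of the fixed-effect weight but only about a third of the random-effects weight, so the random-effects estimate is pulled toward the small trials. If those small trials are a biased sample, the model amplifies the bias.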

Lessons from Magnesium

1. Check for publication bias before trusting a pooled estimate. Funnel plots and Egger's test are your tools.

2. Be wary of small-study effects. If only small trials show benefit, wait for a large, well-conducted trial.

3. Model choice matters. Random effects can amplify biased evidence. Consider both models and understand what each implies.

4. One large trial can overturn many small ones. This is why large trials like ISIS-4 are so valuable.

Researcher

Special Study Designs in Meta-Analysis

Not all RCTs use a standard parallel-group design. Two common alternatives need special handling when pooling:

1

Cluster-Randomized Trials

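Groups (hospitals, schools) are randomized rather than individuals. The design effect = 1 + (m−1) × ICC reduces the effective sample size. Divide N by the design effect before pooling, or use the adjusted SE from the trial. Ignoring clustering artificially narrows the CI.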

2

Crossover Trials

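Every patient receives both treatments. The paired design reduces variance, but you need the within-patient correlation (or the SE from the paired analysis) to pool correctly. Using the parallel-group SE is conservative; using the wrong N double-counts patients.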

See the Cochrane Handbook v6.4, Chapter 23, for detailed formulas and worked examples.
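The cluster-trial adjustment described above is a one-line calculation. The sketch below applies the design effect to a sample size; the trial size, cluster size, and ICC are hypothetical values chosen for round numbers, and the function names are assumptions.

```python
def design_effect(mean_cluster_size, icc):
    """Design effect = 1 + (m - 1) * ICC."""
    return 1 + (mean_cluster_size - 1) * icc

def effective_sample_size(n, mean_cluster_size, icc):
    """Sample size after dividing by the design effect."""
    return n / design_effect(mean_cluster_size, icc)

# Hypothetical cluster trial: 1,000 patients, 21 per cluster on average, ICC = 0.05.
deff = design_effect(21, 0.05)
n_eff = effective_sample_size(1000, 21, 0.05)
print(f"design effect = {deff:.2f}, effective N = {n_eff:.0f}")
```

Even a small ICC matters when clusters are large: in this example a modest ICC of 0.05 halves the information the trial contributes.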

Story: The Early Surfactant Reversal

What if the way you combine studies decides whether a treatment saves lives or is useless?

REAL DATA

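Early surfactant for premature infants was supported by 6 small trials showing reduced mortality (RR 0.84). A fixed-effect meta-analysis confirmed benefit (p=0.04). But a random-effects model showed no significance (p=0.12); the confidence interval crossed 1.0. Later, SUPPORT (2010) and VON (2012), two large pragmatic trials with ~2,000 neonates combined, found no benefit of early over late surfactant. Clinical practice had already changed on the basis of small trials and the wrong model.

The Neonatologist's Model Choice: 2005

You are updating the Cochrane review of early surfactant. Six small trials show a benefit under a fixed-effect model. The random-effects model is not significant. Which do you report?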

PATH A: Report Fixed-Effect Only

Fixed-effect is significant. Report the positive result. Change practice.

↓

NICUs adopt early surfactant. Later trials show no benefit. Practice reverses.

OUTCOME: Years of unnecessary intubation of premature infants

PATH B: Report Both Models

Show both FE and RE results. Flag that significance depends on the model choice. Call for a large trial.

↓

Honest uncertainty. Large trials prioritized. True answer emerges faster.

OUTCOME: Premature babies spared unnecessary intervention

THE REVELATION

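When a conclusion changes depending on whether you use fixed or random effects, the conclusion is fragile. Report both. Acknowledge the uncertainty. Remember: fragile results from small trials are not grounds for changing practice.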

Module 7 Quiz

1. Why did the magnesium meta-analysis show a benefit that ISIS-4 did not find?

A. ISIS-4's methods were flawed

B. Calculation error in meta-analysis

C. Publication bias in small trials

D. LIMIT-2 was underpowered

2. What warning sign should have alerted reviewers to potential bias?

A. Asymmetric funnel plot (small trials showing larger effects)

B. Low heterogeneity (I² = 0%)

C. Strong biological plausibility

D. Too few trials to analyze

3. When publication bias is suspected, which model may amplify the bias?

A. Fixed effect model

B. Random effects model

C. Bayesian model

D. Network meta-analysis

Small trials can show false signals.

Large trials anchor the truth.

Heterogeneity is a message, not noise.

Module 8: Heterogeneity

Heterogeneity is a message, not noise.

ACCORD: 2008

When the average hides the truth.

Module 8: Heterogeneity

🎯 Learning Objectives

Calculate and interpret I², τ², and prediction intervals
Apply ICEMAN criteria to assess subgroup credibility
Distinguish between clinical, methodological, and statistical heterogeneity
Conduct and interpret leave-one-out sensitivity analyses
Explain how ACCORD revealed differential effects across subgroups

The Year: 2008

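"You are about to witness one of the most shocking trial terminations in history..."

For decades, the diabetes community had one guiding principle: lower blood sugar is better. The landmark DCCT (1993) and UKPDS (1998) showed that intensive glucose control reduced microvascular complications: blindness, kidney failure, nerve damage.

The logical extrapolation: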

If controlling glucose prevents complications, shouldn't intensive control prevent cardiovascular disease too?

ACCORD: Action to Control Cardiovascular Risk in Diabetes

The definitive test of intensive glucose control

10,251

Type 2 diabetics

HbA1c <6%

Intensive target

HbA1c 7-7.9%

Standard target

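All patients had type 2 diabetes with high cardiovascular risk: established cardiovascular disease or multiple risk factors. The trial was designed to run for 5.6 years.

February 6, 2008

The Data Safety Monitoring Board convenes an emergency meeting.

After 3.5 years, they make an unprecedented decision:

Stop the trial.

The Shocking Results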

Outcome	Intensive	Standard	HR (95% CI)
Primary CV endpoint	352 events	371 events	0.90 (0.78–1.04)
All-cause mortality	257 deaths	203 deaths	1.22 (1.01–1.46)
Severe hypoglycemia	10.5%	3.5%	3.0× higher

22% increase in mortality

54 excess deaths in the intensive arm

🔍

Investigation Exercise: The Clinician's Dilemma

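You are an endocrinologist treating 500 patients with diabetes. The ACCORD results have just been published. What do you advise patients who have been striving for HbA1c <6%?

Is intensive control harmful for everyone? Or only for some?

Subgroup analysis reveals: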

Subgroup	Intensive HR	Interpretation
No prior CVD	1.00 (0.76–1.32)	No effect
Prior CVD	1.45 (1.15–1.84)	Significant harm
Baseline HbA1c <8%	1.02 (0.75–1.40)	No effect
Baseline HbA1c ≥8%	1.29 (1.03–1.60)	Harm

The average effect masked critical heterogeneity!

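For patients with established CVD or poor baseline control, intensive treatment was harmful.

Understanding Heterogeneity: I² and Beyond

When studies (or subgroups) show different effects, we must quantify that variation.

I² = 0–25%: Low heterogeneity. Effects are consistent across studies.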

I² = 25–50%: Moderate. Look for sources of variation.

I² = 50–75%: Substantial. Consider whether pooling is appropriate.

I² = 75–100%: Considerable. A single pooled estimate may mislead.

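But I² alone does not tell you what to do; it signals that you need to investigate further.

Tau² (τ²): Between-Study Variance

While I² tells you the proportion of variance that is due to heterogeneity, τ² tells you its absolute magnitude.

I² (percentage)

"What fraction of the total variance is due to real differences between studies?"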

Scale: 0% to 100%

τ² (absolute)

"How much do the true effects vary between studies?"

Same scale as the effect measure

Use τ² to calculate prediction intervals

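The prediction interval shows the range of effects you would expect in a new study; it is usually much wider than the confidence interval.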
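As a rough illustration of how I² and τ² relate, the sketch below computes Cochran's Q, I², and the DerSimonian-Laird τ² from study-level estimates. The input values are hypothetical and the function name is illustrative; other τ² estimators (REML, Paule-Mandel) exist and can give different answers.

```python
def heterogeneity(effects, std_errors):
    """Return Cochran's Q, I^2 (percent), and the DerSimonian-Laird tau^2."""
    weights = [1 / se ** 2 for se in std_errors]
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Q: weighted squared deviations from the fixed-effect pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # I^2: share of total variation beyond what chance alone would produce.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # tau^2: absolute between-study variance, on the scale of the effect measure.
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    return {"Q": q, "df": df, "I2": i2, "tau2": tau2}


# Hypothetical log hazard ratios and standard errors from five trials.
log_hr = [0.20, -0.05, 0.35, 0.02, 0.18]
se = [0.09, 0.07, 0.12, 0.06, 0.10]

stats = heterogeneity(log_hr, se)
print(f"Q = {stats['Q']:.2f} on {stats['df']} df, "
      f"I2 = {stats['I2']:.0f}%, tau2 = {stats['tau2']:.4f}")
```

Because I² is a proportion, it rises as studies get larger even when τ² stays the same; reporting both avoids reading a high I² as a large absolute spread.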

📊

The Prediction Interval: What ACCORD Really Tells Us

Consider a meta-analysis of intensive glucose control across multiple trials...

Confidence Interval

HR 1.10 (0.95–1.27)

"Our best estimate of the average effect"

Prediction Interval

HR 1.10 (0.70–1.73)

"The range of effects in a new setting"

The prediction interval spans both benefit and harm!

In some settings, intensive control might help. In others, it could kill.
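The two intervals above differ only in whether between-study spread is added to the uncertainty. The sketch below assumes a between-study standard deviation of τ = 0.22 on the log scale (a value chosen for illustration, not taken from the course data) and uses a normal quantile for simplicity; the usual formula uses a t quantile with k − 2 degrees of freedom, which widens the interval further.

```python
import math
from statistics import NormalDist


def prediction_interval(hr, ci_low, ci_high, tau, level=0.95):
    """Approximate prediction interval for a hazard ratio, working on the log scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    log_hr = math.log(hr)
    # Recover the standard error of the pooled log HR from its confidence interval
    # (assumes the CI was computed at the same level).
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    # A new study's true effect varies by tau around the mean, on top of the SE.
    half_width = z * math.sqrt(tau ** 2 + se ** 2)
    return math.exp(log_hr - half_width), math.exp(log_hr + half_width)


# Pooled HR and CI from the slide; tau = 0.22 is an assumed value.
low, high = prediction_interval(1.10, 0.95, 1.27, tau=0.22)
print(f"Prediction interval: {low:.2f} to {high:.2f}")
```

Under that assumed τ the result lands close to the 0.70–1.73 interval shown above, which illustrates how a moderate amount of between-study variance turns a fairly tight average into a range covering both benefit and harm.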

When Is a Subgroup Effect Credible?

Subgroup Credibility Criteria (adapted from ICEMAN, Schandelmaier 2020 & Sun 2012)

1

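Was the subgroup analysis pre-specified?

Post hoc subgroups are prone to data dredging

2

Is there a plausible biological rationale?

The mechanism should be clear and independent of the data

3

Is the effect consistent across related outcomes?

If harm appears for death, does similar harm appear for myocardial infarction and stroke?

4

Is there independent replication?

Has the subgroup effect been confirmed in other studies?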

ICEMAN Applied to ACCORD

Criterion	Assessment	Score
Pre-specified?	Yes, prior CVD was in the protocol	✓
Biological rationale?	Yes—hypoglycemia more dangerous with CVD	✓
Consistent outcomes?	Yes—CV mortality and all-cause mortality aligned	✓
Independent replication?	Partially—ADVANCE, VADT showed similar patterns	~

ICEMAN Rating: High Credibility

The differential harm in high-risk patients appears genuine.

Clinical Implications

For patients without CVD: Moderate glucose control (HbA1c ~7%) remains the goal. Intensive control may reduce microvascular complications.

For patients with established CVD: Avoid intensive targets. Hypoglycemia is dangerous for damaged hearts.

For elderly patients: Relaxed targets. Quality of life matters. Tight control causes falls, confusion, and excess mortality.

"One size fits all" treatment is not patient-centered medicine.

Meta-Regression: Explaining Heterogeneity

When heterogeneity is high, meta-regression can identify study-level covariates that explain variation.

THE QUESTION

Does effect size vary systematically with study characteristics?

Covariates

Year, dose, duration, baseline risk, study quality

Output

Regression coefficient (slope), R², residual heterogeneity

Caution

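Meta-regression needs ≥10 studies per covariate. With fewer studies it is exploratory only. Ecological fallacy: study-level associations may not apply to individuals.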

Example: Across intensive glucose-control trials such as ACCORD, ADVANCE, and VADT, a meta-regression could test whether the treatment effect varies with mean baseline HbA1c. With so few trials, any such finding would be exploratory.
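A minimal sketch of a one-covariate meta-regression, fitted by weighted least squares with inverse-variance weights. The trial values are hypothetical, and this is the fixed-effect version: it ignores residual between-study variance, which a mixed-effects meta-regression would add to each study's variance.

```python
import math


def meta_regression(effects, std_errors, covariate):
    """Weighted least squares of effect on one study-level covariate."""
    w = [1 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, covariate)) / sum_w
    y_bar = sum(wi * y for wi, y in zip(w, effects)) / sum_w
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, covariate))
    sxy = sum(wi * (x - x_bar) * (y - y_bar) for wi, x, y in zip(w, covariate, effects))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # With known within-study variances, Var(slope) = 1 / sum(w * (x - x_bar)^2).
    slope_se = math.sqrt(1 / sxx)
    return {"intercept": intercept, "slope": slope, "slope_se": slope_se}


# Hypothetical trial-level data: log HR for mortality and mean baseline HbA1c (%).
log_hr = [0.05, 0.10, 0.20, 0.02, 0.15]
se = [0.10, 0.08, 0.09, 0.12, 0.11]
mean_hba1c = [7.5, 8.1, 8.3, 7.2, 8.0]

fit = meta_regression(log_hr, se, mean_hba1c)
print(f"slope per 1% HbA1c: {fit['slope']:.3f} (SE {fit['slope_se']:.3f})")
```

The slope describes how trial-level effects move with trial-level averages; it says nothing certain about how an individual patient's HbA1c modifies their own risk.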

Story: The SPRINT Blood Pressure Revolution

What number saves lives? Who decides?

REAL DATA

For decades the target was to treat blood pressure to <140 mmHg systolic. Then came SPRINT (2015): 9,361 high-risk patients randomized to intensive (<120) vs standard (<140) targets. Intensive treatment reduced CV events by 25% and death by 27%. Trial stopped early for benefit. Guidelines changed worldwide.

Before SPRINT: The Guidelines Committee

You are writing blood pressure guidelines in 2014. The target has been <140 for years. Should you wait for better evidence?

PATH A: Maintain Status Quo

Keep <140 target (established practice, minimal controversy)

↓

Guidelines unchanged. Physicians continue treating to <140.

OUTCOME: Miss opportunity to prevent deaths

PATH B: Fund the Definitive Trial

Wait for SPRINT results before updating the target

↓

SPRINT demonstrates benefit. Update target to <120 for high-risk patients.

OUTCOME: Estimated 100,000+ lives saved globally

JNC 7 (2003): <140

Years of uncertainty

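SPRINT (2015): <120 for high-risk patients

THE REVELATION

The "standard of care" is not fixed. It changes when trials challenge assumptions. For a decade, patients may have been undertreated because nobody tested the obvious question.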

Module 8 Quiz

1. Why was the ACCORD trial stopped early?

A. Intensive control showed clear cardiovascular benefit

B. Intensive control increased mortality

C. Enrollment was too slow

D. Budget ran out

2. What does a prediction interval tell us that a confidence interval doesn't?

A. The true effect is more precisely estimated

B. The sample size is adequate

C. The range of effects we would expect in a new study

D. The mathematical formula used

3. According to ICEMAN, which factor is MOST important for subgroup credibility?

A. Pre-specification of the subgroup hypothesis

B. Large sample size in the subgroup

C. Statistically significant p-value

D. Multiple outcomes showing same direction

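When studies disagree,

listen to the disagreement.

Heterogeneity is a message, not noise.

Absence of evidence is not evidence of absence.

Module 9: The Hidden Studies

Absence of evidence is not evidence of absence.

Reboxetine: 2010

The 74% that never saw the light.

Module 9: The Hidden Studies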

🎯 Learning Objectives

Interpret funnel plots for asymmetry detection
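Apply Egger's test and other statistical tests for publication bias
Implement the trim-and-fill method for bias adjustment
Critically appraise the limitations of publication bias tests
Apply the principle: "Absence of evidence is not evidence of absence"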

The Year: 1997

"A new hope for depression patients who cannot tolerate SSRIs..."

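Reboxetine (Edronax) was a new antidepressant: a selective norepinephrine reuptake inhibitor (NRI). Unlike SSRIs, it targeted a different neurotransmitter system. For patients who had failed or could not tolerate fluoxetine or sertraline, it offered a new mechanism.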

1997

EU approval

50+

Countries approved

Millions

Prescriptions written

The Published Evidence

What doctors could find in medical journals:

Comparison	Published Trials	Published Result
Reboxetine vs Placebo	3 trials (n=507)	Significantly better (SMD = 0.56)
Reboxetine vs SSRIs	4 trials (n=628)	Equivalent or better

The published literature told a clear story:

Reboxetine works. Patients benefit. Prescribe with confidence.

But what about the trials you could not see?

In 2010, German researchers at IQWiG made a request to the European Medicines Agency...

They demanded access to all trial data, published and unpublished.

What they found changed everything.

The Full Picture

Eyding et al., BMJ 2010

Comparison	Published Only	ALL DATA
Reboxetine vs Placebo	SMD 0.56 (benefit)	SMD 0.10 (no benefit)
Patients in analysis	507 (14%)	2,731 (100%)
Reboxetine vs SSRIs	Equivalent	Inferior (harm RR 1.23)
Patients in analysis	628 (26%)	2,411 (100%)

74% of patient data were never published

The hidden trials showed no benefit and more harm

🔍

Investigation Exercise: The File Drawer

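You are a systematic reviewer in 2008. You search PubMed, Embase, and the Cochrane Library for all reboxetine trials. You find 7 published trials showing benefit.

Can you trust this evidence?

⚠️ The funnel is completely asymmetric!

All the published studies cluster on one side. Where are the null and negative trials?

The Publication Bias Toolkit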

1

Funnel Plot

Plot effect size vs. standard error. A symmetric funnel suggests no bias; asymmetry raises alarms.

2

Egger's Regression Test

Regress effect/SE on 1/SE. A non-zero intercept (P < 0.10) suggests small-study effects. Note: inflated false-positive rate with binary outcomes; use Peters' test instead.

3

Peters' Test

For binary outcomes, regresses log OR on inverse of total sample size. Less prone to false positives.

4

Trim-and-Fill

Imputes "missing" studies to make the funnel symmetric, then recalculates the pooled effect.
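A small sketch of Egger's regression as described in the toolkit: regress each study's effect divided by its standard error on its precision (1/SE) and inspect the intercept. The SMDs and standard errors are hypothetical, and the function name is illustrative. The sketch stops at the intercept and its standard error; a p-value would need a t distribution with n − 2 degrees of freedom.

```python
import math


def egger_regression(effects, std_errors):
    """Ordinary least squares of (effect / SE) on (1 / SE)."""
    z = [y / se for y, se in zip(effects, std_errors)]
    precision = [1 / se for se in std_errors]
    n = len(z)
    x_bar = sum(precision) / n
    z_bar = sum(z) / n
    sxx = sum((x - x_bar) ** 2 for x in precision)
    sxy = sum((x - x_bar) * (zi - z_bar) for x, zi in zip(precision, z))
    slope = sxy / sxx
    intercept = z_bar - slope * x_bar

    # Standard error of the intercept from the residual variance (n - 2 df).
    residual_ss = sum((zi - intercept - slope * x) ** 2 for x, zi in zip(precision, z))
    residual_var = residual_ss / (n - 2)
    intercept_se = math.sqrt(residual_var * (1 / n + x_bar ** 2 / sxx))
    return {"intercept": intercept, "slope": slope, "intercept_se": intercept_se}


# Hypothetical SMDs: the smallest trials (largest SE) report the largest effects.
smd = [0.90, 0.75, 0.60, 0.45, 0.35, 0.30, 0.25]
se = [0.40, 0.35, 0.30, 0.22, 0.18, 0.15, 0.12]

asymmetric = egger_regression(smd, se)
print(f"Egger intercept: {asymmetric['intercept']:.2f} "
      f"(SE {asymmetric['intercept_se']:.2f})")
```

When every study estimates the same effect regardless of size, the intercept sits near zero and the slope recovers the common effect; an intercept far from zero signals small-study effects, of which publication bias is only one possible cause.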

📊

Interactive: Trim-and-Fill Analysis

Let's apply trim-and-fill to the reboxetine data and see what the adjusted estimate is...

Published Only

7 trials

SMD = 0.56

Significant benefit

Trim-and-Fill

7 + 5 imputed = 12 trials

SMD = 0.23

Reduced, still nominally significant

But even trim-and-fill underestimated the problem!

The true effect from all data was SMD = 0.10 (essentially null).
Trim-and-fill is conservative—it doesn't fully correct for selective publication.

The Best Defense: Trial Registries

Methods for detecting publication bias are imperfect. The real solution is prospective registration.

ClinicalTrials.gov

US registry (2000)

WHO ICTRP

Global portal

PROSPERO

Review registration

When searching for trials, always check the registries. Compare the number of registered trials with the number published. The gap is your warning sign.

Since 2005, ICMJE requires trial registration as a condition of publication.

The AllTrials Campaign

"All trials registered. All results reported."

The reboxetine scandal, along with similar cases involving other drugs, catalyzed a global movement:

✓

2013: EMA clinical data policy

European Medicines Agency commits to publishing clinical study reports

✓

2016: FDA Amendments Act enforcement

Mandatory results reporting on ClinicalTrials.gov within 12 months

✓

AllTrials Coalition

Over 90,000 supporters, 700+ organizations demanding transparency

The Reboxetine Aftermath

!

Germany's IQWiG recommended against reboxetine for depression

!

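UK NICE downgraded it to "not recommended"

!

The FDA rejected reboxetine in 2001 (it had access to the unpublished data)

For more than a decade, patients received a drug that was no better than placebo.

Because only the positive trials were published.

Story: The Paroxetine Study 329 Deception

What if the published conclusion is the opposite of the actual data?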

REAL DATA

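GlaxoSmithKline's Study 329 tested paroxetine in adolescent depression. The published paper (2001) concluded that paroxetine was "generally well tolerated and effective." The actual data: paroxetine failed on all 8 pre-specified outcomes. When re-analyzed (RIAT 2015), suicidal/self-harm events numbered 23 in the paroxetine group versus 5 in the placebo group. The published paper had redefined outcomes post hoc to produce significance. In 2015, the RIAT (Restoring Invisible and Abandoned Trials) re-analysis, based on the original clinical study reports, concluded that paroxetine is neither safe nor effective for adolescents.

The Prescriber's Dilemma: 2003

You are a child psychiatrist. Study 329, the only large trial, says paroxetine works in adolescents. But the FDA has not approved it for adolescents. Parents are asking you to prescribe. What do you do?

PATH A: Trust the Publication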

A peer-reviewed JAACAP paper says it works. Prescribe off-label.

↓

Millions of prescriptions worldwide. Suicidal events in adolescents.

OUTCOME: FDA issues black box warning for SSRIs in youth (2004)

PATH B: Check the Trial Registry

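Search ClinicalTrials.gov for the original endpoints. Notice that the published outcomes do not match the registered protocol.

↓

Red flag: outcome switching detected. You withhold the drug. Patients are safer.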

OUTCOME: Publication bias identified before harm

THE REVELATION

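Publication bias is not only about missing studies. It is also about truth missing from published studies. Outcome switching, ghostwriting, and selective reporting can turn a failed trial into a marketing tool. Always compare published results against the trial's registered protocol.

Module 9 Quiz

1. What proportion of reboxetine trial data was hidden from the published literature?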

A. 25%

B. 50%

C. 74%

D. 90%

2. Why can trim-and-fill underestimate the correction needed?

A. It assumes effects are normally distributed

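B. It imputes studies only to achieve symmetry, which may not fully reflect reality

C. It requires at least 20 studies

D. It only works for very large studies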

3. What is the best prospective defense against publication bias?

A. Funnel plots in all meta-analyses

B. Egger's test before pooling

C. Prospective trial registration

D. More medical journals

What you cannot see

may be more important than what you can.

Absence of evidence is not evidence of absence.

Certainty must be earned, not assumed.

Module 10: Certainty

Certainty must be earned, not assumed.

Early Surfactant: 2012

When high-quality evidence emerges.

Module 10: Certainty

🎯 Learning Objectives

Apply the full GRADE framework to rate evidence
Evaluate all five downgrade factors (RoB, inconsistency, indirectness, imprecision, publication bias)
Identify when to upgrade for large effect, dose-response, or confounding
Construct Summary of Findings tables with absolute effect estimates
Apply the principle: "Certainty must be earned, not assumed"

The Year: 1990s

"A revolution in neonatal care..."

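Respiratory distress syndrome (RDS) was a leading cause of death in premature infants. The development of exogenous surfactant, a substance that prevents alveolar collapse, was one of the great advances of neonatal medicine.

The question became: when should we give surfactant?

Prophylactically (to all high-risk infants) or selectively (only after RDS develops)?

The Original Cochrane Review (2003)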

Multiple RCTs conducted before the era of routine CPAP

Outcome	Prophylactic vs Selective	Certainty
Neonatal mortality	RR 0.73 (favors prophylactic)	High
BPD or death	RR 0.84 (favors prophylactic)	High

Recommendation: Give surfactant prophylactically

Guidelines worldwide adopted this approach

But the world of neonatal care was changing...

A new technology emerged: Continuous Positive Airway Pressure (CPAP)

Non-invasive support that could help preterm lungs without intubation.

Does the old evidence still apply?

The 2012 Cochrane Update

New trials conducted in the CPAP era

Outcome	Old Trials	New Trials
BPD or death	RR 0.84 (favors prophylactic)	RR 1.12 (favors selective)
Need for mechanical ventilation	Lower with prophylactic	Higher with prophylactic!

Complete Reversal

In the CPAP era, prophylactic surfactant causes more harm

🔍

Investigation: Why Did Evidence Evolve?

You are a neonatologist. A colleague asks: "How can randomized trials contradict each other?"

Was the original evidence wrong?

1

Indirectness Changed

Old trials: No CPAP available. New trials: CPAP standard of care.

2

The Comparator Improved

Selective surfactant + CPAP is better than prophylactic intubation.

3

Context Matters

Evidence from one era may not apply to the next.

This is why GRADE assesses Indirectness!

High-quality evidence can become inapplicable when context changes.

The GRADE Framework

Grading of Recommendations, Assessment, Development and Evaluations

GRADE answers the question: how confident are we in this estimate?

⊕⊕⊕⊕ HIGH: Very confident. True effect is close to the estimate.

⊕⊕⊕◯ MODERATE: Moderately confident. True effect likely close, but may differ substantially.

⊕⊕◯◯ LOW: Limited confidence. True effect may differ substantially.

⊕◯◯◯ VERY LOW: Very little confidence. True effect likely substantially different.

GRADE: Factors That Downgrade Certainty

Evidence from randomized controlled trials starts at HIGH. It can be downgraded for:

1

Risk of Bias

Flawed randomization, lack of blinding, incomplete follow-up, selective reporting

2

Inconsistency

Unexplained heterogeneity across studies (large I², non-overlapping CIs)

3

Indirectness

Differences in population, intervention, comparator, or outcome from the question of interest

4

Imprecision

Wide confidence intervals, small sample size, few events

GRADE: The Fifth Factor

5

Publication Bias

Asymmetric funnel plot, missing registered trials, sponsor influence

Each factor can downgrade by one or two levels

High → Moderate → Low → Very Low

Example: A meta-analysis of RCTs (starting at HIGH) with high risk of bias (↓1) and serious indirectness (↓1) would be rated LOW.
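The arithmetic of the example can be sketched in a few lines. This is only a bookkeeping aid with illustrative names: in real GRADE assessments each downgrade is a structured judgment, and the final rating is not a mechanical sum.

```python
LEVELS = ["VERY LOW", "LOW", "MODERATE", "HIGH"]


def grade_certainty(design, downgrades=None, upgrades=None):
    """Start from the design's default level, subtract downgrades, add upgrades."""
    start = {"rct": 3, "observational": 1}[design]
    down = sum((downgrades or {}).values())
    up = sum((upgrades or {}).values())
    # Certainty cannot go below VERY LOW or above HIGH.
    index = max(0, min(len(LEVELS) - 1, start - down + up))
    return LEVELS[index]


# The worked example: RCT evidence with two one-level downgrades.
example = grade_certainty("rct", downgrades={"risk_of_bias": 1, "indirectness": 1})
print(example)
```

Keeping the downgrade reasons as named entries, rather than a single number, mirrors the transparency GRADE asks for: a reader can see which domain cost the evidence its certainty.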

📊

Interactive: Apply GRADE to Surfactant

Let's rate the certainty of the prophylactic surfactant evidence using the old trials versus the new trials.

OLD TRIALS (Pre-CPAP)

Starting: HIGH (RCTs)

Risk of Bias: Low (−0)

Inconsistency: None (−0)

Indirectness: Serious (−1)

Different standard of care today

Final: ⊕⊕⊕◯ MODERATE

NEW TRIALS (CPAP Era)

Starting: HIGH (RCTs)

Risk of Bias: Low (−0)

Inconsistency: None (−0)

Indirectness: None (−0)

Matches current practice

Final: ⊕⊕⊕⊕ HIGH

GRADE: Factors That Upgrade Certainty

Observational evidence starts at LOW. It can be upgraded for:

+1

Large Magnitude of Effect

RR >2 or <0.5, with no plausible confounding

+1

Dose-Response Gradient

Higher exposure = larger effect in a consistent pattern

+1

Residual Confounding

All plausible confounders would reduce the effect (strengthens causal inference)

Communicating Certainty

GRADE requires transparent language about confidence:

HIGH: "Prophylactic surfactant reduces mortality..."

MODERATE: "Prophylactic surfactant probably reduces mortality..."

LOW: "Prophylactic surfactant may reduce mortality..."

VERY LOW: "We are uncertain whether prophylactic surfactant reduces mortality..."

This language ensures clinicians understand the strength of the evidence.

Story: The Oxygen Paradox in Premature Infants

Can too much of a lifesaver become a killer?

REAL DATA

1940s-50s: High oxygen concentrations saved premature babies from respiratory failure. Then came an epidemic of blindness: retrolental fibroplasia (now called ROP). Doctors reduced oxygen dramatically. Blindness dropped. But then: increased deaths and brain damage from hypoxia. The optimal oxygen level took decades of trials to find. Recent SUPPORT/BOOST II trials finally defined the therapeutic window: SpO2 91-95%.

The Neonatologist's Dilemma: 1955

You are a neonatologist. Premature babies are going blind under high oxygen. What do you do?

PATH A: Dramatic Reduction

Drastically reduce oxygen to prevent blindness

↓

Blindness rates drop. But some babies die or suffer brain damage from hypoxia.

OUTCOME: Trading one harm for another

PATH B: Systematic Research

Titrate oxygen carefully and study the dose-response relationship

↓

Takes decades but eventually identifies the optimal range.

OUTCOME: Optimize both survival and vision

1940s: High O2 saves lives

1950s: Blindness epidemic

1960s-70s: Deaths from low O2

2010s: SUPPORT/BOOST define optimal range

THE REVELATION

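Every intervention has a therapeutic window. Finding it takes measurement, not assumption. The pendulum swung for 60 years before evidence settled the balance.

Module 10 Quiz

1. Why did the surfactant recommendation reverse between 2003 and 2012?

A. The original trials were fraudulent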

B. CPAP changed the comparator (indirectness)

C. Not enough patients in original trials

D. Outcomes were measured differently

2. Which of the following is NOT a GRADE downgrade factor?

A. Risk of bias

B. Imprecision

C. Publication bias

D. Large magnitude of effect

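3. What language should be used for low-certainty evidence?

A. "The intervention reduces..."

B. "The intervention probably reduces..."

C. "The intervention may reduce..."

D. "We are uncertain whether..."

Numbers are not enough.

You must express your certainty.

Certainty must be earned, not assumed.

Methods protect patients from our trust.

Module 11: Living Reviews

Methods protect patients from our trust.

COVID-19 Hydroxychloroquine: 2020

When emergency meets evidence.

Module 11: Living Reviews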

🎯 Learning Objectives

Apply trial sequential analysis to determine when evidence is sufficient
Design and maintain a living systematic review
Establish update triggers and futility/harm boundaries
Manage multiplicity and alpha-spending in sequential analyses
Explain how rapid evidence synthesis evolved during COVID-19

March 2020: A World in Crisis

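"The virus is spreading faster than our understanding..."

COVID-19 is killing thousands. Intensive care units are overwhelmed. No vaccine, no treatment. Then a glimmer of hope: hydroxychloroquine (HCQ), an old malaria drug, showed antiviral activity in lab studies.

March 20

Gautret study (France)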

36 pts

Non-randomized

Viral

Clearance improved

The Rush to Adopt

Within weeks of the Gautret study:

!

March 28: FDA issues Emergency Use Authorization for HCQ

!

April 4: India bans HCQ export (hoarding fears)

!

Global: Shortages affect lupus and rheumatoid arthritis patients

Millions received HCQ based on a 36-patient observational study

What could go wrong?

🔍

Investigation: The Gautret Study

You are an EBM expert asked to evaluate the French HCQ study. Examine the design...

Issue	Impact
Non-randomized	Selection bias—who got HCQ?
6 patients excluded	3 went to ICU, 1 died, 1 withdrew, 1 had nausea
Surrogate outcome	Viral load, not clinical outcomes
Controls from different hospitals	Different care, different testing
No blinding	Expectation bias in lab testing

This study would be rated at high risk of bias (as a non-randomized study it falls under ROBINS-I rather than RoB 2)

GRADE certainty: VERY LOW. Yet it changed global policy.

Why Observational COVID Studies Misled

1

Immortal Time Bias

Patients must survive long enough to receive treatment. Survivors are compared to non-survivors.

2

Confounding by Indication

Sicker patients may get different treatments. Healthier patients received HCQ early.

3

Healthy User Effect

Patients who seek treatment tend to be healthier overall.

4

Outcome Reporting

Studies with positive results are published faster.

June 2020: The RCTs Report

Large, rigorous trials completed at remarkable speed

Trial	N	Result
RECOVERY (UK)	4,716	No benefit on mortality (RR 1.09)
WHO SOLIDARITY	954	No benefit (RR 1.19)
ORCHID (US)	479	Stopped for futility

HCQ provided no benefit—and may have caused harm

June 15, 2020: FDA revokes Emergency Use Authorization

📊

Timeline: Observational vs RCT Evidence

March-May 2020

Observational: ~20 studies

Suggest benefit

Pooled OR ~0.65

June-July 2020

RCTs: RECOVERY, SOLIDARITY

Show no benefit/harm

Pooled RR ~1.10

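From "promising" to "ineffective" in 3 months

This is why we need randomization, and living reviews to track evolving evidence.

Living Systematic Reviews

A new approach for rapidly evolving evidence: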

1

Continuous Surveillance

Search the literature weekly or even daily for new evidence

2

Cumulative Meta-Analysis

Update pooled estimates as each new trial reports

3

Trial Sequential Analysis (TSA)

Determine when sufficient information has accumulated to conclude

4

Transparent Versioning

Track every change, maintain full audit trail

Trial Sequential Analysis (TSA)

When have we learned enough?

TSA applies stopping boundaries to meta-analysis, similar to interim analyses in a single trial. It accounts for the required information size (RIS) needed to detect or exclude a clinically meaningful effect.

RIS

Required sample size

α-spending

Controls type I error

Boundaries

Benefit / Harm / Futility

For HCQ in COVID-19, TSA showed that the futility boundary had been crossed by June 2020.
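To make the required information size concrete, here is a hedged sketch for a binary outcome. It uses the commonly cited form 4(z₁₋α/₂ + z₁₋β)² · p̄(1 − p̄) / δ², inflated by 1/(1 − D²) for heterogeneity, where D² is the diversity. The control risk, relative risk reduction, and D² are hypothetical inputs, and dedicated TSA software may differ in detail.

```python
import math
from statistics import NormalDist


def required_information_size(control_risk, relative_risk_reduction,
                              alpha=0.05, power=0.80, diversity=0.0):
    """Total participants needed across trials to detect the specified effect."""
    experimental_risk = control_risk * (1 - relative_risk_reduction)
    delta = control_risk - experimental_risk          # absolute risk difference
    p_bar = (control_risk + experimental_risk) / 2    # average event risk
    variance = p_bar * (1 - p_bar)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n_fixed = 4 * (z_alpha + z_beta) ** 2 * variance / delta ** 2
    # Heterogeneity adjustment: diversity (D^2) must lie in [0, 1).
    return math.ceil(n_fixed / (1 - diversity))


# Hypothetical inputs: 20% control-group mortality, 20% relative risk reduction.
ris_homogeneous = required_information_size(0.20, 0.20)
ris_diverse = required_information_size(0.20, 0.20, diversity=0.5)
print(ris_homogeneous, ris_diverse)
```

A cumulative meta-analysis that reaches nominal significance with far fewer participants than the RIS has not yet earned a firm conclusion; the sequential boundaries exist to guard against exactly that.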

Lessons from the HCQ Saga

1. Observational studies can mislead spectacularly when bias is rampant. Even many studies pointing in the same direction can be wrong.

2. RCTs can be conducted quickly when the will exists. RECOVERY enrolled 5,000+ patients in weeks.

3. Living reviews are essential for evolving topics. Fixed-point-in-time reviews become obsolete instantly.

4. Political pressure doesn't change biology. Rigorous methods protect patients even under pressure.

Story: The LEAP Peanut Allergy Revolution

What if the prevention was the cause?

REAL DATA

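For decades, pediatric guidelines recommended: avoid peanuts in infancy to prevent allergy. Meanwhile, peanut allergy rates tripled from 1997 to 2008. Then came LEAP (2015): 640 high-risk infants randomized to early peanut introduction vs. avoidance. Result: Early introduction reduced peanut allergy by 81% overall; among infants with a negative baseline skin-prick test, prevalence was 1.9% vs 13.7%. The prevention strategy had fueled the epidemic.

The Allergist's Crossroads: 2010

You are a pediatric allergist. Despite avoidance guidelines, peanut allergy keeps rising. Do you question the dogma?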

PATH A: Follow Guidelines

Continue recommending peanut avoidance in high-risk infants

↓

Guidelines are "evidence-based." Safe to follow consensus.

OUTCOME: Peanut allergies continue to rise

PATH B: Question the Dogma

Design a trial to test if early introduction might be protective

↓

LEAP trial reveals the truth. Guidelines reverse worldwide.

OUTCOME: Prevent an epidemic

2000: AAP recommends avoidance

2008: Allergy rates triple

2015: LEAP overturns the evidence

2017: Guidelines flip to early introduction

THE REVELATION

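"First, do no harm" requires evidence. Assumptions, even well-intentioned ones, can cause harm at scale. The immune system needs exposure to develop tolerance; avoidance breeds allergy.

Module 11 Quiz

1. What was the main flaw of the Gautret hydroxychloroquine study?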

A. Too few patients

B. No blinding

C. Excluding patients who deteriorated

D. Too short follow-up

2. What does Trial Sequential Analysis help determine?

A. Which studies have high risk of bias

B. When enough evidence has accumulated

C. The degree of heterogeneity

D. Which treatment is best

3. Why did observational COVID-19 studies show a benefit of HCQ when randomized controlled trials did not?

A. RCTs enrolled sicker patients

B. RCTs used different outcomes

C. The observational studies were biased

D. The observational studies had better data

Speed cannot replace rigor.

But rigor can be fast.

Living reviews balance both.

Not every signal is real.

Module 12: Advanced Methods

Not every signal is real.

Advanced Methods

Beyond pairwise meta-analysis.

Module 12: Advanced Methods

🎯 Learning Objectives

Interpret network meta-analysis geometry and SUCRA rankings
Apply bivariate models for diagnostic test accuracy meta-analysis
Conduct dose-response meta-analysis with flexible splines
Understand when individual patient data (IPD) meta-analysis is needed
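Recognize the assumptions and limitations of each advanced method

When Pairwise Is Not Enough

"Sometimes the question is more complex than A versus B..."

The methods you have learned form the foundation. But clinical reality often demands more: Which of 10 antidepressants is best? What's the optimal dose of statin? Does this test accurately diagnose early cancer?

This module introduces four advanced methods, each answering a different kind of complex question.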

Network Meta-Analysis (NMA)

When you have many treatments but few head-to-head trials

NMA combines direct evidence (A vs B) with indirect evidence (A vs C, B vs C → inferred A vs B) to compare multiple treatments simultaneously.

SUCRA

Ranking probabilities, not effect size

Consistency

Direct = Indirect?

Networks

Visualize evidence

🔍

NMA Example: Antidepressants

The landmark Cipriani 2018 NMA compared 21 antidepressants using 522 trials.

The Challenge

21 drugs, but not every pair tested head-to-head

Many vs. placebo, few vs. each other

The Solution

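NMA combines direct and indirect evidence across the whole network

Ranks all 21 drugs on efficacy and acceptability

Result: some drugs rank higher on efficacy, others on acceptability

No single drug is universally "best"; interpret rankings in light of credible intervals, transitivity, and clinical trade-offs.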

NMA: Critical Assumptions

1

Transitivity

Effect modifiers should be similarly distributed across comparisons; otherwise indirect comparisons may be biased

2

Consistency

Direct and indirect evidence agree (testable)

3

Connected Network

All treatments linked through at least one common comparator

When assumptions fail, NMA can mislead

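Always assess transitivity and test for inconsistency.

Dose-Response Meta-Analysis

Finding the optimal dose

Uses the Greenland-Longnecker method with restricted cubic splines to model non-linear relationships between dose and effect.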

1

Non-linear patterns

J-shaped (alcohol & mortality), U-shaped (vitamin D), threshold (aspirin)

2

Clinical relevance

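Find the dose with the best benefit-harm balance, not just "more is better"

Individual Patient Data (IPD)

The gold standard for subgroup analysis

Instead of published summary data, obtain raw patient-level data from the trialists. This enables precise subgroup analyses, time-to-event modeling, and standardized definitions.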

One-Stage

Single hierarchical model (not mega-trial)

Two-Stage

Analyze, then pool

80%+ target

Data availability target

The Early Breast Cancer Trialists' Collaborative Group pioneered IPD MA in the 1980s.

Diagnostic Test Accuracy (DTA)

When the "intervention" is a test

DTA meta-analysis synthesizes sensitivity (true positive rate) and specificity (true negative rate), two correlated outcomes requiring bivariate models.

1

Bivariate/HSROC Model

Accounts for the correlation between sensitivity and specificity

2

SROC Curve

Summary ROC curve with 95% confidence and prediction regions

3

QUADAS-2

Quality Assessment of Diagnostic Accuracy Studies

Choosing the Right Method

Question	Method
Does A beat B?	Pairwise MA
Which of many treatments is best?	Network MA (NMA)
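What is the optimal dose?	Dose-Response MA
Who benefits most? (subgroups)	IPD MA
How accurate is this test?	DTA MA
How does the effect change over time?	Survival/Time-to-Event MA

The method must match the question. Never answer a question with the wrong method.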

Story: The Steroids-in-Sepsis Saga

Three large trials. Three different answers. What do you believe?

REAL DATA

CORTICUS (2008): 499 patients. Hydrocortisone in septic shock. No mortality benefit. ADRENAL (2018): 3,658 patients. Hydrocortisone. No mortality benefit. APROCCHSS (2018): 1,241 patients. Hydrocortisone + fludrocortisone. Mortality reduced (43% vs 49.1%, p=0.03). Same class of intervention. Different protocols. Different results.

The Guideline Writer's Challenge

You are writing sepsis guidelines. Three major trials give inconsistent results. What do you recommend?

PATH A: Simple Average

Pool all three trials. Overall effect uncertain. Conclude "evidence unclear."

↓

Guidelines say steroids are optional. No strong recommendation.

OUTCOME: Clinicians left without clear guidance

PATH B: Investigate Heterogeneity

Analyze why APROCCHSS differed (fludrocortisone, longer duration, different population)

↓

Identify what distinguishes the effective protocol from the ineffective ones.

OUTCOME: Recommend the specific effective protocol

THE REVELATION

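Conflicting trials are not failures. They are a map of where a treatment works and where it does not. The differences between trials (dose, duration, co-interventions, populations) are the key to understanding.

Module 12 Quiz

1. What is the main advantage of network meta-analysis over pairwise analysis?

A. It requires no data extraction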

B. It compares treatments not directly tested against each other

C. It eliminates the need for risk-of-bias assessment

D. It produces better forest plots

2. Why does DTA meta-analysis require bivariate models?

A. To handle more than two studies

B. To adjust for publication bias

C. Sensitivity and specificity are correlated

D. To generate forest plots

3. What does the "consistency" assumption in NMA require?

A. All studies must be high quality

B. Direct and indirect evidence must agree

C. Sample sizes must be similar

D. No missing studies

Methodologist

Course Ecosystem

This course covers the complete systematic review workflow. For deeper dives, explore the companion courses:

DTA Course
Bivariate/HSROC, SROC curves, QUADAS-2

Risk of Bias Mastery
RoB 2, ROBINS-I/E, domain-level assessment

GRADE Certainty
Full SoF tables, GRADE-CERQual

IPD Meta-Analysis
One-stage/two-stage, mixed-effects models

Publication Bias Detective
Copas, PET-PEESE, p-curve, selection models

Umbrella Reviews
AMSTAR 2, ROBIS, overlap correction

Prognostic Reviews
CHARMS, PROBAST, c-statistic pooling

Living Reviews + Rapid Reviews
TSA, update triggers, abbreviated methods

Module 12 Complete

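"The method must match the question. Advanced methods answer advanced questions, but the fundamentals never change."

You have mastered the core workflow. The next ten modules explore the frontier: Bayesian inference, network meta-analysis, individual patient data, dose-response modeling, robustness and fragility, equity, AI-assisted synthesis, qualitative evidence, multivariate methods, and reproducibility.

Not every signal is real.

Module 13: The Bayesian Turn

Not every signal is real.

Module 13: The Bayesian Turn

🎯 Learning Objectives

Explain the differences between frequentist and Bayesian inference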
Interpret prior distributions, likelihoods, and posterior distributions
Distinguish credible intervals from confidence intervals
Understand when Bayesian meta-analysis offers advantages
Recognize how prior choice affects conclusions

Story Opening: STAMPEDE

In 2005, a trial began

that would never truly end.

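The STAMPEDE trial in prostate cancer used a multi-arm, multi-stage (MAMS) platform design. Arms could be added or dropped as evidence accumulated. Although its statistics are frequentist, the adaptive philosophy embodies the Bayesian spirit: update decisions as data accumulate.

The Frequentist Worldview

In frequentist statistics, probability means long-run frequency. A 95% CI does not mean "there is a 95% probability that the true effect lies inside." It means: if we repeated the study indefinitely, 95% of the intervals would contain the truth.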

p-value

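P(data | H₀), not P(H₀ | data)

95% CI

A coverage property, not a belief

Fixed

The true parameter is fixed

The Bayesian Worldview

In Bayesian statistics, probability represents degree of belief. We start with a prior (what we believed before the data), update with the likelihood (what the data tell us), and obtain the posterior (updated belief).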

1

Prior × Likelihood ∝ Posterior

Bayes' theorem: P(θ|data) ∝ P(data|θ) × P(θ)

2

Credible Intervals

A 95% credible interval can be interpreted probabilistically, conditional on the specified model and prior.
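For the special case where both prior and likelihood are normal, Bayes' theorem has a closed form, which makes the update easy to see without MCMC. The pooled log-OR, its standard error, and both priors below are hypothetical; a full Bayesian meta-analysis would also place a prior on τ.

```python
import math
from statistics import NormalDist


def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Conjugate normal-normal update: precisions add, means are precision-weighted."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / std_error ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, math.sqrt(post_var)


# Hypothetical pooled log odds ratio and standard error from a meta-analysis.
log_or, se = -0.30, 0.15

for label, prior_sd in (("vague, Normal(0, 100^2)", 100.0),
                        ("weakly informative, Normal(0, 1)", 1.0)):
    mean, sd = normal_posterior(0.0, prior_sd, log_or, se)
    posterior = NormalDist(mean, sd)
    lower, upper = posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)
    prob_benefit = posterior.cdf(0.0)   # P(log-OR < 0 | data, prior)
    print(f"{label}: OR {math.exp(mean):.2f} "
          f"(95% CrI {math.exp(lower):.2f} to {math.exp(upper):.2f}), "
          f"P(benefit) = {prob_benefit:.3f}")
```

The direct probability statement in the last column is what a credible interval permits and a confidence interval does not; it is only as trustworthy as the model and prior behind it.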

Researcher

Choosing Priors

1

Non-informative (Vague)

Normal(0, 10000) or uniform. Lets the data dominate. Mimics frequentist results.

2

Weakly Informative

Normal(0, 1) for log-OR. Regularizes extreme estimates while remaining flexible.

3

Informative

Based on previous evidence. Powerful but controversial. Must be pre-specified.

4

Half-Cauchy for τ

Recommended for heterogeneity. Half-Cauchy(0, 0.5) allows large τ but concentrates near zero.

Researcher

MCMC Sampling

Most Bayesian models cannot be solved analytically. We use Markov Chain Monte Carlo (MCMC) to draw samples from the posterior. Tools: JAGS, Stan, brms (R), PyMC (Python).

Chains

Multiple independent chains (typically 4)

R̂

Convergence: R̂ < 1.01 (strict; older texts use < 1.1)

ESS

Bulk-ESS > 400 for means; tail-ESS > 400 for CIs

Methodologist

Bayesian Model Averaging

Instead of choosing between fixed-effect and random-effects models, Bayesian model averaging (BMA) weights each model by its posterior probability. This accounts for model uncertainty in the final estimate.

BF

Bayes Factors

BF₁₀ > 10 = strong evidence for H₁. BF₁₀ < 1/10 = strong evidence for H₀.

Interactive: Posterior Visualizer

Adjust the prior strength to see how it affects the posterior. Watch how more data overwhelm the prior.

Prior Strength: Vague

Prior Mean (log-OR): 0.00

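The STAMPEDE Story

STAMPEDE launched in 2005 with 5 research arms comparing treatments for advanced prostate cancer. By 2016 it had added abiraterone, which reduced mortality by 37% (HR 0.63, 95% CI 0.52–0.76).

The platform design embodies Bayesian-style adaptive thinking: interim analyses guide arm selection, new arms can enter as treatments emerge, and ineffective arms are dropped early, sparing patients ineffective treatment.

STAMPEDE has recruited more than 10,000 patients at over 100 centers and fundamentally changed prostate cancer care. A Bayesian mindset lets evidence accumulate and inform decisions in real time.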

Decision Tree: When to Go Bayesian?

Frequentist vs Bayesian Meta-Analysis

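Choose Bayesian when: (1) you have genuine prior information, (2) you need probability statements ("80% chance the effect is > 0"), (3) there are so few studies that frequentist properties are unreliable, or (4) you want model averaging.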

Bayesian with weakly informative prior

A common practical default. Regularizes extreme estimates without forcing strong prior conclusions.

Bayesian with informative prior

Only when prior evidence is strong and pre-specified. Sensitivity analysis is mandatory.

Stay frequentist

Simpler, well-understood. Preferred when k is large and no prior information.

Remember Module 1?

CAST Through a Bayesian Lens

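If CAST were analyzed in a Bayesian framework with an informative prior from basic science (antiarrhythmic drugs suppress PVCs), the posterior would still shift strongly toward harm. With enough data, even a strong prior yields to the likelihood. The lesson: Bayesian methods do not protect against bad priors, but they make assumptions transparent.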

Module 13 Quiz

Q1. What does a 95% Bayesian credible interval mean?

A. 95% of repeated experiments would produce intervals containing the true value

B. There is a 95% probability that the true parameter lies within this interval

C. The interval has a 95% chance of being correct

D. 95% of future data will fall within this range

Q2. What is the recommended prior for between-study heterogeneity (τ)?

A. Uniform(0, 100)

B. Normal(0, 1)

C. Half-Cauchy(0, 0.5)

D. Fixed at 0.5

Module 13 Complete

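"The Bayesian turn is not about mathematics. It is about honesty: making our assumptions visible."

Not every signal is real.

Module 14: The Network

Methods protect patients from our trust.

Module 14: The Network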

🎯 Learning Objectives

Explain why pairwise comparisons are insufficient when many treatments exist
Interpret network geometry (nodes, edges, thickness)
Understand transitivity, consistency, and indirect evidence
Interpret SUCRA rankings and league tables
Recognize when NMA assumptions are violated

A clinician faces a patient

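with depression. Which drug?

There are 21 commonly used antidepressants. Most head-to-head trials compare only 2 or 3. Cipriani et al. (2018, Lancet) connected 522 trials and 116,477 patients into a single network.

The Logic of Network Meta-Analysis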

1

Direct Evidence

Trials directly comparing A vs B give the most reliable estimate.

2

Indirect Evidence

If A vs C and B vs C exist, we can infer A vs B. This rests on the "transitivity" assumption.

3

Mixed Evidence

NMA combines both, weighted by precision, to rank all treatments simultaneously.
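The three steps above can be sketched for a single A-B-C loop. This is the simple adjusted indirect comparison (often attributed to Bucher) followed by an inverse-variance combination with the direct estimate; all numbers are hypothetical log odds ratios, and a full NMA fits every comparison in the network jointly rather than loop by loop.

```python
import math


def indirect_comparison(d_ac, se_ac, d_bc, se_bc):
    """Indirect A vs B through common comparator C; the variances add."""
    return d_ac - d_bc, math.sqrt(se_ac ** 2 + se_bc ** 2)


def combine(direct, se_direct, indirect, se_indirect):
    """Inverse-variance weighted mix of direct and indirect estimates."""
    w_direct = 1 / se_direct ** 2
    w_indirect = 1 / se_indirect ** 2
    estimate = (w_direct * direct + w_indirect * indirect) / (w_direct + w_indirect)
    return estimate, math.sqrt(1 / (w_direct + w_indirect))


# Hypothetical log odds ratios (negative favors the first-named drug).
d_ind, se_ind = indirect_comparison(d_ac=-0.50, se_ac=0.30, d_bc=-0.20, se_bc=0.40)
d_mix, se_mix = combine(direct=-0.10, se_direct=0.35, indirect=d_ind, se_indirect=se_ind)

print(f"indirect A vs B: {d_ind:.2f} (SE {se_ind:.2f})")
print(f"mixed A vs B:    {d_mix:.2f} (SE {se_mix:.2f})")
```

Because the variances add, an indirect estimate is always less precise than either of the comparisons it is built from, which is why direct evidence usually dominates the mixed estimate when both exist.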

Interactive: Network Graph

Each node is a treatment. Edge thickness represents the number of studies comparing the two treatments.

Researcher

Transitivity & Consistency

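Transitivity: the indirect estimate (via a common comparator) should approximate the direct estimate. This requires effect modifiers to be similarly distributed across comparisons.

Consistency: statistical tests comparing direct and indirect evidence. Global (design-by-treatment interaction) and local (node-splitting) tests help identify inconsistent loops.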

Researcher

SUCRA & P-scores

SUCRA

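Surface under the cumulative ranking curve. Higher values indicate a higher probability of ranking well, but do not guarantee superiority.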

P-score

The frequentist analogue of the ranking-probability summary. Interpret alongside effect sizes and uncertainty.

Caution: Ranking is seductive but misleading when differences between treatments are small or uncertain. Always report credible/confidence intervals alongside ranks.
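A short sketch of how SUCRA is computed from rank probabilities: for each treatment, average the cumulative probabilities of being in the top k ranks for k = 1 to a − 1, where a is the number of treatments. The probabilities below are hypothetical; in practice they come from the posterior draws of a fitted NMA.

```python
def sucra(rank_probabilities):
    """rank_probabilities[name][k] is the probability of rank k + 1 (index 0 = best)."""
    scores = {}
    for name, probs in rank_probabilities.items():
        n_treatments = len(probs)
        cumulative = 0.0
        total = 0.0
        # Sum cumulative probabilities over ranks 1 .. a-1 (the last always equals 1).
        for p in probs[:-1]:
            cumulative += p
            total += cumulative
        scores[name] = total / (n_treatments - 1)
    return scores


# Hypothetical rank probabilities for three treatments (each row sums to 1).
ranks = {
    "Drug A": [0.55, 0.35, 0.10],
    "Drug B": [0.40, 0.45, 0.15],
    "Placebo": [0.05, 0.20, 0.75],
}

for name, score in sucra(ranks).items():
    print(f"{name}: SUCRA = {score:.2f}")
```

In this made-up example the two drugs end up with similar SUCRA values despite different rank profiles, which illustrates the caution above: a ranking gap can be small even when the ordering looks decisive.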

Methodologist

Component NMA

When interventions are complex (e.g., behavioral + pharmacological), component NMA decomposes multi-component treatments to estimate the individual contribution of each component. Uses additive models: effect(A+B) = effect(A) + effect(B) + interaction.

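The Cipriani Network

The 2018 Lancet analysis found that all 21 antidepressants were more effective than placebo. Amitriptyline, mirtazapine, and venlafaxine ranked highest on efficacy. Agomelatine, fluoxetine, and escitalopram ranked highest on acceptability (fewest dropouts).

No single drug "wins" on every outcome. The network reveals trade-offs that are invisible in pairwise analyses.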

Decision Tree: Is NMA Appropriate?

NMA Feasibility Check

You have 15 RCTs comparing 6 different statins. Some pairs have direct evidence, others do not.

Check transitivity, then fit NMA

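Verify that patient populations and study designs are sufficiently similar across comparisons.

Ignore indirect evidence

Loses statistical power and leaves gaps in the evidence base.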

Pool all into one pairwise comparison

Violates the structure of the evidence. The statins are different drugs.

Module 14 Quiz

Q1. What assumption must hold for indirect evidence to be valid in NMA?

A. Transitivity — effect modifiers are balanced across comparisons

B. Homogeneity — I² must be below 25%

C. All studies must have similar sample sizes

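D. All studies must be double-blind

Module 14 Complete

"The network sees what pairwise comparisons cannot: the whole picture of treatment choice."

Not every signal is real.

Module 15: The Individual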

What was hidden in plain sight?

Module 15: The Individual

🎯 Learning Objectives

Explain why aggregate data can mask treatment–covariate interactions
Distinguish one-stage from two-stage IPD models
Recognize ecological bias in aggregate meta-analysis
Understand the practical challenges of IPD collection
Interpret treatment–covariate interaction plots

For decades, breast cancer trials

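published summaries. Not patients.

The Early Breast Cancer Trialists' Collaborative Group (EBCTCG) collected individual records of more than 100,000 women across hundreds of trials. Their IPD meta-analysis showed that the benefit of tamoxifen depends heavily on estrogen receptor status, which was invisible in aggregate data.

What the Summaries Hid

Every published tamoxifen trial reported overall results. Across hundreds of studies, tamoxifen appeared to offer a modest benefit. But that "modest benefit" was an average that masked a profound truth.

The Hidden Subgroup Split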

RR 0.59

ER-positive subgroup: 41% reduction in recurrence

RR 0.97

ER-negative subgroup: essentially no benefit at all

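The overall pooled effect, which mixes responders and non-responders, is a statistical fiction. The "modest" average understated the benefit in one group while implying a benefit in the other that did not exist.

Aggregate vs Individual Patient Data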

AD

Aggregate: published effect + CI only

IPD

Individual: raw patient-level records

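IPD allows: (1) consistent outcome definitions, (2) subgroup analysis by patient characteristics, (3) time-to-event modeling, (4) checking for ecological bias. It is the gold standard for exploring treatment effect modification.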

Researcher

One-Stage vs Two-Stage IPD

1

Two-Stage

Analyze each study separately, then combine estimates (like standard MA). Simple but loses information.

2

One-Stage

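Fit a single mixed-effects model to all patient data simultaneously. More powerful for interactions and rare events.

Key: Both must account for clustering by study. Never pool IPD as if it came from one large trial; this introduces confounding (Simpson's paradox).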

Methodologist

Ecological Bias

A meta-regression using study-level mean age might show older patients benefit more. But this could be ecological bias: the study-level association may not reflect the patient-level truth. Only IPD can separate within-study from between-study effects.

当整体与部分相关时

辛普森悖论：当数据按混杂变量分组时，聚合数据中出现的趋势会逆转。

实践中的悖论

A mega-trial analysis found Treatment X beneficial overall. But 每个研究，它是有害的。如何？研究之间基线风险的差异造成了一种错觉——病情较重的人群碰巧接受了更多的治疗，从而夸大了总体效益。

Cates (2002, BMJ）表明，在不考虑聚类的情况下汇总研究可以扭转效果的明显方向。

这就是为什么 IPD 一阶段模型将研究作为聚类变量，以防止研究间混杂伪装成治疗效果。
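A small numeric sketch can make the paradox concrete. The counts below are hypothetical and were invented only for illustration; `studies`, `naive_pooled_rr`, and `two_stage_pooled_rr` are names chosen here, not part of any library. The code contrasts naive pooling (all patients treated as one trial) with a simple two-stage inverse-variance pooling of per-study log risk ratios.

```python
import math

# Hypothetical (events, patients) per arm; not real trial data.
studies = {
    "Study A (low risk)": {"treated": (90, 900), "control": (8, 100)},
    "Study B (high risk)": {"treated": (40, 100), "control": (315, 900)},
}

def risk_ratio(events_t, n_t, events_c, n_c):
    """Risk ratio of treated versus control."""
    return (events_t / n_t) / (events_c / n_c)

def naive_pooled_rr(data):
    """Collapse all studies into one 2x2 table, ignoring study clustering."""
    et = sum(s["treated"][0] for s in data.values())
    nt = sum(s["treated"][1] for s in data.values())
    ec = sum(s["control"][0] for s in data.values())
    nc = sum(s["control"][1] for s in data.values())
    return risk_ratio(et, nt, ec, nc)

def two_stage_pooled_rr(data):
    """Inverse-variance fixed-effect pooling of per-study log risk ratios."""
    num = den = 0.0
    for s in data.values():
        (et, nt), (ec, nc) = s["treated"], s["control"]
        log_rr = math.log(risk_ratio(et, nt, ec, nc))
        var = 1 / et - 1 / nt + 1 / ec - 1 / nc  # large-sample variance of log RR
        num += log_rr / var
        den += 1 / var
    return math.exp(num / den)

for name, s in studies.items():
    print(name, round(risk_ratio(*s["treated"], *s["control"]), 2))
print("naive pooled RR:", round(naive_pooled_rr(studies), 2))
print("two-stage pooled RR:", round(two_stage_pooled_rr(studies), 2))
```

In this made-up example both studies show more events with treatment, yet the collapsed table suggests a large benefit, because most treated patients come from the low-risk study.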

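The EBCTCG Legacy

For 40 years, the EBCTCG's IPD meta-analyses have defined breast cancer treatment. Their 2005 analysis of tamoxifen versus no treatment showed a clear benefit in ER-positive tumors (RR 0.59) but none in ER-negative tumors (RR 0.97).

Without IPD, only the overall effect pooled across both groups would have been visible, diluting the benefit and potentially hiding its true magnitude from ER-positive patients.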

Decision Tree: When Is IPD Worth Pursuing?

Do you suspect treatment–covariate interactions?

Yes →

Can you obtain IPD from >80% of trials?

Yes → One-stage IPD meta-analysis with interaction terms

No → Two-stage: request the available IPD and use aggregate data for the rest

No →

Is ecological bias a concern?

Yes → IPD preferred even without interactions

No → Aggregate data meta-analysis may suffice

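The EBCTCG collected data from hundreds of trials over 40 years. Most IPD meta-analyses involve 5–20 trials. The decision depends on the question, not on ambition.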

Methodologist

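The Pattern Repeats

Remember Module 3? HRT appeared beneficial in observational studies but harmful in randomized controlled trials. The same aggregate masking occurred: an overall benefit concealed subgroup harm.

IPD analyses of the Women's Health Initiative later showed that timing mattered: women who started HRT within 10 years of menopause had different outcomes from those who started later. The “timing hypothesis” was invisible in the published aggregate summaries.

The lesson recurs: aggregate data can mask critical treatment–covariate interactions. Whether it is ER status in breast cancer or the timing of HRT, individual-level data reveal what summaries hide.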

Module 15 Quiz

Q1. What is the main advantage of IPD over aggregate-data meta-analysis?

A. It always includes more studies

B. It is cheaper and faster

C. It can explore treatment–covariate interactions without ecological bias

D. It eliminates the need for random-effects models

Module 15 Complete

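“Behind every pooled estimate are individuals whose stories the aggregate cannot tell.”

Heterogeneity is a message, not noise.

Module 16: The Dose

Heterogeneity is a message, not noise.

Module 16: The Dose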

🎯 Learning Objectives

Explain why simple pairwise comparisons miss dose–response relationships
Distinguish linear, quadratic, and spline dose–response models
Interpret restricted cubic splines with knots
Identify threshold effects and J/U-shaped curves
Understand model comparison with AIC/BIC

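For decades, moderate drinking

seemed to protect the heart.

The “J-shaped curve” showed higher cardiovascular mortality in non-drinkers than in moderate drinkers. But Stockwell et al. (2016) demonstrated that the J-curve was an artifact of misclassifying former drinkers, many of whom had quit because of illness, as “abstainers”.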

A Scientific Consensus Built on Sand

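By 2010, more than 100 observational studies had confirmed the J-curve. Medical textbooks taught it. Cardiologists cited it. Wine-industry lobbyists funded conferences around it.

100+

Observational studies confirming the J-curve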

15–25%

Lower cardiovascular mortality in moderate drinkers vs abstainers

The evidence seemed overwhelming. But what if the comparison group, the “abstainers”, was contaminated?

The Sick Quitters

A Hidden Confounder

The Problem

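People who stop drinking often do so because they are already ill: liver disease, drug interactions, a cancer diagnosis. In most studies these “former drinkers” were classified as “abstainers”.

The Effect: The reference group (abstainers) appeared less healthy, not because abstaining is harmful, but because sick people had joined the abstainers.

When Stockwell et al. (2016, J Stud Alcohol Drugs) removed former drinkers and applied appropriate study-quality corrections, the J-curve disappeared. The protective effect was an illusion.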

Dose–Response Meta-Analysis

Standard meta-analysis asks: "Does treatment X work?" Dose–response meta-analysis asks: "At what dose does treatment X work best?" It models the relationship between dose level and outcome across multiple studies.

Linear

Simplest: log(RR) = β × dose

Spline

Flexible: piecewise polynomials with knots

Fractional

Polynomial: dose^p1 + dose^p2
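As a rough illustration of the linear model above, the slope β in log(RR) = β × dose can be estimated by inverse-variance weighted least squares through the origin. This is a simplified sketch: the function name and the data are hypothetical, and it ignores the correlation between dose-specific estimates that share a reference group, which fuller methods (for example the Greenland–Longnecker approach) account for.

```python
import math

def linear_dose_response_slope(doses, log_rrs, variances):
    """Weighted least-squares slope through the origin: log(RR) = beta * dose."""
    weights = [1.0 / v for v in variances]
    num = sum(w * d * y for w, d, y in zip(weights, doses, log_rrs))
    den = sum(w * d * d for w, d in zip(weights, doses))
    beta = num / den
    se = math.sqrt(1.0 / den)  # standard error under the independence simplification
    return beta, se

# Hypothetical dose levels and log risk ratios versus the zero-dose reference.
doses = [10.0, 20.0, 40.0]
log_rrs = [0.05, 0.12, 0.22]
variances = [0.01, 0.01, 0.02]

beta, se = linear_dose_response_slope(doses, log_rrs, variances)
print(f"RR per 10 units of dose: {math.exp(10 * beta):.3f}")
```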

Researcher

Restricted Cubic Splines

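RCS place knots at pre-specified dose points and fit smooth polynomials between them. Typically 3–5 knots are placed at quantiles of the dose distribution, and the curve is constrained to be linear beyond the boundary knots. A test of non-linearity compares the spline model with the simpler linear model.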

AIC

Model Comparison

AIC/BIC compare the linear fit with the spline fit; lower is better. Also test for departure from linearity (the p-value for the spline terms).
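The comparison can be sketched with a quadratic curve standing in for the spline (a deliberate simplification; a real analysis would fit restricted cubic splines). When sampling variances are treated as known, −2 log-likelihood equals the weighted residual sum of squares plus a constant, so AIC can be compared as weighted RSS + 2 × (number of parameters). All names and data here are hypothetical.

```python
def fit_through_origin(doses, log_rrs, variances, quadratic=False):
    """Weighted least squares for log(RR) = b1*d (+ b2*d^2). Returns (coefs, weighted RSS)."""
    w = [1.0 / v for v in variances]
    s2 = sum(wi * d ** 2 for wi, d in zip(w, doses))
    sy1 = sum(wi * d * y for wi, d, y in zip(w, doses, log_rrs))
    if not quadratic:
        coefs = (sy1 / s2, 0.0)
    else:
        s3 = sum(wi * d ** 3 for wi, d in zip(w, doses))
        s4 = sum(wi * d ** 4 for wi, d in zip(w, doses))
        sy2 = sum(wi * d ** 2 * y for wi, d, y in zip(w, doses, log_rrs))
        det = s2 * s4 - s3 * s3  # solve the 2x2 normal equations by Cramer's rule
        coefs = ((sy1 * s4 - s3 * sy2) / det, (s2 * sy2 - s3 * sy1) / det)
    rss = sum(wi * (y - coefs[0] * d - coefs[1] * d ** 2) ** 2
              for wi, d, y in zip(w, doses, log_rrs))
    return coefs, rss

def aic(weighted_rss, n_params):
    """AIC up to an additive constant (known sampling variances)."""
    return weighted_rss + 2 * n_params

# Hypothetical J-shaped data: small apparent benefit at low dose, harm at high dose.
doses = [5.0, 10.0, 20.0, 40.0, 60.0]
log_rrs = [-0.09, -0.15, -0.20, 0.00, 0.60]
variances = [0.01] * 5

_, rss_lin = fit_through_origin(doses, log_rrs, variances)
_, rss_quad = fit_through_origin(doses, log_rrs, variances, quadratic=True)
print("AIC linear:", round(aic(rss_lin, 1), 1))
print("AIC quadratic:", round(aic(rss_quad, 2), 1))
```

A lower AIC for the curved model is only a statistical statement about fit; it says nothing about whether the reference category behind the curve is clean.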

Interactive: Dose–Response Builder

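Compare linear, quadratic, and spline fits. Watch how the shape of the model changes under different assumptions.

The Alcohol J-Curve Debunked

Stockwell's 2016 reanalysis found that once former drinkers were correctly excluded from the “abstainer” reference group, the protective effect of moderate drinking disappeared. The J-curve was driven by sick-quitter bias.

Dose–response meta-analysis exposed the truth: the shape of the curve depended largely on how “zero dose” was defined. The wrong reference category created a spurious benefit.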

When Curves Shape Policy

The phantom J-curve influenced alcohol guidelines worldwide:

UK

NHS Guidance (until 2016)

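“Moderate drinking protects the heart” appeared in official guidance. After Stockwell's correction, the UK revised its limit to 14 units per week for all drinkers (previously 21 units for men). No amount was declared “safe”.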

US

Dietary Guidelines Advisory Committee

Cited the J-curve research in 2015. The 2020 committee recommended lowering the limit for men to 1 drink/day, acknowledging reference-group bias.

AU

Australian Guidelines

Safe drinking limits were delayed by industry-funded J-curve research promoting “cardioprotective” moderate intake.

Decision Tree: Is Dose-Response Analysis Appropriate?

Do you have ≥3 exposure levels (not just exposed vs. unexposed)?

Yes →

Does the relationship appear to be non-linear?

Yes → Restricted cubic splines (3–5 knots). Compare AIC with linear model.

No → Linear dose-response meta-regression may suffice

No →

Standard pairwise meta-analysis (no dose-response possible with only two levels)

Warning: Always check whether your reference category is clean. The J-curve lesson: a contaminated reference group produces phantom non-linearity.

Module 16 Quiz

Q1. What makes restricted cubic splines useful in dose–response meta-analysis?

A. They always produce a straight line

B. They flexibly capture non-linear dose–response curves

C. They reduce the number of studies required

D. They simplify the model to fewer parameters

Module 16 Complete

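“The dose makes the poison. The shape of the curve reveals whether the poison is real.”

Absence of evidence is not evidence of absence.

Module 17: Fragility

Absence of evidence is not evidence of absence.

Module 17: Fragility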

🎯 Learning Objectives

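Calculate and interpret the fragility index
Use GOSH plots to identify influential studies and subset effects
Interpret contour-enhanced funnel plots
Apply Copas selection models and PET-PEESE to publication bias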
Understand how sensitivity analyses strengthen meta-analytic conclusions

Governments stockpiled billions

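on evidence they could not see.

After H1N1 influenza, governments spent billions of dollars stockpiling oseltamivir (Tamiflu). The Cochrane team (Jefferson et al. 2014) fought for years to obtain the unpublished data. When they finally did, the evidence for preventing complications vanished.

The Fragility Index

The fragility index asks: "How many patients would need to change outcome to flip a statistically significant result to non-significant?" It iteratively adds events to the group with fewer events (converting non-events to events) until p ≥ 0.05.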

FI = 1

Extremely fragile. One patient flip changes conclusion.

FI > 8

Reasonably robust. Less sensitive to individual outcomes.
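The procedure described above can be sketched in a few lines. This is a minimal illustration, not a validated tool: it assumes a two-sided Fisher exact test (the usual choice for the fragility index), and `fisher_exact_p` and `fragility_index` are helper names written here using only the standard library.

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

def fragility_index(events_t, n_t, events_c, n_c, alpha=0.05):
    """Events added to the arm with fewer events until p >= alpha (0 if not significant)."""
    def p_value(et, ec):
        return fisher_exact_p(et, n_t - et, ec, n_c - ec)

    if p_value(events_t, events_c) >= alpha:
        return 0
    add_to_treatment = events_t <= events_c
    flips = 0
    while True:
        if add_to_treatment:
            events_t += 1
        else:
            events_c += 1
        flips += 1
        # Safety stop: also return if the arm has no non-events left to convert.
        if p_value(events_t, events_c) >= alpha or events_t >= n_t or events_c >= n_c:
            return flips

# Counts taken from the module quiz example: 12/200 vs 25/200 events.
print(fragility_index(12, 200, 25, 200))
```

Different significance tests (chi-square, Fisher with other two-sided conventions) can give slightly different indices, so the test used should always be reported alongside the number.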

Interactive: Fragility Calculator

Enter a 2×2 table to calculate the fragility index. Watch events shift until significance flips.

Events

Total N

Treatment

Control

Researcher

GOSH Plots

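The Graphical Display of Study Heterogeneity (GOSH) fits the meta-analytic model to every possible subset of studies. Each point plots the pooled effect of one subset against its I². Distinct clusters suggest distinct subgroups; an outlying cloud suggests that one study is driving the heterogeneity.

For k studies, there are 2^k − 1 subsets. For k > 15, random sampling is used.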
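The subset enumeration behind a GOSH plot can be sketched as follows. This is an illustrative simplification that uses a fixed-effect model and hypothetical data; `gosh_points` is a name made up here, and dedicated packages handle random-effects models, sampling for large k, and plotting.

```python
from itertools import combinations

def gosh_points(effects, variances):
    """Fixed-effect pooled estimate and I^2 for every non-empty subset of studies."""
    k = len(effects)
    points = []
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            w = [1.0 / variances[i] for i in subset]
            y = [effects[i] for i in subset]
            pooled = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
            q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
            df = size - 1
            # I^2 is undefined for a single study; report 0 there.
            i2 = max(0.0, (q - df) / q) if df > 0 and q > 0 else 0.0
            points.append((pooled, i2))
    return points

effects = [0.10, 0.15, 0.12, 0.60, 0.08]      # hypothetical log risk ratios
variances = [0.02, 0.03, 0.025, 0.02, 0.04]   # hypothetical sampling variances
points = gosh_points(effects, variances)
print(len(points))  # 2**5 - 1 = 31 subsets
```

Plotting the pooled estimate against I² for these points would show whether subsets containing the fourth (outlying) study form a separate cloud.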

Researcher

Contour-Enhanced Funnel Plots

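Standard funnel plots show effect size against standard error. Contour-enhanced versions add shaded regions for p < 0.01, p < 0.05, and p < 0.10. If the missing studies would fall in the non-significant regions, publication bias is plausible. If they would fall in the significant regions, other causes, such as study quality, may explain the asymmetry.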

Methodologist

Copas Selection & PET-PEESE

1

Copas Selection Model

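Models the probability that a study is published as a function of its SE and effect size. Jointly estimates the true effect and the selection mechanism.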

2

PET-PEESE

Precision-Effect Test (PET): regress effects on SE. If intercept = 0, no true effect. PEESE uses SE² for better performance when a true effect exists.
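The two regressions can be sketched with a plain weighted least-squares fit, using inverse-variance weights. The helper names below are hypothetical, and the sketch omits the significance test on the PET intercept that usually decides whether the PET or the PEESE estimate is reported.

```python
def wls_line(x, y, weights):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(weights)
    xbar = sum(w * xi for w, xi in zip(weights, x)) / sw
    ybar = sum(w * yi for w, yi in zip(weights, y)) / sw
    sxx = sum(w * (xi - xbar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - xbar) * (yi - ybar) for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def pet_peese(effects, std_errors):
    """Return (PET intercept, PEESE intercept): bias-adjusted effect estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pet_intercept, _ = wls_line(std_errors, effects, weights)
    peese_intercept, _ = wls_line([se ** 2 for se in std_errors], effects, weights)
    return pet_intercept, peese_intercept

# Hypothetical studies in which smaller (higher-SE) studies report larger effects.
std_errors = [0.05, 0.10, 0.20, 0.30, 0.40]
effects = [0.12, 0.18, 0.35, 0.50, 0.61]
pet, peese = pet_peese(effects, std_errors)
print(f"PET intercept: {pet:.3f}, PEESE intercept: {peese:.3f}")
```

The intercept is the predicted effect for a hypothetical study with SE = 0, which is why it is read as the estimate adjusted for small-study effects.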

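The Oseltamivir Saga

The original Roche-funded meta-analysis (Kaiser 2003) showed that oseltamivir reduced influenza complications by 67%. But 8 of the 10 trials were never published. Once Cochrane obtained the clinical study reports, the benefit for complications fell to a non-significant 11%.

Fragility is not only statistical; it is informational. The evidence base itself was missing most of its data.

Decision Tree: Interpreting Your Fragility Result

You have calculated the fragility index. What does the number mean?

FI ≤ 3

Highly fragile. A handful of different events would overturn the conclusion. Interpret with extreme caution.

FI 4–8

Moderately fragile. Sensitive to small perturbations. Are there unpublished trials that could change the picture?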

FI > 8

Relatively robust. But remember: fragility is only one dimension. Publication bias can undermine even robust results.

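Walsh et al. (2014, J Clin Epidemiol) found that among 399 randomized controlled trials published in top journals, the median fragility index was only 8, and more than 25% had FI ≤ 3. Landmark trials that shape clinical practice often hang by a statistical thread.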

Methodologist

Beyond the Index: Structural Fragility

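The oseltamivir saga reveals three types of fragility, and the fragility index captures only the first.

1

Statistical Fragility (FI)

How many events flip the p-value? This is what the fragility index measures. It quantifies sensitivity to individual patient outcomes.

2

Informational Fragility

How much of the evidence is hidden? Eight of Roche's ten oseltamivir trials were unpublished. The evidence base was structurally incomplete.

3

Analytical Fragility

How many researcher degrees of freedom could change the conclusion? Different outcome definitions, analysis populations, or statistical methods.

Callback to Module 10 (paroxetine): a reanalysis using different outcome definitions completely overturned the conclusion. That is analytical fragility; the FI was never calculated because the endpoint itself was in dispute. A full robustness assessment examines all three dimensions.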

Module 17 Quiz

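Q1. A trial has 200 patients per arm, with 12 events in the treatment arm and 25 in the control arm (p = 0.03). The fragility index is 3. What does this mean?

A. The effect size is exactly 3

B. Changing just 3 patient outcomes would flip the result to non-significant

C. The result is very robust, with 3 confirmatory studies

D. The study needs at least 3 patients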

Module 17 Complete

“The number that survives every attempt to break it is the number worth trusting.”

Not every signal is real.

Module 18: Equity

Certainty must be earned, not assumed.

Module 18: Equity

🎯 Learning Objectives

Identify how trial exclusion criteria create evidence gaps
Apply the PROGRESS-Plus framework to assess equity in the evidence
Use PRISMA-Equity reporting guidelines
Understand transportability: when trial findings fail in practice
Design equity-sensitive search and synthesis strategies

SPRINT proved tight blood pressure control

saves lives. But whose lives?

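The landmark SPRINT trial excluded patients with diabetes, prior stroke, and heart failure. More than 75% of US adults with hypertension would not have been eligible. The evidence is strong, but its scope is narrow.

Slide A: The Missing Majority

The Trial That Excluded Most Patients

SPRINT enrolled 9,361 patients and showed that intensive blood pressure control (target <120 mmHg) reduced cardiovascular events by 25% (HR 0.75, 95% CI 0.64–0.89). But the eligibility criteria tell a different story.

Who was excluded:

Diabetes: 35% of US adults with hypertension
Prior stroke: 8% of the hypertensive population
Symptomatic heart failure: 6% of hypertensive adults
Expected survival <3 years: the frailest patients
Nursing home residents: excluded entirely
GFR <20 mL/min: advanced kidney disease

Result: more than 75% of US adults with hypertension were ineligible. The evidence is strong. But for whom?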

Slide B: The Geography of Evidence

Where the Evidence Comes From

78%

of cardiovascular mega-trial participants came from high-income countries (2000–2020).

6%

from sub-Saharan Africa — where cardiovascular disease is rising fastest.

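Polypill trials: 4 of 5 trials were conducted in populations with a mean BMI <25. The US mean BMI is 30. Drug metabolism, comorbidity patterns, access to health care, and genetic variation all differ between populations. Efficacy in one population does not guarantee effectiveness in another.

Reference: Multinational trials and PROGRESS-Plus gaps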

PROGRESS-Plus Framework

P

Place of residence

R

Race / ethnicity

O

Occupation

G

Gender / sex

R

Religion

E

Education

S

SES (socioeconomic)

S

Social capital

Plus: Age, disability, sexual orientation, other vulnerable groups.

Researcher

PRISMA-Equity & Transportability

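PRISMA-Equity extends PRISMA to require reporting of how equity was addressed in the review: population characteristics, subgroup analyses by disadvantage, and an assessment of applicability to underserved populations.

Transportability: efficacy in a trial does not equal effectiveness in the real world. Methods exist for re-weighting trial data to match the distribution of a target population.

Slide C: The Transportability Problem

Researcher

From Trial to Real World: Transportability

Transportability asks: can results from trial population X be applied to target population Y? This is not a philosophical question; formal methods exist.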

1

Inverse Probability of Participation Weighting (IPPW)

Re-weights trial participants so they resemble the target population on key covariates.

2

Generalizability Index

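Quantifies how closely the trial sample resembles the target population on observed characteristics.

Stuart et al. (2015, Stat Med): when the SPRINT results were re-weighted to match the US hypertensive population, the estimated benefit was attenuated to HR 0.82 (versus 0.75 in the trial). The treatment still worked. But when the population changes, so does the magnitude.

SPRINT and the Missing Majority

SPRINT was a carefully designed trial of 9,361 patients. Its finding (HR 0.75 for intensive versus standard blood pressure control) changed guidelines worldwide. But subsequent analyses showed that the benefit was strongest in the subgroups most like the trial population and uncertain for the groups that had been excluded.

Equity in evidence synthesis means asking not only “Does it work?” but also “For whom does it work?”

Decision Tree: Equity Assessment for Your Review

ROOT: Does the evidence in your review come from populations similar to your target?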

YES → Good. But check: Are subgroups (age, sex, ethnicity, SES) reported separately?

Yes: Use subgroup effects for population-specific recommendations
No: Flag as limitation — equity gap in reporting

NO → Does PROGRESS-Plus analysis reveal differential effects?

Yes: Population-specific recommendations needed. Consider transportability re-weighting.
No: Cautious generalization with explicit equity statement in discussion

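Slide E: Callback to Module 3

Methodologist

Callback: The HRT Lesson Revisited

Remember Module 3? The HRT story showed that healthy-user bias can make a harmful treatment look beneficial. SPRINT may have the opposite problem: a “healthy volunteer” effect may make an effective treatment appear more effective than it would be in the real world.

Every meta-analysis should ask: Who was included? Who was excluded? Does it matter?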

Module 18 Quiz

Q1. What does the PROGRESS-Plus framework help reviewers assess?

A. Statistical heterogeneity

B. Equity and applicability across disadvantaged populations

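C. Internal validity of the included studies

D. Overall certainty of the evidence

Module 18 Complete

“Evidence that excludes the vulnerable cannot claim to serve them.”

Not every signal is real.

Module 19: The Machine

A number without provenance is not a number.

Module 19: The Machine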

🎯 Learning Objectives

Describe how AI/ML is used in systematic review screening
Explain active learning and human-in-the-loop workflows
Assess automation validation: recall, workload savings, and risk
Recognize the limitations and biases of algorithmic screening
Apply a responsible-AI framework in evidence synthesis

When COVID-19 hit,

papers arrived faster than humans could read.

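By 2021, more than 300,000 COVID-19 papers had appeared. Cochrane used machine learning classifiers to triage studies for rapid reviews, cutting screening workload by up to 70% while maintaining >95% recall.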

The Flood

By April 2020, 4,000 COVID preprints appeared every week.

PubMed indexed 500 new COVID articles per day.

Cochrane's screening queue hit 10,000 unreviewed titles.

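🔍 The Impossible Math

A pair of reviewers screens ~200 titles per day.

At 500 new articles/day, they fell further behind with every hour.

The living review was dead before it could live.

The First Attempts

The idea was not new. Cohen et al. (2006, JAMIA) first showed that machine learning could cut screening workload by 50% with less than a 5% loss in recall.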

📅

2006: Cohen et al. — SVM classifiers for drug class reviews. Proof of concept.

📅

2016: RobotReviewer (Marshall et al., JMLR) — ML for risk of bias assessment. Inter-rater reliability comparable to human reviewers.

📅

2021: ASReview (van de Schoot et al., Nature Machine Intelligence) — active learning that simulated 95% workload reduction.

But simulation is not reality. COVID would be the first real test at scale.

AI in Systematic Reviews

1

Screening Prioritization

Active learning ranks citations by relevance. Reviewers screen the most likely relevant first.

2

Data Extraction Assistance

NLP extracts PICO elements, outcomes, and results. Human verification is always required.

3

Risk of Bias Assessment

ML classifiers predict RoB domains. Experimental—human judgment remains gold standard.

Researcher

Validating Automation

Recall

>95% required. Missing 1 study can change conclusions.

WSS@95%

Work Saved over Sampling at 95% recall.

Stopping

When to stop screening? Consecutive irrelevant threshold.

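The fundamental tension: automation saves time but introduces new sources of error. Always report the tool, version, training data, and stopping criterion.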
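As a rough sketch of the WSS metric, the function below takes a list of relevance labels in the order the classifier ranked them and computes Work Saved over Sampling at a target recall, following the common definition WSS@R = (N − records screened to reach recall R) / N − (1 − R). The function name and the example rankings are hypothetical.

```python
import math

def wss_at_recall(ranked_labels, target_recall=0.95):
    """Work Saved over Sampling for a ranked list of 0/1 relevance labels."""
    n = len(ranked_labels)
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("no relevant records in the ranking")
    needed = math.ceil(target_recall * total_relevant)
    found = 0
    for screened, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            # Fraction of records left unscreened, minus what random sampling would save.
            return (n - screened) / n - (1 - target_recall)

# Hypothetical ranking: 100 records, 10 relevant, all ranked at the top.
ideal = [1] * 10 + [0] * 90
print(round(wss_at_recall(ideal), 2))
```

A ranking no better than chance gives a WSS near zero, and a ranking that buries the relevant records gives a negative value.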

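The Validation Crisis

🔍 The Validation Paradox

To know whether the machine missed a relevant study, you need a human to screen everything.

But if humans screen everything, why use the machine?

The solution: prospective holdout validation.

Random 10% sample screened by both human and machine
Compare: did the machine miss anything the humans found?
If recall drops below 95%, retrain and expand human screening

Trust, but verify. The machine earns its role; it does not inherit it.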

Cochrane's COVID Response

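Cochrane built its COVID-19 Study Register using machine learning classifiers trained on millions of records. The system achieved 99% sensitivity while cutting manual screening from weeks to days.

But the machine is a tool, not a replacement. Every included study was still verified by human reviewers. The lesson: AI augments reviewers; it does not replace them.

The Study That Was Almost Missed

In June 2020, the RECOVERY trial released its dexamethasone results: the first treatment proven to reduce COVID mortality (28-day mortality: 22.9% vs 25.7%, RR 0.83).

The preprint appeared on medRxiv with a non-standard title. Similar situations recurred throughout the pandemic: machine learning classifiers trained on existing terminology ranked unfamiliar framings low.

In several living reviews, human reviewers scanning the flagged titles recognized the key drug name and escalated a study the classifier had deprioritized.

Without those humans, a landmark treatment result might have waited weeks to enter living reviews.

Machines read faster. Humans read deeper. Neither is enough alone.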

Decision Tree: When Should You Use AI?

Will your review screen more than 5,000 titles?

Yes → Consider AI-assisted screening

Active learning prioritization. Dual-screen random 10% holdout. Stop when 3 consecutive batches yield 0 relevant studies.

Report: classifier type, training data, recall on holdout, stopping rule.

No → Manual screening is feasible

For <5,000 titles, dual human screening remains gold standard. AI adds complexity without proportionate benefit.

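Is this a living review or a rapid review?

If yes → AI is especially valuable. Continuous classifier retraining on new evidence. But: never let the machine make the final inclusion decision.

The Pattern Repeats

Methodologist

Remember Module 6? Poldermans' fabricated DECREASE data guided perioperative beta-blocker guidelines for a decade.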

AI can now detect statistical anomalies automatically:

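GRIM test: Is the reported mean possible given the sample size and integer-valued data?
SPRITE: Can the reported summary statistics be reconstructed from plausible individual data?
Statcheck: Do reported p-values match the test statistics?

These tools have found anomalies in hundreds of published papers, faster than any human auditor.

But the machine flags. The human judges. The decision to retract remains a deeply human one.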
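The idea behind the GRIM test is simple enough to sketch directly: if the underlying data are integers, the sum of n values is an integer, so only certain means are possible. The function below is an illustrative simplification (the name and tolerance handling are choices made here), and it accepts either rounding direction at the boundary.

```python
import math

def grim_consistent(reported_mean, n, decimals=2):
    """True if some integer total divided by n rounds to the reported mean."""
    half_unit = 0.5 * 10 ** (-decimals)
    approx_total = reported_mean * n
    # Only the two integers nearest to mean * n can possibly match.
    for total in (math.floor(approx_total), math.ceil(approx_total)):
        if abs(total / n - reported_mean) <= half_unit + 1e-12:
            return True
    return False

print(grim_consistent(5.18, 28))  # True: 145 / 28 = 5.1786, which rounds to 5.18
print(grim_consistent(5.19, 28))  # False: no integer total gives 5.19 with n = 28
```

A failed check is a flag for follow-up, not proof of misconduct; typos and unreported exclusions produce the same pattern.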

Module 19 Quiz

Q1. What is the minimum acceptable recall for AI-assisted screening in a systematic review?

A. 80%

B. 90%

C. >95%

D. 100%

Module 19 Complete

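“Machines read faster. Humans read deeper. Together they read the truth.”

Not every signal is real.

Module 20: The Qualitative

Methods protect patients from our confidence.

Module 20: The Qualitative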

🎯 Learning Objectives

Explain why some questions require qualitative evidence synthesis
Describe meta-ethnography (Noblit & Hare) and thematic synthesis
Apply the CERQual framework to assess confidence in qualitative findings
Understand mixed-methods synthesis approaches
Recognize when qualitative evidence changes practice

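The World Health Organization asked a question

that no randomized controlled trial could answer.

Why are women around the world subjected to disrespect and abuse during childbirth? Bohren et al. (2015) synthesized 65 qualitative studies from 34 countries into a framework of seven domains of mistreatment.

Slide A: Questions Beyond Randomization

Questions Beyond Randomization

In 2014, the World Health Organization convened a panel to address a global crisis: women were being physically abused, verbally humiliated, and neglected during childbirth. These were not rare events; reports came from 34 countries.

They needed to understand WHY. What drives disrespect and abuse in maternity care?

No RCT could answer that. You cannot randomize women to abusive versus respectful care. You cannot blind the birth attendant. You cannot measure “dignity” on a Likert scale. The evidence had to be qualitative.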

Meta-Ethnography

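Developed by Noblit & Hare (1988), meta-ethnography translates concepts across studies rather than pooling numbers. It builds new interpretive frameworks (third-order constructs) from first-order data (participant quotes) and second-order data (author interpretations).

Reciprocal

Studies confirm one another

Refutational

Studies contradict one another

Line of argument

Studies build a new theory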

What Bohren Found: A Taxonomy of Mistreatment

1. Physical abuse

Hitting, pinching, slapping during labor

2. Sexual abuse

Inappropriate touching, non-consensual procedures

3. Verbal abuse

Shouting, threats, judgmental comments

4. Stigma & discrimination

Based on HIV status, ethnicity, age, poverty

5. Professional standards failure

Neglect, lack of informed consent

6. Poor rapport

Poor communication, dismissiveness

7. Health system conditions

Overcrowding, understaffing, lack of supplies

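65 studies. 34 countries. The same patterns recurred across languages, cultures, and systems. This is not anecdote. This is synthesized evidence.

Researcher

CERQual: Confidence in Qualitative Evidence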

CERQual assesses confidence in qualitative review findings across four components:

1

Methodological Limitations

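Quality of the contributing studies.

2

Coherence

How well the data support the review finding.

3

Adequacy

Richness of the data (not just the number of studies).

4

Relevance

Applicability to the context of the review question.

Slide C: From Evidence to Action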

When Qualitative Evidence Changes Practice

Bohren's synthesis informed the WHO's 2018 Recommendations on Intrapartum Care for a Positive Childbirth Experience. Specific changes grounded in qualitative evidence:

Rec. 15

Companionship during labor

Rec. 1

Respectful maternity care

Rec. 3

Effective communication

Rec. 12

Emotional support

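These recommendations rest on qualitative evidence and now guide maternity care in 194 WHO member states. No forest plot could have produced them. No I² statistic could have revealed them.

Bohren's Framework of Mistreatment

The 2015 qualitative synthesis identified seven domains: physical abuse, sexual abuse, verbal abuse, stigma and discrimination, failure to meet professional standards, poor rapport, and health system conditions. The framework informed the WHO's recommendations on intrapartum care (2018).

No p-value can capture the experience of being slapped during labor. Qualitative synthesis gives voice to what numbers cannot.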

Decision Tree: When Is Qualitative Synthesis Appropriate?

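ROOT: Is your research question about experiences, perceptions, barriers, or facilitators?

YES → Is your question about how or why, not just “whether”?

Yes: Qualitative evidence synthesis (meta-ethnography, thematic synthesis, or framework synthesis)
No: Consider mixed methods: quantitative for the effect, qualitative for the mechanism

NO → Is your question about effectiveness or efficacy?

Yes: Quantitative meta-analysis
But: Supplement with a qualitative review of implementation barriers (assessed with CERQual)

Key insight: The strongest systematic reviews answer two questions: Does it work? (quantitative) and Why does it work or fail? (qualitative)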

Module 20 Quiz

Q1. What distinguishes meta-ethnography from quantitative meta-analysis?

A. It includes only 3–5 studies

B. It translates concepts across studies rather than pooling numbers

C. It does not require a systematic search

D. It is less rigorous than quantitative synthesis

Module 20 Complete

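“Not everything that counts can be counted. Not everything that can be counted counts.”

Heterogeneity is a message, not noise.

Module 21: The Multivariate

Heterogeneity is a message, not noise.

Module 21: The Multivariate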

🎯 Learning Objectives

Recognize when outcomes within a study are correlated
Explain multivariate random-effects models
Apply robust variance estimation (RVE) for dependent effect sizes
Understand three-level models for nested data
Choose between multivariate approaches based on data structure

Cardiovascular trials report

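mortality, MI, stroke, and more.

These outcomes are correlated within patients. A patient who dies cannot go on to have an MI endpoint. Standard meta-analysis treats each outcome independently, ignoring the dependence and potentially double-counting evidence.

Slide A: The Convenient Lie

The Assumption No One Questions

Open any standard meta-analysis textbook. The models assume that each study contributes one independent effect size. But reality is different.

A single cardiovascular trial reports mortality, myocardial infarction, stroke, and revascularization. A psychotherapy study reports depression, anxiety, and quality of life at 3, 6, and 12 months.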

30 trials

× 4 outcomes

= 120

effect sizes

Most analysts either (a) treat all 120 as independent, overstating precision by a factor of up to √4, or (b) pick one outcome and discard the rest. Both approaches are wrong.

The Dependence Problem

In standard pairwise meta-analysis, each study contributes one effect size. But many studies report multiple outcomes, subgroups, timepoints, or arms, creating dependent effect sizes. Ignoring this overstates precision and distorts inference.

RVE

Robust Variance Estimation. Sandwich estimator handles unknown correlation.

3-Level

Study → Outcome nesting modeled explicitly.

Researcher

Robust Variance Estimation

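RVE (Hedges, Tipton & Johnson, 2010) uses a sandwich-type estimator to provide valid standard errors regardless of the true correlation among dependent effects. The within-study correlations do not need to be known or estimated. Works best with ≥20 studies.

Small-sample correction: Tipton and Pustejovsky (2015) developed a small-sample correction for RVE (CR2) with Satterthwaite degrees of freedom for use when the number of clusters is small.

Slide B: The Mathematical Truth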

Researcher

What Dependence Does to Your Confidence Intervals

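If 4 outcomes from the same study have a within-study correlation of ρ = 0.5:

Treating as independent

CI width = X

Accounting for dependence

CI width = 1.58X

Your confidence interval should be 58% wider. Every meta-analysis that ignores this publishes falsely precise results.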
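The 1.58 figure follows from a standard result, sketched here under the simplifying assumption that the m estimates have equal variance σ² and a common pairwise correlation ρ. The variance of their mean is (σ²/m)(1 + (m − 1)ρ), compared with σ²/m under independence, so the standard error (and the CI width) is inflated by √(1 + (m − 1)ρ). The function name is chosen here for illustration.

```python
import math

def se_inflation(m, rho):
    """SE of the mean of m equicorrelated, equally precise estimates,
    relative to the SE obtained by wrongly assuming independence."""
    return math.sqrt(1 + (m - 1) * rho)

print(f"{se_inflation(4, 0.5):.2f}")  # 1.58: sqrt(1 + 3 * 0.5) = sqrt(2.5)
print(f"{se_inflation(4, 1.0):.2f}")  # 2.00: perfectly correlated outcomes, sqrt(4)
```

With ρ = 1 the four outcomes carry the information of a single estimate, which is where the √4 upper bound on the overstated precision comes from.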

RVE (Hedges, Tipton & Johnson, 2010): Uses a “sandwich” variance estimator that produces correct standard errors without needing to know the exact within-study correlation.

Researcher

Three-Level Models: Making Structure Explicit

1

Level 1: Sampling Variance

Measurement error within each effect size estimate.

2

Level 2: Within-Study Variance

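Outcomes and timepoints vary within a single study.

3

Level 3: Between-Study Variance

Studies differ from one another in populations, settings, and methods.

Example: In a meta-analysis of psychotherapy for depression (k = 50 studies, 180 effect sizes), 35% of the variance was within studies (different outcomes) and 65% was between studies (different therapies and populations). This decomposition reveals how much heterogeneity lies within versus between studies.

Methodologist

Three-Level Models: Formal Framework

When effects are nested (for example, multiple outcomes within a study, or studies within a research group), a three-level model partitions the variance into (1) sampling variance (level 1), (2) within-study variance (level 2), and (3) between-study variance (level 3). This borrows strength across levels while preserving correct inference.

The Cardiovascular Challenge

A meta-analysis of statins might include 30 trials, each reporting mortality, myocardial infarction, stroke, and revascularization. That is 120 effect sizes from 30 clusters. Treating them as 120 independent estimates overstates precision by a factor that depends on the within-study correlation.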

RVE or multivariate models handle this correctly—producing wider, honest confidence intervals.

Decision Tree: Which Approach for Dependent Effect Sizes?

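ROOT: Does your meta-analysis have multiple effect sizes per study?

YES → Do you know (or can you estimate) the within-study correlations?

Yes: Multivariate random-effects model (most efficient)
No: RVE with small-sample correction (robust to unknown correlations)

NO → Standard univariate random-effects model

Sub-question: Do your multiple effects come from different outcomes, timepoints, or subgroups?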

Different outcomes → Three-level model or RVE with clustering
Different timepoints → Network of timepoints with temporal correlation
Different subgroups → Consider if subgroups are meaningful or should be averaged

Module 21 Quiz

Q1. What problem does Robust Variance Estimation (RVE) solve?

A. Publication bias

B. Dependence among multiple effect sizes from the same study

C. Between-study heterogeneity

D. Small-study effects

Module 21 Complete

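“When outcomes are entangled, pretending they are independent is just a convenient lie.”

A number without provenance is not a number.

Module 22: The Proof

A number without provenance is not a number.

Module 22: The Proof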

🎯 Learning Objectives

Understand how computational errors propagate through policy
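Define reproducibility and distinguish it from replicability
Apply evidence hashing and proof-carrying numbers
Use reproducibility checklists for meta-analysis
Recognize the role of pre-registration and open data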

A graduate student opened a spreadsheet

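and found that the age of austerity was built on an error.

In 2010, Reinhart and Rogoff claimed that countries with debt-to-GDP ratios above 90% experienced negative growth. The claim influenced austerity policy across Europe. In 2013, Thomas Herndon found an error in the Excel spreadsheet that left 5 countries out of the average. The corrected result: modest positive growth, not collapse.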

Reproducibility vs Replicability

Reproducible

Same data + same code = same result

Replicable

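New data + same method = consistent result

Reproducibility is the minimum standard. If others cannot reproduce your pooled estimate from the data you report, the analysis cannot be verified. A meta-analysis should share its extracted data, analysis scripts, software versions, and random seeds.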

Researcher

Proof-Carrying Numbers

Every number in a meta-analysis should carry its provenance: where it came from, how it was transformed, and the code that produced it. Evidence hashing creates a cryptographic fingerprint of the inputs so that any change (accidental or deliberate) is detectable.

SHA

Input Hash

A SHA-256 hash of the extracted data. If a single cell changes, the hash changes. Provenance chain: data → code → result → hash.
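One possible way to implement an input hash is sketched below, using Python's standard `hashlib` and `json` modules. The `evidence_hash` helper, the canonical-JSON serialization, and the example rows are assumptions made for illustration; the course does not prescribe a particular format.

```python
import hashlib
import json

def evidence_hash(rows):
    """SHA-256 fingerprint of extracted data, independent of dictionary key order."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical extraction sheet: one dictionary per study.
extracted = [
    {"study": "Trial A", "events_t": 12, "n_t": 200, "events_c": 25, "n_c": 200},
    {"study": "Trial B", "events_t": 30, "n_t": 450, "events_c": 41, "n_c": 455},
]

original = evidence_hash(extracted)

# Changing a single cell produces a different fingerprint.
tampered = [dict(row) for row in extracted]
tampered[0]["events_t"] = 13
print(original == evidence_hash(tampered))  # False
```

Storing the hash next to each reported result lets a reader confirm that the numbers were computed from exactly the data that were shared.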

Interactive: Reproducibility Checklist

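Tick each item to assess the reproducibility of a meta-analysis. How does your review score?

The Excel Error That Changed Economies

Reinhart and Rogoff's “Growth in a Time of Debt” was cited in congressional testimony, European Commission reports, and IMF policy briefs. The Excel error (rows 30–34 were left out of the AVERAGE formula) meant that five countries (Australia, Austria, Belgium, Canada, and Denmark) were missing entirely.

The corrected mean changed from −0.1% to +2.2%. Austerity policies affected millions of people. Reproducibility is not academic perfectionism; it is a safeguard against disaster.

Remember Module 5?

DECREASE Through the Lens of Reproducibility

Don Poldermans' DECREASE trials were retracted for data fabrication. Had proof-carrying numbers existed (hashed inputs, provenance chains, verified computations), the fabrication would have been detectable before the evidence entered meta-analyses and changed surgical guidelines.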

Module 22 Quiz

Q1. What was the Reinhart-Rogoff error?

A. They used too small a sample

B. An Excel formula excluded 5 countries, reversing the conclusion

C. They studied the wrong time period

D. They used the wrong statistical test

Module 22 Complete

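“A number without provenance is not a number. An analysis without reproducibility is not evidence.”

Certainty must be earned, not assumed.

Module 23: Your First Meta-Sprint

Certainty must be earned, not assumed.

Module 23: Your First Meta-Sprint

🎯 Learning Objectives

Understand the 40-day systematic review workflow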
Map the Seven Principles to real practice phases
Recognize Definition-of-Done (DoD) gates as quality checkpoints
Appreciate why structure prevents the failures you've studied
Graduate ready to conduct (not just understand) meta-analysis

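The Journey Is Complete

You have learned the stories.

Now you must walk the path.

Every evidence reversal you studied happened because teams knew the methods but did not follow them systematically.

The META-SPRINT Framework

A structured 40-day workflow with 5 phase gates. Each gate is a Definition-of-Done (DoD) checkpoint that stops you from moving forward until quality is assured.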

40

Days to Completion

5

DoD Phase Gates

Day 34

Hard Freeze

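Why 40 days? Long enough for rigor, short enough to prevent scope creep. The rosiglitazone cardiac signal stayed hidden for years because no deadline forced transparency.

The Five Gates

The Five Phase Gates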

A

DoD-A: Protocol Lock (Days 1-3)

PICOS defined, timepoint rules set, model choices pre-specified. No moving target.

B

DoD-B: Search Lock (Days 6-10)

All databases searched, grey literature checked, PRESS validated. No hidden studies.

C

DoD-C: Extraction Lock (Days 10-28)

Dual extraction, provenance linked, RoB assessed. No fabricated numbers.

The Five Phase Gates (continued)

D

DoD-D: Analysis Lock (Days 21-33)

Forest plots generated, sensitivity analyses run, heterogeneity explored. No cherry-picking.

E

DoD-E: Submission Lock (Days 33-40)

GRADE certainty rated, clinical summary written, manuscript finalized. No overconfidence.

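Day 34 Freeze: No new studies can be added after Day 34. This prevents the “weaponized scope creep” that plagued the BMP spine surgery meta-analyses, where industry kept “finding” favorable studies.

The Seven Principles in Practice

Every principle you learned maps to a specific phase gate:

DoD-A "Not every signal is real": Pre-specify what counts as evidence

DoD-B "What was hidden in plain sight?": Search comprehensively

DoD-C "A number without provenance is not a number": Link every data point

DoD-D "Heterogeneity is a message, not noise": Investigate, don't ignore

DoD-E "Certainty must be earned, not assumed": GRADE everything

The Red-Team Principle

Your own team tries to break your review.

Every day, two rotating team members spend 12 minutes acting as adversaries, checking data quality. This is how Boldt's fraud was caught: not through friendly review, but through skeptical checks that spotted impossible recruitment rates.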

CondGO: When Things Go Wrong

What happens when you discover a critical problem mid-sprint?

CondGO = Conditional Go

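A bounded rescue protocol. You have exactly 72 hours to resolve the problem using only permitted actions. If you cannot fix it, you must stop the review.

📖 The Avandia lesson: GSK saw the cardiovascular signal in 2000, but there was no enforced deadline. They “watched and waited” for 7 years. Tens of thousands of people were harmed. CondGO exists because “we'll deal with it eventually” kills people.

You began this course with stories.

You end it ready to practice.

The META-SPRINT workflow builds everything you have learned into a 40-day system designed to prevent the failures you have studied.

When you are ready to conduct a real systematic review, open the META-SPRINT app. The stories you learned here will guide you, appearing as reminders at every step.

Story: The CTT Collaboration, When Methods Save Millions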

What does it look like when every principle is followed?

REAL DATA

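The Cholesterol Treatment Trialists' (CTT) Collaboration is the gold standard of meta-analysis. They obtained individual patient data from more than 170,000 participants across 26 statin trials. Pre-specified protocol. IPD from all major trials. Standardized outcomes. Result: statins reduce major vascular events by 21% per mmol/L LDL reduction (RR 0.79, 95% CI 0.77-0.81), regardless of baseline risk. This finding, replicated across 5 meta-analyses over 15 years, has prevented an estimated millions of heart attacks and strokes worldwide.

The Seven Principles Applied

The CTT story shows what happens when every principle in this course is followed. Consider the alternative:

Path A: No Principles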

No protocol. Published data only. No RoB. No heterogeneity investigation. No GRADE.

↓

Conflicting small trials. Statin controversy persists. Millions untreated.

OUTCOME: Preventable cardiovascular deaths continue

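Path B: The CTT Way

Pre-registered protocol. IPD from all trials. Standardized outcomes. Transparent methods. GRADE high certainty.

↓

A clear answer. Guidelines change worldwide. Statins are prescribed to those who benefit.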

OUTCOME: Millions of lives saved by rigorous evidence synthesis

THE REVELATION

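Every principle in this course exists because its absence caused harm. The CTT Collaboration proves that when methods are rigorous, data have provenance, bias is assessed, and certainty is earned, meta-analysis becomes the most powerful tool in medicine. You now hold these principles. Use them.

Capstone Quiz

1. What is the purpose of the Day 34 “hard freeze” in META-SPRINT?

A. To allow time for peer review

B. To prevent late-added studies from manipulating the results

C. To speed up publication

D. To coordinate with journal deadlines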

2. The CondGO protocol gives teams how long to fix critical problems?

A. 24 hours

B. 48 hours

C. 72 hours

D. 1 week

3. Red-team adversarial QA caught Joachim Boldt's fraud by noticing:

A. Impossible patient recruitment rates

B. p-hacking in statistical tests

C. Inconsistent effect sizes

D. Whistleblower testimony

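The stories you have learned are not history.

They are warnings that protect your future work.

When you conduct your first meta-analysis,
remember CAST before you trust a signal,
remember Poldermans before you skip provenance,
remember reboxetine before you ignore the funnel.

You are ready now. Follow the structure. Go with humility. Follow the Seven Principles.

Not every signal is real.

Module 24: Final Exam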

Certainty must be earned, not assumed.

Final Examination

Final Exam: Part 1 of 2

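Test your command of meta-analysis principles. Each question addresses one core concept from the course.

Q1. A researcher wants to study “the effect of exercise on health”. What is the main problem with this research question?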

A. It lacks randomization

B. Sample size is too small

C. It is not answerable—lacks specific PICO elements

D. It lacks ethical approval

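Q2. A funnel plot shows clear asymmetry, with studies missing from the lower-left region. What does this suggest?

A. Large studies have more precise estimates

B. Small negative studies may be unpublished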

C. The true effect is stronger than estimated

D. Random sampling error

Q3. A meta-analysis reports I² = 85% and τ² = 0.42. What is the most appropriate interpretation?

A. There is an 85% chance of a true effect

B. The effect size is very large

C. Substantial between-study variance exists; investigate sources

D. The results are clinically important

Q4. In GRADE, what is the starting certainty for evidence from randomized controlled trials?

A. High

B. Moderate

C. Low

D. Very low

Q5. In RoB 2.0, which domain assesses whether outcome assessors knew the treatment allocation?

A. D1: Randomization process

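B. D2: Deviations from intended interventions

C. D3: Missing outcome data

D. D4: Measurement of the outcome

Final Exam: Part 2 of 2

Q6. The CAST trial showed that antiarrhythmic drugs increased mortality despite suppressing arrhythmias. This is an example of: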

A. Random sampling error

B. Surrogate outcome failure

C. Confounding by indication

D. Reverse causation

Q7. When should a random-effects model be preferred over a fixed-effect model?

A. When sample sizes are large

B. When the outcome is binary

C. When between-study heterogeneity is expected

D. When publication bias is suspected

Q8. According to ICEMAN criteria, which makes a subgroup analysis MORE credible?

A. Hypothesis specified a priori

B. Large number of subgroups tested

C. No biological rationale

D. Inconsistent effects across trials within subgroup

Q9. What assumption must be checked in network meta-analysis to ensure valid indirect comparisons?

A. All studies have equal sample sizes

B. All studies measure the same outcome

C. Transitivity (consistency of effect modifiers)

D. Double-blinding in all trials

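Q10. In trial sequential analysis (TSA), what does crossing the futility boundary indicate?

A. The treatment causes harm

B. Further studies are unlikely to show a meaningful effect

C. The evidence of benefit is conclusive

D. The meta-analysis is underpowered

Part 1 Complete: continue to Part 2 (Advanced Modules)

Part 2: Advanced Module Questions (Q11–Q20)

Final Exam: Part 2 of 2 (Advanced)

Questions 11–20 cover Modules 13–22 (Bayesian, NMA, IPD, Dose-Response, Fragility, Equity, AI, Qualitative, Multivariate, Reproducibility).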

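Q11. In a Bayesian meta-analysis, what happens when you use a vague prior with many studies?

A. The posterior closely matches the frequentist result

B. The prior dominates the posterior

C. The credible interval becomes infinitely wide

D. The model fails to converge

Q12. In Cipriani's NMA of antidepressants, why was no single drug declared the “winner”?

A. There were too few studies

B. Different drugs ranked best on different outcomes

C. No indirect evidence was available

D. SUCRA could not be calculated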

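Q13. Why should you never pool IPD as if it came from one large trial?

A. IPD always has fewer studies than aggregate

B. It ignores clustering by study and introduces confounding

C. It cannot handle time-to-event data

D. Binary outcomes cannot be pooled

Q14. What caused the alcohol "J-curve" to disappear in Stockwell's reanalysis?

A. New studies showing no benefit were added

B. Former drinkers were correctly removed from the abstainer reference group

C. The sample size increased

D. Better adjustment for confounders

Q15. In the oseltamivir saga, what did Cochrane find when it gained access to the unpublished clinical study reports?

A. The drug was completely ineffective

B. The effect was larger than originally thought

C. The benefit for complications largely disappeared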

D. Side effects were more common than reported

Q16. What percentage of US adults with hypertension would not have been eligible for the SPRINT trial?

A. About 25%

B. About 50%

C. Over 75%

D. Nearly 100%

Q17. Why is AI considered an "augmenter" rather than a "replacer" in systematic reviews?

A. AI is slower than human reviewers

B. AI has perfect recall

C. AI screens fast but cannot make human-level contextual judgments

D. AI is too expensive for most reviews

Q18. What does the "adequacy" component of CERQual assess?

A. The number of studies only

B. The richness and quantity of data supporting the finding

C. The consistency of the findings

D. Generalizability to other populations

Q19. A meta-analysis includes 30 statin trials, each reporting 4 correlated outcomes (120 effect sizes). Which approach is correct?

A. Treat all 120 as independent effect sizes

B. Use RVE with small-sample correction

C. Pick only one outcome per study

D. Average the 4 outcomes within each study

Q20. In the Reinhart-Rogoff error, what was the corrected mean growth rate for high-debt countries?

A. −0.1% (same as claimed)

B. +2.2%

C. 0%

D. +5%

Passing Score: 15/20 across both parts

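Return to the relevant modules to review any questions you missed. Each question tests one core concept.

Not every signal is real.

Methods protect patients from our confidence.

Congratulations

You have completed Evidence Reversals: A Meta-Analysis Course.

May your syntheses be guided by truth, your pooling by wisdom,
and your conclusions drawn with humility.

The Seven Principles:

“Not every signal is real.”

“Methods protect patients from our confidence.”

"What was hidden in plain sight?"

“A number without provenance is not a number.”

“Heterogeneity is a message, not noise.”

“Absence of evidence is not evidence of absence.”

"Certainty must be earned, not assumed."

“Guide us to the straight path...”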
