Not every signal is truth.

🎯 Learning Objectives

  • Define meta-analysis and explain its role in evidence synthesis
  • Identify when studies should NOT be pooled
  • Describe the evidence hierarchy and where systematic reviews sit
  • Recognize that meta-analysis can mislead when done poorly
  • Recall the Seven Principles that anchor this course

This course exists because

medicine was wrong.

Not once. Not rarely. Repeatedly. In ways that killed patients who trusted that the evidence was sound.

Meta-analysis: a statistical method for combining results from multiple independent studies addressing the same question.

1976
Term coined by Gene Glass
~50,000
Meta-analyses published per year
#1
Evidence hierarchy*

*When well conducted. Quality of conduct matters more than study design alone — as GRADE recognizes.

1

Increase Statistical Power

Individual studies may be too small to detect effects.

2

Improve Precision

Narrower confidence intervals around effect estimates.

3

Resolve Disagreement

When studies conflict, pooling can clarify the signal.

4

Explore Heterogeneity

Identify why effects differ across populations or settings.

But meta-analysis can also

MISLEAD

When done poorly, it amplifies bias rather than truth. Do not pool when:

1

Studies measure fundamentally different things (apples and oranges)

2

Extreme heterogeneity that cannot be explained

3

One study dominates all others (megastudy problem)

4

Studies have high risk of bias that cannot be adjusted for

Pooling is a privilege, not a right.

The decision to combine must be defended.

Systematic Reviews & Meta-Analyses of RCTs

Randomized Controlled Trials

Cohort Studies

Case-Control Studies

Case Series / Expert Opinion

Position in hierarchy depends on methodology quality, not study type alone.

This course teaches through

evidence reversals.

Each module opens with a story of how medicine got it wrong. Then we learn the method that would have prevented the harm.

These phrases will return throughout your journey:

1. "Not every signal is truth."

2. "Methods protect patients from our confidence."

3. "What was hidden in plain sight?"

4. "The number without provenance is not a number."

5. "Heterogeneity is a message, not noise."

6. "Absence of evidence is not evidence of absence."

7. "Certainty must be earned, not assumed."

1. Why should you sometimes NOT pool studies in a meta-analysis?

A. Pooling is always better than single studies
B. When heterogeneity is extreme or studies measure different things
C. Pooling is always appropriate for RCTs
D. Statistical methods handle any situation

2. Where do systematic reviews of RCTs sit in the evidence hierarchy?

A. At the top
B. Same level as individual RCTs
C. Below cohort studies
D. Same as expert opinion

Begin the journey.

Module 1: The Question

Not every signal is truth.

This is not a story about error.

It is a story about certainty.

🎯 Learning Objectives

  • Formulate a focused PICO question for a systematic review
  • Distinguish surrogate outcomes from patient-important outcomes
  • Explain why biological plausibility alone is insufficient evidence
  • Describe the CAST trial and its implications for evidence-based medicine
  • Apply the principle: "Not every bright sign is guidance"

~9,000 excess deaths per year

From a treatment everyone believed worked.

This is the story of how we believed - and how we were wrong.

Patients with frequent premature ventricular contractions (PVCs) after myocardial infarction (MI) had 2-5x higher mortality.

400,000+
MI survivors/year
~40%
with significant PVCs
160,000
at elevated risk

A massive clinical need. A clear target.

Antiarrhythmic drugs were developed, FDA approved,
and prescribed to ~200,000 patients per year.

No villain appears in this story.

Everyone acted on the best evidence available.

PREMISE 1

PVCs after MI predict sudden cardiac death

PREMISE 2

Antiarrhythmic drugs suppress PVCs

PREMISE 3

Suppressing PVCs should prevent sudden death

The chain was logical. The conclusion felt inevitable.

Finally, someone asked: "Does suppressing PVCs actually save lives?"

Design
Randomized, double-blind, placebo-controlled
Population
Post-MI patients with asymptomatic PVCs
Intervention
Encainide, flecainide, or moricizine vs placebo
Run-in
Only patients with ≥80% PVC suppression randomized
Primary endpoint
Death or cardiac arrest with resuscitation
Sample size
1,498 patients (encainide/flecainide arms)

Data Safety Monitoring Board stops the trial early.

Outcome Drug (n=755) Placebo (n=743)
Arrhythmic deaths 33 9
All cardiac deaths 43 16
Total deaths 56 22
Death rate 7.4% 3.0%
Relative Risk of Death: 2.5
95% CI: 1.6 - 4.5 | p < 0.001

The drugs that perfectly suppressed arrhythmias increased mortality by 150%.

The Human Cost

Before CAST, ~200,000 Americans per year received these drugs.

~9,000

excess deaths per year - possibly more

Vietnam War: ~6,000 US deaths/year • These drugs: ~9,000+ deaths/year

For every number, a name we will never know.

Look again.

PREMISE 1

PVCs after MI predict sudden cardiac death

PREMISE 2

Antiarrhythmic drugs suppress PVCs

← THE LEAP
PREMISE 3

Suppressing PVCs should prevent sudden death

The assumption that suppressing the marker would fix the outcome was never tested.

1

PVCs were a marker of damaged tissue, not a cause of death

2

The drugs had proarrhythmic effects - triggering deadlier rhythms

3

The surrogate improved while the outcome worsened - a dissociated surrogate

The surrogate did not lie. We asked it the wrong question.

Every answerable clinical question has four components:

P - POPULATION
Who are the patients? What are their characteristics?
I - INTERVENTION
What treatment or exposure is being evaluated?
C - COMPARATOR
What is the alternative? Placebo? Standard care?
O - OUTCOME
What matters to patients? Hard endpoints vs surrogates.
CAST PICO
Post-MI patients with PVCs | Antiarrhythmics | Placebo | Mortality
🔍

Investigation Exercise: The Evidence Before CAST

You are a cardiologist in 1988. A patient has survived an MI but has frequent PVCs. The observational literature is clear...

Study | Patients with PVCs | Mortality Risk
Lown (1977) | High-grade PVCs | 2.4x higher
Bigger (1984) | >10 PVCs/hour | 3.1x higher
Mukharji (1984) | Complex PVCs | 4.8x higher

The signal is clear. The mechanism is plausible. Would you prescribe antiarrhythmics?

Before: Observational Logic

PVCs → Higher mortality

Drugs suppress PVCs

∴ Drugs should reduce mortality

After: CAST RCT (1989)

Death rate on drug: 7.4%

Death rate on placebo: 3.0%

RR = 2.5 (150% increase in deaths)

The surrogate improved. The patients died. This is why we ask: "What is the outcome that matters?"

1

Biological plausibility is not proof

A logical mechanism doesn't guarantee the expected effect.

2

Surrogate endpoints can mislead

Improving a biomarker doesn't prove improvement in outcomes.

3

Randomized trials provide the strongest causal evidence

Observational data alone rarely establishes causation for interventions due to confounding.

4

Consensus is not evidence

200,000 prescriptions, FDA approval, and guidelines were all wrong.

This is why we do meta-analysis: to see past apparent truths.

What if the question you ask determines who lives and who dies?

REAL DATA

In 1989, cardiologists knew that PVC suppression was achievable with encainide and flecainide. The surrogate endpoint looked perfect: drugs suppressed PVCs by 80%+. But CAST randomized 1,498 patients to active drug vs placebo. The trial was stopped early: 56 deaths in the drug group vs 22 in placebo. Mortality increased 2.5-fold. An estimated ~9,000 excess American deaths per year were attributable to these drugs.

The Cardiologist's Choice: 1987
Your post-MI patient has frequent PVCs. You have drugs that suppress them completely. What do you do?
PATH A: Treat the Surrogate
Prescribe encainide — PVCs vanish, the ECG looks clean
The biomarker improves. You feel confident. The patient dies.
OUTCOME: An estimated 50,000+ excess deaths across the US during years of use
PATH B: Demand a Mortality Trial
Insist: "Show me that survival improves, not just the ECG"
The trial reveals harm. The drugs are withdrawn. Lives are saved.
OUTCOME: The right PICO question prevents a catastrophe
THE REVELATION
The question was never "Can we suppress PVCs?" It was "Does PVC suppression save lives?" A surrogate endpoint answered the wrong question. The right PICO would have demanded mortality as the outcome from the start.

What appears certain may be wrong.

What everyone believes may be false.

Methods exist so patients do not pay for our confidence.

This is why you are here.

1. What was the fundamental error in the antiarrhythmic logic?

A. The trials were not randomized
B. Treating a surrogate (PVCs) was assumed to improve outcomes
C. The sample size was too small
D. The FDA approval was rushed

2. In PICO, what does the "O" stand for and why does it matter?

A. Observation - what researchers see
B. Objective - the research goal
C. Outcome - what matters to patients
D. Organization - study structure

Not every signal is truth.

Methods protect patients from our confidence.

What was hidden in plain sight?

This is a story about

observational evidence.

🎯 Learning Objectives

  • Explain why protocol pre-registration prevents bias
  • Identify key elements of a PROSPERO registration
  • Distinguish healthy user bias from true treatment effects
  • Describe why observational studies overestimated HRT benefits
  • Apply the principle: "Methods protect patients from our confidence"

observational studies

All showing hormone replacement therapy protected postmenopausal women from heart disease.

The evidence seemed overwhelming. The conclusion seemed certain.

122,000 nurses followed for decades. HRT users had 40-50% lower cardiovascular mortality.

RR 0.56
Cardiovascular mortality
122,000
Women followed
20+ years
Follow-up

Landmark study. Impeccable methodology. Wrong conclusion.

1

Healthy User Bias: Women who chose HRT were healthier, wealthier, better educated

2

Compliance Bias: Women who took HRT consistently also took better care of themselves

3

Prescriber Bias: Doctors gave HRT to healthier women with fewer risk factors

The treatment wasn't protecting them. They were already protected.

The largest randomized trial of HRT ever conducted.

Design
Randomized, double-blind, placebo-controlled
Population
Postmenopausal women aged 50-79
Intervention
Estrogen + Progestin vs Placebo
Sample size
16,608 women
Primary endpoint
Coronary heart disease
Planned duration
8.5 years

Trial stopped early after 5.2 years. Harm exceeded benefits.

Outcome Hazard Ratio Direction
Coronary heart disease 1.29 HARM
Stroke 1.41 HARM
Breast cancer 1.26 HARM
Pulmonary embolism 2.13 HARM
Complete Reversal
30 years of observational evidence overturned

The Lesson

PRE-SPECIFY

A protocol written before the search begins prevents fishing, prevents bias, prevents hindsight distortion.

What if the treatment works—but only for some?

REAL DATA

WHI showed HRT increased cardiovascular events overall. But later analyses revealed a critical pattern: women who started HRT within 10 years of menopause had REDUCED cardiovascular risk. Women starting 20+ years after menopause had INCREASED risk. The overall null/harm result hid a timing effect.

The Analyst's Dilemma
You are analyzing WHI subgroups. The overall result shows harm. Do you dig deeper?
PATH A: Report Overall Only
Conclude HRT is harmful for all postmenopausal women
Simple message. Guidelines recommend against HRT universally.
OUTCOME: Deny potential benefit to younger menopausal women
PATH B: Pre-Specify Timing Subgroups
Analyze by years since menopause (biologically plausible)
Discover the "timing window" for safe HRT initiation.
OUTCOME: Enable personalized recommendations
THE REVELATION
Subgroup analysis is dangerous when fishing. It is essential when biology predicts effect modification. The timing hypothesis was biologically plausible—and should have been pre-specified.
1

Register Before You Search

PROSPERO: International prospective register of systematic reviews

2

Lock Your Decisions

PICO, search strategy, outcomes, analysis plan - all pre-specified

3

Document Amendments

Changes are allowed but must be transparent and justified

4

Prevent Duplication

Check if your review already exists before starting

1. Why did the Nurses' Health Study show benefit from HRT that WHI did not?

A. Nurses' Health had too few patients
B. Healthy user bias in observational studies
C. Nurses' Health had shorter follow-up
D. Different hormone formulations were used

2. What is the primary purpose of PROSPERO registration?

A. To register clinical trials
B. To speed up review completion
C. To pre-specify methods and prevent bias
D. To obtain funding for reviews

Pre-specification is not bureaucracy.

It is protection.

Against our own tendency to find what we expect.

Methods protect patients from our confidence.

What was hidden in plain sight?

Module 3: The Search

What was hidden in plain sight?

This is a story about

what they didn't publish.

🎯 Learning Objectives

  • Develop a comprehensive search strategy using PRESS guidelines
  • Search multiple databases including grey literature sources
  • Identify trial registries and regulatory databases (ClinicalTrials.gov, FDA)
  • Explain how the rosiglitazone case exposed hidden cardiovascular harms
  • Apply the principle: "What was hidden in plain sight?"

annual sales at peak

Avandia (rosiglitazone) was one of the world's best-selling diabetes drugs.

The published trials looked reassuring. The unpublished ones told a different story.

Published trials showed rosiglitazone effectively lowered HbA1c. Cardiovascular outcomes were rarely reported.

1999
FDA approval
6M+
Patients treated
~0.7%
HbA1c reduction

The surrogate looked good. But what about actual cardiovascular events?

Dr. Steven Nissen obtained unpublished trial data from GSK's own website.

GSK had been required by legal settlement to post clinical trial results online. Nissen and Wolski analyzed 42 trials - many never published in journals.

The data was technically public.

No one had systematically searched for it.

Outcome Odds Ratio 95% CI
Myocardial Infarction 1.43 1.03 - 1.98
CV Death 1.64 0.98 - 2.74
43% Increased Risk of Heart Attack
p = 0.03 for myocardial infarction

Published in NEJM. The FDA called an emergency advisory committee meeting.

The FDA Advisory Committee: July 2007

20-3
Voted: CV risk exists
22-1
Keep on market with warnings

The committee was divided. Some wanted it withdrawn. Some called the meta-analysis flawed.

But the signal could not be unseen.

1

Black box warning added for heart failure risk (2007)

2

Severe restrictions on prescribing in the US (2010)

3

Withdrawn from European market entirely (2010)

4

FDA now requires cardiovascular outcome trials for all diabetes drugs

PUBLISHED
PubMed, Embase, CENTRAL, Web of Science
GREY LITERATURE
Conference abstracts, dissertations, regulatory docs
TRIAL REGISTRIES
ClinicalTrials.gov, WHO ICTRP, EU CTR
REGULATORY
FDA, EMA, Health Canada submissions
COMPANY DATA
GSK, Pfizer, Roche clinical trial registries
HAND SEARCH
Reference lists, contact authors, experts

Peer Review of Electronic Search Strategies

1

Translation of Research Question

Does the search reflect the PICO elements?

2

Boolean and Proximity Operators

Are AND, OR, NOT used correctly?

3

Subject Headings

Are MeSH/Emtree terms appropriate and exploded?

4

Text Words

Synonyms, spelling variants, truncation?

5

Spelling, Syntax, Line Numbers

Are there errors that would cause retrieval failures?

6

Limits and Filters

Are date, language, study design limits appropriate?

Peer-reviewed searches substantially improve retrieval of key studies.

PRESS guideline: McGowan et al., 2016

The same search must be adapted for each database:

PubMed

"diabetes mellitus, type 2"[MeSH] OR "type 2 diabetes"[tiab]

Embase

'non insulin dependent diabetes mellitus'/exp OR 'type 2 diabetes':ti,ab

Subject headings, field tags, and operators differ between databases.

What happens when you search — and find nothing?

REAL DATA

Governments stockpiled $9 billion of oseltamivir (Tamiflu) for pandemic flu. The Cochrane Collaboration tried to review the evidence. Of 77 clinical trials, full reports existed for only 20. Roche refused to share data for 5 years. When the BMJ and Cochrane finally obtained over 160,000 pages of clinical study reports, they found: Tamiflu reduced symptoms by less than 1 day, with no evidence it prevented hospitalizations or complications.

The Reviewer's Dilemma: 2009
You are updating a Cochrane review of Tamiflu. Published trials look positive. But 57 trials have no accessible full reports. What do you do?
PATH A: Analyze What's Published
Use the 20 available trials. Conclude Tamiflu is effective.
Your review supports continued stockpiling. $9 billion spent on weak evidence.
OUTCOME: Billions wasted, true efficacy unknown
PATH B: Demand Complete Data
Refuse to publish until all trial data is accessible
5-year campaign. 160,000+ pages finally obtained. Truth emerges.
OUTCOME: Evidence policy changed; EMA now publishes all trial reports
THE REVELATION
A search is only as good as what is findable. When the grey literature is hidden behind corporate walls, even the most comprehensive PubMed search will miss the truth. The Tamiflu saga changed global policy: the EMA now publishes clinical study reports for all medicines.

If Nissen had searched only PubMed,

the signal would have remained hidden.

Comprehensive search is survival.

What was hidden in plain sight?

1. What type of evidence source revealed the rosiglitazone cardiovascular signal?

A. Published journal articles
B. Cochrane Library
C. Company clinical trial registry
D. FDA approval documents

2. What does PRESS stand for?

A. Publication Review of Evidence Search Standards
B. Peer Review of Electronic Search Strategies
C. Protocol for Reporting Evidence Synthesis Studies
D. Primary Research Evidence Search System

What was hidden in plain sight?

Module 4: The Screening

The number without provenance is not a number.

This is a story about

what they chose to report.

🎯 Learning Objectives

  • Apply PRISMA flow diagram to document study selection
  • Implement dual-reviewer screening with conflict resolution
  • Identify selective outcome reporting and data manipulation
  • Calculate inter-rater reliability (Cohen's kappa)
  • Apply the principle: "The number without provenance is not a number"

88,000+ heart attacks attributed to Vioxx

A blockbuster drug. A hidden signal. A preventable catastrophe.

Between 1999 and 2004, millions took this painkiller. Some never came home.

Rofecoxib (Vioxx) was a COX-2 selective NSAID. Marketed as safer for the stomach than traditional painkillers.

1999
FDA approval
$2.5B
Peak annual sales
80M+
Patients prescribed

Vioxx Gastrointestinal Outcomes Research

Design
Randomized, double-blind
Comparison
Vioxx vs Naproxen
Population
Rheumatoid arthritis
Sample
8,076 patients
Primary Outcome
GI events
Published
NEJM, November 2000
GI Outcome Vioxx Naproxen
Confirmed GI events 2.1 per 100 pt-yrs 4.5 per 100 pt-yrs
Reduction 54% fewer GI events

This is what doctors were told. This is what patients believed.

CV Outcome Vioxx Naproxen
Myocardial Infarction 20 events 4 events
Relative Risk 5x higher in Vioxx group
5-fold Increase in Heart Attacks
Mentioned only briefly, attributed to naproxen being "cardioprotective"
1

Data cutoff manipulation: 3 additional heart attacks occurred after the cutoff used in publication

2

Spin: CV signal was explained away as naproxen being cardioprotective (no evidence)

3

Outcome switching: CV events were pre-specified but not emphasized

4

Internal knowledge: Merck emails show they knew about the signal

The APPROVe Trial (2004)

A trial for colorectal polyp prevention - stopped early for safety.

RR 1.92
CV events vs placebo
Sept 2004
Vioxx withdrawn

Four years after VIGOR showed a 5x risk. Four years too late.

Have you considered what happens when a signal hides in the noise?

REAL DATA

Vioxx (rofecoxib) was approved in 1999. By 2004, estimates suggest 88,000-140,000 excess heart attacks and 30,000-40,000 deaths. Merck's own VIGOR trial showed 5x cardiovascular risk in 2000—but it was dismissed as a "naproxen cardioprotective effect."

The Fork in the Road
You are an FDA reviewer in 2001. VIGOR data shows 5x heart attack risk with Vioxx vs naproxen.
PATH A: Accept the Explanation
Believe Merck's hypothesis: naproxen is cardioprotective
No additional safety studies required. Drug stays on market at full speed.
OUTCOME: 40,000+ deaths over 4 years
PATH B: Demand Evidence
Require a dedicated CV safety trial before continued marketing
Delay or restrict marketing until cardiovascular safety is established.
OUTCOME: Signal detected early, lives saved
THE REVELATION
The signal was there in 2000. The wrong explanation delayed action by 4 years. An alternative hypothesis—accepted without evidence—cost tens of thousands of lives.

Every step of screening must be documented and transparent.

Identification
Records from databases + other sources
Screening
Title/abstract review (duplicates removed)
Eligibility
Full-text assessment (with exclusion reasons)
Included
Studies in synthesis
1

Reduces Selection Bias

One reviewer might unconsciously favor certain studies

2

Catches Errors

Fatigue, misreading, and mistakes are inevitable

3

Forces Explicit Criteria

Disagreements reveal ambiguity in inclusion rules

Typical agreement: κ = 0.6-0.8

Disagreements resolved by discussion or third reviewer
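The objectives call for calculating Cohen's kappa, so here is a minimal Python sketch of that calculation from a 2x2 agreement table; the counts are hypothetical, chosen only to illustrate.

```python
def cohens_kappa(both_include, a_only, b_only, both_exclude):
    """Cohen's kappa from a 2x2 table of two reviewers' include/exclude calls."""
    n = both_include + a_only + b_only + both_exclude
    observed = (both_include + both_exclude) / n   # raw agreement
    p_a = (both_include + a_only) / n              # P(reviewer A includes)
    p_b = (both_include + b_only) / n              # P(reviewer B includes)
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)   # agreement expected by chance
    return (observed - expected) / (1 - expected)

# Hypothetical calibration set of 100 records:
# 18 both include, 6 A only, 4 B only, 72 both exclude
print(round(cohens_kappa(18, 6, 4, 72), 2))  # -> 0.72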

Before screening thousands of records, reviewers should calibrate on a sample of 50-100 records.

1

Screen the same set independently

2

Compare decisions and discuss disagreements

3

Refine inclusion criteria until κ > 0.7

4

Document the calibration process and any rule changes

New in 2020
Separate reporting of database vs register searches
New in 2020
Automation tools must be reported
New in 2020
Citation searching documented separately
New in 2020
Reasons for exclusion at full-text mandatory

PRISMA 2020 substantially revised the checklist with expanded reporting on synthesis methods, certainty assessment, and protocol registration.

If Vioxx's cardiovascular data had been screened by independent reviewers,

if all pre-specified outcomes had been required to be reported,

88,000 heart attacks might have been prevented.

The number without provenance is not a number.

1. In the VIGOR trial, what was the relative risk of MI in the Vioxx group compared to naproxen?

A. 1.5x higher
B. 2x higher
C. 5x higher
D. 10x higher

2. Why is dual screening (two independent reviewers) important?

A. It makes screening faster
B. It reduces selection bias and catches errors
C. It reduces the number of studies to review
D. It allows reviewers to skip full-text review

The number without provenance is not a number.

Module 5: The Extraction

The number without provenance is not a number.

This is a story about

numbers that never existed.

🎯 Learning Objectives

  • Design a standardized data extraction form with provenance fields
  • Calculate effect sizes from various reported statistics (OR, RR, HR, SMD)
  • Implement dual-extraction with discrepancy resolution
  • Identify red flags for data fabrication and misconduct
  • Explain how the DECREASE fraud affected clinical guidelines

possible excess deaths in Europe

From guidelines based on fabricated clinical trial data.

The DECREASE trials influenced perioperative care worldwide. The data was invented.

Professor at Erasmus Medical Center, Rotterdam. Author of over 500 papers. Lead author of ESC guidelines on perioperative cardiac care.

500+
Publications
DECREASE
Trial series I-VI
ESC
Guideline chair

A seemingly unimpeachable source. Until someone looked at the data.

Trial | Finding | Impact
DECREASE-I (1999) | 90% reduction in cardiac death | Changed guidelines
DECREASE-IV (2009) | Beta-blockers safe in low-risk | Expanded recommendations

Effect sizes were implausibly large.

90% reduction? Almost nothing in medicine works that well.

1

Erasmus MC investigated after whistleblower complaints

2

Fabricated patient data: Patients who didn't exist or weren't enrolled

3

No informed consent: Many "participants" never consented

4

Poldermans dismissed: From Erasmus MC in 2011

The Cascade of Harm

When DECREASE was removed from meta-analyses...

Benefit → Harm
Direction reversed
27% ↑
Stroke risk increase

The POISE trial (2008) had shown harm. It was dismissed because it conflicted with DECREASE.

1

Trust in authority: Poldermans was the guideline author reviewing his own evidence

2

No data verification: No one asked for individual patient data

3

Publication prestige: Published in top journals, assumed valid

4

Implausible effects accepted: 90% reductions should raise suspicion

1

Dual Extraction

Two extractors independently - catches transcription errors and forces scrutiny

2

Record Provenance

Table, page, paragraph - every number traceable to source

3

Verify Against Registry

ClinicalTrials.gov results vs publication - discrepancies are red flags

4

Request IPD

Individual patient data reveals what aggregate summaries hide

During extraction, you calculate effect sizes from reported data:

BINARY OUTCOMES

Odds Ratio, Risk Ratio, Risk Difference from 2x2 tables

CONTINUOUS OUTCOMES

Mean Difference, Standardized Mean Difference from means and SDs

Always extract from the most reliable source.

Prefer: ITT results > per-protocol > subgroups
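As a worked illustration of the calculations above, here is a minimal Python sketch; the binary example reuses the CAST counts from Module 1, while the continuous numbers are hypothetical.

```python
import math

def binary_effects(events_t, n_t, events_c, n_c):
    """OR, RR, and risk difference from a 2x2 table (treatment vs control)."""
    risk_t, risk_c = events_t / n_t, events_c / n_c
    or_ = (events_t / (n_t - events_t)) / (events_c / (n_c - events_c))
    return {"OR": or_, "RR": risk_t / risk_c, "RD": risk_t - risk_c}

def smd(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference (Cohen's d) using the pooled SD."""
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / sd_pooled

# CAST counts from Module 1: 56/755 deaths on drug vs 22/743 on placebo
print(binary_effects(56, 755, 22, 743))             # RR ≈ 2.5, as reported
print(round(smd(12.0, 4.0, 50, 10.0, 4.0, 50), 2))  # hypothetical means: d = 0.5
```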

!

Implausible effect sizes: 80-90% reductions should prompt scrutiny

!

Baseline imbalances: Groups that are "too perfectly" matched

!

Round numbers: "Exactly 50" or "exactly 100" patients per arm

!

Registry discrepancies: Published N differs from registered N

Researcher

Studies report results in different metrics. To pool them, you often need conversions:

From | To | Formula
SMD (d) | log-OR | log-OR = d × π / √3
log-OR | SMD (d) | d = log-OR × √3 / π
Correlation (r) | Fisher z | z = 0.5 × ln((1+r)/(1−r))
OR | RR | RR = OR / (1 − P₀ + P₀ × OR)
OR | NNT | NNT = 1 / (P₀ − OR×P₀ / (1−P₀+OR×P₀))

P₀ = baseline risk in the control group. These formulas are approximations; see Borenstein et al. (Ch. 7) for exact derivations and conditions.
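A minimal Python sketch of the conversions above; the input values are illustrative only.

```python
import math

def smd_to_log_or(d):
    """Logistic-distribution conversion (Chinn 2000)."""
    return d * math.pi / math.sqrt(3)

def fisher_z(r):
    return 0.5 * math.log((1 + r) / (1 - r))

def or_to_rr(odds_ratio, p0):
    """Zhang-Yu approximation; p0 = control-group (baseline) risk."""
    return odds_ratio / (1 - p0 + p0 * odds_ratio)

def or_to_nnt(odds_ratio, p0):
    eer = odds_ratio * p0 / (1 - p0 + p0 * odds_ratio)  # treated-group risk
    return 1 / (p0 - eer)  # a negative result is a number needed to harm

print(round(smd_to_log_or(0.56), 2))  # 1.02
print(round(or_to_rr(2.5, 0.03), 2))  # 2.39: OR ≈ RR when baseline risk is low
```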

Researcher

Many trials report time-to-event outcomes using hazard ratios (HR). Pooling HRs in meta-analysis requires special handling:

1

The log(HR) + SE Method

Extract log(HR) and its SE from the trial. If not reported, derive SE from the CI: SE = (ln(upper) − ln(lower)) / (2 × 1.96). Pool using standard inverse-variance methods.

2

When HR Is Not Reported

Methods exist to reconstruct IPD from Kaplan–Meier curves (Guyot et al. 2012) or estimate HR from p-values and event counts (Parmar et al. 1998). Always prefer the directly reported adjusted HR when available.

HR < 1 favors treatment; HR > 1 favors control. Do not convert HRs to ORs or RRs—they measure fundamentally different quantities.
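A minimal sketch of the log(HR) + SE method described above; the trial numbers are hypothetical.

```python
import math

def log_hr_and_se(hr, ci_lower, ci_upper):
    """Recover log(HR) and its SE from a reported 95% CI."""
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * 1.96)
    return math.log(hr), se

# Hypothetical trial report: HR 0.80 (95% CI 0.65 to 0.98)
log_hr, se = log_hr_and_se(0.80, 0.65, 0.98)
print(f"log(HR) = {log_hr:.3f}, SE = {se:.3f}, weight = {1 / se**2:.1f}")
# These (log_hr, se) pairs feed standard inverse-variance pooling.
```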

What if the data you extract was never real?

REAL DATA

Joachim Boldt was the most prolific researcher in anesthesia fluid management. Over 180 of his publications were retracted — one of the largest retraction cases in medical history. His fabricated data showed hydroxyethyl starch (HES) was safe, and meta-analyses that included his studies concluded HES was harmless. When Boldt's studies were removed, the pooled effect reversed: HES increased kidney injury by 59% (RR 1.59, 95% CI 1.26-2.00) and mortality by ~9% (RR 1.09). Thousands of patients are estimated to have received a harmful fluid on the strength of fabricated evidence.

The Extractor's Vigilance: 2010
You are extracting data for a fluid resuscitation meta-analysis. Boldt's studies dominate the literature (90+ papers). A whistleblower has raised concerns. What do you do?
PATH A: Extract as Published
Trust peer-reviewed publications. Extract Boldt's data like any other.
Your meta-analysis shows HES is safe. Guidelines recommend it.
OUTCOME: Thousands receive a nephrotoxic fluid
PATH B: Verify Provenance
Cross-check ethics approvals, request source data, conduct sensitivity analysis excluding suspicious studies
Discover missing ethics approvals. Flag studies. Re-analyze without them.
OUTCOME: True signal emerges — HES causes harm
THE REVELATION
Provenance is not bureaucracy. It is the difference between evidence and fiction. Every extracted number must trace to an ethics-approved study with verifiable patient data. Without provenance, a number can become a weapon.

Every number in your meta-analysis

must trace back to a verifiable source.

The number without provenance is not a number.

Fraudulent data can kill as surely as fraudulent drugs.

1. What happened when DECREASE trial data was removed from beta-blocker meta-analyses?

A. The benefit became even larger
B. No change in conclusions
C. The direction reversed to show potential harm
D. Results became inconclusive

2. Why should dual extraction be standard practice?

A. It catches transcription errors and forces scrutiny
B. It makes extraction faster
C. It helps find more studies
D. It reduces the amount of work needed

The number without provenance is not a number.

Module 6: The Bias

Methods protect patients from our confidence.

This is a story about

the bias we cannot see.

🎯 Learning Objectives

  • Apply Risk of Bias 2.0 (RoB 2) to randomized trials
  • Apply ROBINS-I to non-randomized studies
  • Assess all five RoB 2 domains (randomization, deviations, missing data, measurement, selection)
  • Distinguish confounding by indication from true treatment effects
  • Explain how BART revealed hidden harms of aprotinin

20 years on the market

Aprotinin was the gold standard for reducing surgical bleeding.

Then someone ran an RCT. The truth was different.

1

Sicker patients got aprotinin: Surgeons used it in complex, high-risk cases

2

Survivors bias: Dead patients can't report complications

3

Publication bias: Negative studies weren't published

Observational studies could not separate the drug's effect from the patient's baseline risk.

Blood Conservation Using Antifibrinolytics in a Randomized Trial

Outcome | Aprotinin | Alternatives
30-day mortality | 6.0% | 3.9%
Relative Risk | 1.53 (53% increase in deaths)
Trial Stopped Early for Harm
Withdrawn from market November 2007
🔍

Investigation: Assess the Bias

You are reviewing the observational studies. Apply Risk of Bias thinking:

Question | Observational | BART (RCT)
Random allocation? | ❌ Surgeon choice | ✓ Yes
Baseline comparable? | ❌ Sicker got drug | ✓ Balanced
Blinding? | ❌ Open label | ✓ Double-blind

Confounding by indication: Surgeons gave aprotinin to the sickest patients. The observational studies attributed survival to the drug, when they were measuring survivorship bias.

D1

Randomization Process

D2

Deviations from Intended Interventions

D3

Missing Outcome Data

D4

Measurement of the Outcome

D5

Selection of the Reported Result

When RCTs are unavailable, use ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions)

1

Confounding

Baseline differences between groups

2

Selection of Participants

Exclusions related to intervention

3

Classification of Interventions

Misclassification of exposure status

4

Deviations from Intended Interventions

Co-interventions, contamination

5

Missing Data

Differential loss to follow-up

6

Measurement of Outcomes

Ascertainment bias

7

Selection of Reported Result

Selective reporting

Ratings: Low / Moderate / Serious / Critical / No information

What happens when 64 studies agree — and they are all wrong?

REAL DATA

Aprotinin was used in cardiac surgery to reduce bleeding for 20 years. 64 small randomized trials suggested it was safe and effective. Meta-analyses confirmed benefit. Then the BART trial (2008) randomized 2,331 patients: aprotinin vs. tranexamic acid vs. aminocaproic acid. Result: aprotinin increased mortality by 53% (RR 1.53, 95% CI 1.06-2.22). The trial was stopped early for harm. Bayer withdrew aprotinin from the market within months.

The Surgeon's Evidence: 2006
You are a cardiac surgeon choosing an antifibrinolytic. 64 small trials favor aprotinin, but none were powered to detect mortality. A large RCT (BART) is enrolling. Do you wait?
PATH A: Trust the Meta-Analysis
64 trials can't all be wrong. Continue prescribing aprotinin.
Small trials measured bleeding, not death. None had adequate power for mortality. The meta-analysis pooled underpowered surrogate results.
OUTCOME: Excess deaths in cardiac surgery patients
PATH B: Assess Risk of Bias First
Rate all 64 trials with RoB. Notice they are small, use surrogate outcomes, and have high attrition. Wait for the adequately powered RCT.
BART reveals the truth. Switch to safer alternatives.
OUTCOME: Lives saved by demanding adequately powered evidence
THE REVELATION
The quantity of evidence does not equal quality. Sixty-four underpowered trials measuring the wrong outcome do not outweigh one adequately powered trial measuring mortality. Risk of bias assessment is not a formality — it is the shield between patients and misleading conclusions from small, surrogate-driven evidence.

Sixty-four small trials measured bleeding, not death.

One adequately powered trial revealed 53% increased mortality.

Quantity of evidence cannot substitute for quality and power.

1. Why did 64 small trials miss aprotinin's harm?

A. Underpowered for mortality; used surrogate outcomes
B. Confounding by indication
C. Outcome measured incorrectly
D. Follow-up too short

Methods protect patients from our confidence.

Module 7: The Synthesis

Heterogeneity is a message, not noise.

The Magnesium Controversy: 1991-1995

When pooling leads us astray.

🎯 Learning Objectives

  • Calculate pooled effect sizes using fixed-effect and random-effects models
  • Choose between DerSimonian-Laird and HKSJ estimators appropriately
  • Interpret forest plots including weights, confidence intervals, and diamonds
  • Explain why small-study effects can mislead meta-analyses
  • Apply the principle: "Heterogeneity is a message, not noise"

"You stand at the crossroads of hope and evidence..."

Heart disease kills more people worldwide than any other cause. In 1991, a new hope emerges: Could something as simple and cheap as intravenous magnesium save lives after myocardial infarction?

The biological rationale was sound:

Magnesium stabilizes cardiac membranes, prevents arrhythmias, and vasodilates coronary arteries.

Leicester Intravenous Magnesium Intervention Trial, 1992

2,316
Patients enrolled
24%
Mortality reduction
p = 0.04
Statistically significant

A cheap, safe intervention that could save 250,000 lives per year globally.

The medical community was electrified.

Researchers pooled seven randomized trials of IV magnesium in MI:

Trial | Year | N | Odds Ratio
Morton 1984 | 1984 | 40 | 0.10
Rasmussen 1986 | 1986 | 273 | 0.35
Smith 1986 | 1986 | 400 | 0.48
Abraham 1987 | 1987 | 94 | 0.87
Shechter 1990 | 1990 | 103 | 0.27
Ceremuzynski 1989 | 1989 | 48 | 0.22
LIMIT-2 | 1992 | 2,316 | 0.74
🔍

Investigation Exercise: The Meta-Analyst's Dilemma

You are a Cochrane reviewer in 1993. You have been asked to synthesize the evidence on magnesium for MI. The data from seven trials sits before you.

Do you see the pattern in this forest plot?

Pooled OR = 0.44 (95% CI: 0.27–0.71)
55% mortality reduction! Publish in the Lancet?

But wait... do you notice anything about the trial sizes?

What should have given us pause?

1

Small sample sizes: Six of seven trials had <500 patients

2

Extreme effects: OR of 0.10 (90% reduction) is implausible for any drug

3

All positive: Where were the negative trials? The file drawer problem...

4

Funnel asymmetry: Small trials showed much larger effects than larger ones

🔍

The Funnel Plot Test

Before we pool, we must check for publication bias. Let us examine the funnel plot.

"And then came the truth..."

The Fourth International Study of Infarct Survival (ISIS-4) enrolled 58,050 patients across 1,086 hospitals in 31 countries.

58,050
Patients
2,216
Deaths in Mg group
2,103
Deaths in placebo
OR = 1.06 (95% CI: 1.00–1.12)
No benefit. If anything, a trend toward harm.
📊

Before and After: The Complete Picture

Watch what happens when we add the mega-trial to our forest plot...

BEFORE ISIS-4

7 small trials (N = 3,274)

OR = 0.44

Strong benefit signal

AFTER ISIS-4

8 trials (N = 61,324)

OR = 1.02

No effect

1

Publication Bias

Small negative trials were never published—they sat in file drawers

2

Small-Study Effects

Smaller trials tend to show larger effects due to methodological weaknesses

3

Random High Bias

By chance, some small trials hit extreme results—and those get published

4

Random-Effects Amplification

Random-effects models give more weight to small trials, amplifying bias

Which model should you choose?

FIXED EFFECT MODEL

Assumes one true effect. Weights studies by inverse variance (precision). Large trials dominate.

Magnesium result: OR = 0.96 (p = 0.52)

RANDOM EFFECTS MODEL

Assumes distribution of effects. Gives more weight to small trials. Wider confidence intervals.

Magnesium result: OR = 0.59 (p = 0.01)

⚠️ The model choice determined the conclusion!

Random effects does not fix bias; with small-study effects, it may shift weight toward smaller trials and change conclusions.
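To make the contrast concrete, here is a minimal sketch of inverse-variance fixed-effect pooling and DerSimonian-Laird random-effects pooling. The six log-ORs are invented to mimic the small-trials-plus-mega-trial pattern; they are not the actual magnesium statistics.

```python
import math

def pool(effects, variances, random_effects=False):
    """Inverse-variance pooling; DerSimonian-Laird tau^2 if random_effects."""
    w = [1 / v for v in variances]
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    tau2 = 0.0
    if random_effects:
        q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (len(effects) - 1)) / c)  # DL estimate, floored at 0
        w = [1 / (v + tau2) for v in variances]        # re-weight with tau^2
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    return pooled, math.sqrt(1 / sum(w)), tau2

# Invented log-ORs: five small "positive" trials plus one near-null mega-trial
log_or = [-1.0, -0.9, -1.2, -0.8, -1.1, 0.05]
var = [0.25, 0.30, 0.40, 0.35, 0.30, 0.002]
for label, re_flag in (("Fixed ", False), ("Random", True)):
    est, se, tau2 = pool(log_or, var, random_effects=re_flag)
    print(f"{label}: OR = {math.exp(est):.2f} (SE {se:.2f}, tau2 {tau2:.2f})")
```

The fixed-effect estimate sits near the mega-trial; the random-effects estimate drifts toward the small trials, which is exactly the amplification warned about above.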

1. Check for publication bias before trusting a pooled estimate. Use funnel plots and Egger's test when study count is sufficient (about 10 or more), and interpret asymmetry in context.

2. Be wary of small-study effects. If only small trials show benefit, wait for a large, well-conducted trial.

3. Model choice matters. Random effects does not repair biased evidence. Compare prespecified sensitivity analyses (fixed, random, influence analysis) and explain why conclusions are robust.

4. One large trial can overturn many small ones. This is why mega-trials like ISIS-4 are so valuable.

Researcher

Not all RCTs use standard parallel-group designs. Two common alternatives require special handling when pooling results:

1

Cluster-Randomized Trials

Randomize groups (hospitals, schools), not individuals. The design effect = 1 + (m−1) × ICC, where m is the average cluster size, reduces the effective sample size. Divide N by the design effect before pooling, or use the adjusted SE from the trial. Ignoring clustering produces artificially narrow CIs (a sketch follows below).

2

Crossover Trials

Each patient receives both treatments. The paired design reduces variance, but you need the within-patient correlation (or the paired analysis SE) to pool correctly. Using the parallel-group SE is conservative; using the wrong N double-counts patients.

See Cochrane Handbook v6.4, Chapter 23 for detailed formulas and worked examples.
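A minimal sketch of the cluster adjustment from item 1, with illustrative numbers.

```python
def effective_sample_size(n, mean_cluster_size, icc):
    """Deflate N by the design effect 1 + (m - 1) * ICC."""
    return n / (1 + (mean_cluster_size - 1) * icc)

# Hypothetical cluster-randomized trial: 2,000 patients, clusters of 50, ICC 0.05
print(round(effective_sample_size(2000, 50, 0.05)))  # -> 580
```

For crossover trials (item 2), the analogous fix is to use the paired-analysis SE rather than a parallel-group one.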

What if the way you combine studies determines whether a treatment looks life-saving or useless?

REAL DATA

Early surfactant for premature babies was supported by 6 small trials showing reduced mortality (RR 0.84). A fixed-effect meta-analysis confirmed benefit (p=0.04). But a random-effects model showed no significance (p=0.12) — the confidence interval crossed 1.0. Later, SUPPORT (2010) and VON (2012), two large pragmatic trials with ~2,000 neonates combined, found no benefit of early vs. later surfactant. Clinical practice had been changed based on small trials and the wrong model.

The Neonatologist's Model Choice: 2005
You are updating a Cochrane review of early surfactant. Six small trials show benefit with a fixed-effect model. The random-effects model is non-significant. Which do you report?
PATH A: Report Fixed-Effect Only
Fixed-effect is significant. Report the positive result. Change practice.
NICUs adopt early surfactant. Later trials show no benefit. Practice reverses.
OUTCOME: Years of unnecessary intubation of premature infants
PATH B: Report Both Models
Show FE and RE results. Flag that significance depends on model choice. Call for large trials.
Honest uncertainty. Large trials prioritized. True answer emerges faster.
OUTCOME: Premature babies spared unnecessary intervention
THE REVELATION
When a conclusion changes depending on whether you use fixed-effect or random-effects, the conclusion is fragile. Report both. Acknowledge uncertainty. And remember: a fragile result from small trials is not a mandate to change practice.

1. Why did the magnesium meta-analysis show benefit that ISIS-4 did not find?

A. ISIS-4 methodology was flawed
B. Calculation error in meta-analysis
C. Publication bias in small trials
D. LIMIT-2 was underpowered

2. What warning sign should have alerted reviewers to potential bias?

A. Asymmetric funnel plot (small trials showing larger effects)
B. Low heterogeneity (I² = 0%)
C. Strong biological plausibility
D. Too few trials to analyze

3. When publication bias is suspected, which model may amplify the bias?

A. Fixed effect model
B. Random effects model
C. Bayesian model
D. Network meta-analysis

Small trials can show false signals.

Large trials anchor the truth.

Heterogeneity is a message, not noise.

Heterogeneity is a message, not noise.

Module 8: The Heterogeneity

Heterogeneity is a message, not noise.

ACCORD: 2008

When the average hides the truth.

🎯 Learning Objectives

  • Calculate and interpret I², τ², and prediction intervals
  • Apply ICEMAN criteria to assess subgroup credibility
  • Distinguish between clinical, methodological, and statistical heterogeneity
  • Conduct and interpret leave-one-out sensitivity analyses
  • Explain how ACCORD revealed differential effects across subgroups

"You are about to witness one of the most shocking trial terminations in history..."

For decades, the diabetes community had one guiding principle: lower blood sugar is better. The landmark DCCT (1993) and UKPDS (1998) showed that intensive glucose control reduced microvascular complications—blindness, kidney failure, nerve damage.

The logical extrapolation:

If controlling glucose prevents complications, shouldn't intensive control prevent cardiovascular disease too?

The definitive test of intensive glucose control

10,251
Type 2 diabetics
HbA1c <6%
Intensive target
HbA1c 7-7.9%
Standard target

All patients had Type 2 diabetes with high cardiovascular risk—either established cardiovascular disease or multiple risk factors. The trial was designed for 5.6 years.

February 6, 2008

The Data Safety Monitoring Board calls an emergency meeting.

After 3.5 years, they make an unprecedented decision:

STOP THE TRIAL.

Outcome Intensive Standard HR (95% CI)
Primary CV endpoint 352 events 371 events 0.90 (0.78–1.04)
All-cause mortality 257 deaths 203 deaths 1.22 (1.01–1.46)
Severe hypoglycemia 10.5% 3.5% 3.0× higher
22% increase in mortality
54 excess deaths in the intensive arm
🔍

Investigation Exercise: The Clinician's Dilemma

You are an endocrinologist with 500 diabetic patients. The ACCORD results are published. What do you tell your patients who have been striving for HbA1c <6%?

Is intensive control harmful for everyone? Or only for some?

Subgroup Analysis Revealed:

Subgroup | Intensive HR | Interpretation
No prior CVD | 1.00 (0.76–1.32) | No effect
Prior CVD | 1.45 (1.15–1.84) | Significant harm
Baseline HbA1c <8% | 1.02 (0.75–1.40) | No effect
Baseline HbA1c ≥8% | 1.29 (1.03–1.60) | Harm

The average effect masked critical heterogeneity!

For patients with established CVD or poor baseline control, intensive therapy was harmful.

When studies (or subgroups) show different effects, we must quantify this variation.

I² = 0–25%: Low heterogeneity. Effects are consistent across studies.

I² = 25–50%: Moderate. Look for sources of variation.

I² = 50–75%: Substantial. Consider whether pooling is appropriate.

I² = 75–100%: Considerable. A single pooled estimate may mislead.

But I² alone doesn't tell you what to do—it signals that you need to investigate further.

While I² tells you the proportion of variance due to heterogeneity, τ² tells you the magnitude.

I² (percentage)

"What fraction of the total variance is due to true differences between studies?"

Scale: 0% to 100%

τ² (absolute)

"How much do the true effects vary between studies?"

Same scale as the effect measure

Use τ² to calculate prediction intervals

A prediction interval shows the range of effects you'd expect in a new study—often much wider than the confidence interval.
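A minimal sketch of that calculation, using the t-based formula of Higgins, Thompson & Spiegelhalter (2009); the inputs are illustrative, chosen to echo (not reproduce) the HR 1.10 example that follows.

```python
import math
from scipy import stats

def prediction_interval(mu, se, tau2, k, level=0.95):
    """Approximate PI for the true effect in a new study,
    using a t distribution with k - 2 df (Higgins et al. 2009)."""
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=k - 2)
    half = t_crit * math.sqrt(tau2 + se ** 2)
    return mu - half, mu + half

# Illustrative values on the log-HR scale: pooled HR 1.10 from k = 8 trials
mu, se, tau2, k = math.log(1.10), 0.07, 0.04, 8
lo, hi = prediction_interval(mu, se, tau2, k)
print(f"Pooled HR {math.exp(mu):.2f}; "
      f"95% PI {math.exp(lo):.2f} to {math.exp(hi):.2f}")  # spans benefit and harm
```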

📊

The Prediction Interval: What ACCORD Really Tells Us

Consider a meta-analysis of intensive glucose control across multiple trials...

Confidence Interval

HR 1.10 (0.95–1.27)

"Our best estimate of the average effect"

Prediction Interval

HR 1.10 (0.70–1.73)

"The range of effects in a new setting"

The prediction interval spans both benefit and harm!

In some settings, intensive control might help. In others, it could kill.

Subgroup Credibility Criteria (adapted from ICEMAN, Schandelmaier 2020 & Sun 2012)

1

Was the subgroup analysis pre-specified?

Post-hoc subgroups are prone to data dredging

2

Is there a plausible biological rationale?

Mechanism should be clear and independent of the data

3

Is the effect consistent across related outcomes?

If harm appears for mortality, is there similar harm for MI, stroke?

4

Is there independent replication?

Has the subgroup effect been confirmed in other studies?

Criterion | Assessment
Pre-specified? | Yes—prior CVD was in the protocol
Biological rationale? | Yes—hypoglycemia more dangerous with CVD
Consistent outcomes? | Yes—CV mortality and all-cause mortality aligned
Independent replication? | Partially—ADVANCE, VADT showed similar patterns

ICEMAN Rating: High Credibility

The differential harm in high-risk patients appears genuine.

For patients without CVD: Moderate glucose control (HbA1c ~7%) remains the goal. Intensive control may reduce microvascular complications.

For patients with established CVD: Avoid intensive targets. Hypoglycemia is dangerous for damaged hearts.

For elderly patients: Relaxed targets. Quality of life matters. Tight control causes falls, confusion, and excess mortality.

"One size fits all" treatment is not patient-centered medicine.

When heterogeneity is high, meta-regression can identify study-level covariates that explain variation.

THE QUESTION

Does the effect size vary systematically with study characteristics?

Covariates
Year, dose, duration, baseline risk, study quality
Output
Regression coefficient (slope), R², residual heterogeneity

Caution

Meta-regression requires ≥10 studies per covariate. With few studies, it is exploratory only. Ecological fallacy: study-level associations may not apply to individuals.

Example: In ACCORD, meta-regression might test if treatment effect varies by baseline HbA1c, showing harm concentrated in patients with very high levels.
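A minimal fixed-effect sketch of such a meta-regression via weighted least squares; a full random-effects meta-regression (e.g., metafor in R) would also estimate residual τ². The five studies here are hypothetical, and with so few studies the analysis is exploratory, as the caution above demands.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical study-level data: log hazard ratio, its variance,
# and mean baseline HbA1c per trial (for illustration only)
log_hr = np.array([0.02, 0.08, 0.15, 0.24, 0.33])
var = np.array([0.020, 0.030, 0.025, 0.040, 0.035])
baseline_hba1c = np.array([7.0, 7.5, 8.0, 8.5, 9.0])

X = sm.add_constant(baseline_hba1c)             # intercept + covariate
fit = sm.WLS(log_hr, X, weights=1 / var).fit()  # inverse-variance weights
print(f"slope = {fit.params[1]:.3f} per HbA1c unit, R2 = {fit.rsquared:.2f}")
# A positive slope would suggest harm concentrated at higher baseline HbA1c.
```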

What number saves lives? Who decides?

REAL DATA

For decades, the target was: treat blood pressure to <140 mmHg systolic. Then came SPRINT (2015): 9,361 high-risk patients randomized to intensive (<120) vs standard (<140) targets. Intensive treatment reduced CV events by 25% and death by 27%. Trial stopped early for benefit. Guidelines changed worldwide.

Before SPRINT: The Guidelines Committee
You are setting blood pressure guidelines in 2014. The target has been <140 for years. Should you wait for better evidence?
PATH A: Maintain Status Quo
Keep <140 target (established practice, minimal controversy)
Guidelines unchanged. Physicians continue treating to <140.
OUTCOME: Miss opportunity to prevent deaths
PATH B: Fund the Definitive Trial
Wait for SPRINT results before updating targets
SPRINT demonstrates benefit. Update target to <120 for high-risk patients.
OUTCOME: Estimated 100,000+ lives saved globally
JNC 7 (2003): <140
Years of uncertainty
SPRINT (2015): <120 for high-risk
THE REVELATION
"Standard of care" is not fixed. It changes when trials challenge assumptions. For a decade, patients may have been undertreated because no one tested the obvious question.

1. Why was the ACCORD trial stopped early?

A. Intensive control showed clear cardiovascular benefit
B. Intensive control increased mortality
C. Enrollment was too slow
D. Budget ran out

2. What does a prediction interval tell us that a confidence interval doesn't?

A. The true effect is more precisely estimated
B. The sample size is adequate
C. The range of effects we'd expect in a new study
D. The mathematical formula used

3. According to ICEMAN, which factor is MOST important for subgroup credibility?

A. Pre-specification of the subgroup hypothesis
B. Large sample size in the subgroup
C. Statistically significant p-value
D. Multiple outcomes showing same direction

When studies disagree,

listen to the disagreement.

Heterogeneity is a message, not noise.

Absence of evidence is not evidence of absence.

Module 9: The Hidden Studies

Absence of evidence is not evidence of absence.

Reboxetine: 2010

The 74% that never saw light.

🎯 Learning Objectives

  • Interpret funnel plots for asymmetry detection
  • Apply Egger's test and other statistical tests for publication bias
  • Implement trim-and-fill method for bias adjustment
  • Critically appraise the limitations of publication bias tests
  • Apply the principle: "Absence of evidence is not evidence of absence"

"A new hope for depression patients who cannot tolerate SSRIs..."

Reboxetine (Edronax) was a novel antidepressant—a selective norepinephrine reuptake inhibitor (NRI). Unlike SSRIs, it targeted a different neurotransmitter system. For patients who failed or couldn't tolerate fluoxetine or sertraline, it offered a new mechanism.

1997
EU approval
50+
Countries approved
Millions
Prescriptions written

What doctors could find in medical journals:

Comparison | Published Trials | Published Result
Reboxetine vs Placebo | 3 trials (n=507) | Significantly better (SMD = 0.56)
Reboxetine vs SSRIs | 4 trials (n=628) | Equivalent or better

The published literature told a clear story:

Reboxetine works. Patients benefit. Prescribe with confidence.

But what about the trials you couldn't see?

In 2010, German researchers at IQWiG made a request to the European Medicines Agency...

They demanded access to all trial data—published and unpublished.

What they found changed everything.

Eyding et al., BMJ 2010

Comparison | Published Only | ALL DATA
Reboxetine vs Placebo | SMD 0.56 (benefit) | SMD 0.10 (no benefit)
Patients in analysis | 507 (14%) | 2,731 (100%)
Reboxetine vs SSRIs | Equivalent | Inferior (RR 1.23 for harm)
Patients in analysis | 628 (26%) | 2,411 (100%)
74% of patient data was never published
The hidden trials showed no benefit and more harm
🔍

Investigation Exercise: The File Drawer

You are a systematic reviewer in 2008. You search PubMed, Embase, and the Cochrane Library for all reboxetine trials. You find 7 published trials showing benefit.

Can you trust this evidence?

⚠️ The funnel is drastically asymmetric!

All published studies cluster on one side. Where are the null and negative trials?

1

Funnel Plot

Plot effect size vs. standard error. A symmetric funnel suggests no bias; asymmetry raises alarms.

2

Egger's Regression Test

Regress effect/SE on 1/SE. A non-zero intercept (P < 0.10) suggests small-study effects. Note: inflated false-positive rate with binary outcomes; use Peters' test instead. A sketch follows this list.

3

Peters' Test

For binary outcomes, regresses log OR on inverse of total sample size. Less prone to false positives.

4

Trim-and-Fill

Imputes "missing" studies to make the funnel symmetric, then recalculates the pooled effect.

📊

Interactive: Trim-and-Fill Analysis

Let us apply trim-and-fill to the reboxetine data and see what the adjusted estimate would be...

Published Only

7 trials

SMD = 0.56

Significant benefit

Trim-and-Fill

7 + 5 imputed = 12 trials

SMD = 0.23

Reduced, still nominally significant

But even trim-and-fill underestimated the problem!

The true effect with all data was SMD = 0.10 (essentially null).
Trim-and-fill is conservative—it doesn't fully correct for selective publication.

Publication bias detection methods are imperfect. The real solution is prospective registration.

ClinicalTrials.gov
US registry (2000)
WHO ICTRP
Global portal
PROSPERO
Review registration

When searching for trials, always check registries. Compare the number of registered trials to the number published. The gap is your warning signal.

Since 2005, ICMJE requires trial registration as a condition of publication.

"All trials registered. All results reported."

The reboxetine scandal, along with similar cases in other drugs, catalyzed a global movement:

2013: EMA Clinical Data Policy

European Medicines Agency commits to publishing clinical study reports

2016: FDA Amendments Act enforcement

Mandatory results reporting on ClinicalTrials.gov within 12 months

AllTrials Coalition

Over 90,000 supporters, 700+ organizations demanding transparency

!

Germany's IQWiG recommended against reboxetine for depression

!

The UK's NICE downgraded it to "not recommended"

!

The FDA had rejected reboxetine in 2001 (they had access to unpublished data)

For over a decade, patients received a drug no better than placebo.

Because only the positive trials were published.

What if the published conclusion is the opposite of the actual data?

REAL DATA

GlaxoSmithKline's Study 329 tested paroxetine in adolescent depression. The published paper (2001) concluded paroxetine was "generally well tolerated and effective." The actual data: paroxetine failed on all 8 pre-specified outcomes, and the published paper re-defined outcomes post-hoc to manufacture significance. In 2015, a RIAT (Restoring Invisible and Abandoned Trials) re-analysis of the original clinical study report counted 23 suicidal/self-harm events on paroxetine vs 5 on placebo and concluded that paroxetine was neither safe nor effective for adolescents.

The Prescriber's Puzzle: 2003
You are a child psychiatrist. Study 329 — the only large trial — says paroxetine works in teens. But the FDA has not approved it for adolescents. A parent asks you to prescribe it. What do you do?
PATH A: Trust the Publication
A peer-reviewed JAACAP paper says it works. Prescribe off-label.
Millions of prescriptions worldwide. Suicidal events in adolescents.
OUTCOME: FDA issues black box warning for SSRIs in youth (2004)
PATH B: Check the Trial Registry
Search ClinicalTrials.gov for original endpoints. Notice the published outcomes don't match the registered protocol.
Red flag: outcome switching detected. You withhold the drug. Patient is safer.
OUTCOME: Publication bias identified before harm
THE REVELATION
Publication bias is not just about missing studies. It is about missing truth within published studies. Outcome switching, ghost-writing, and selective reporting can turn a failed trial into a marketing tool. Always compare published results against trial registry protocols.

1. What percentage of reboxetine trial data was hidden from the published literature?

A. 25%
B. 50%
C. 74%
D. 90%

2. Why can trim-and-fill underestimate the correction needed?

A. It assumes effects are normally distributed
B. It only imputes studies to achieve symmetry, which may not fully reflect reality
C. It requires at least 20 studies
D. It only works with very large studies

3. What is the best prospective defense against publication bias?

A. Funnel plots in all meta-analyses
B. Egger's test before pooling
C. Prospective trial registration
D. More medical journals

What you cannot see

may be more important than what you can.

Absence of evidence is not evidence of absence.

Certainty must be earned, not assumed.

Module 10: The Certainty

Certainty must be earned, not assumed.

Early Surfactant: 2012

When high-quality evidence evolves.

🎯 Learning Objectives

  • Apply the complete GRADE framework to assess certainty of evidence
  • Evaluate all five downgrade factors (RoB, inconsistency, indirectness, imprecision, publication bias)
  • Identify when to upgrade for large effect, dose-response, or confounding
  • Construct Summary of Findings tables with absolute effect estimates
  • Apply the principle: "Certainty must be earned, not assumed"

"A revolution in neonatal care..."

Respiratory Distress Syndrome (RDS) was the leading cause of death in premature infants. The development of exogenous surfactant—the substance that keeps alveoli from collapsing—was one of the great advances in neonatal medicine.

The question became: When should we give surfactant?

Prophylactically (to all high-risk infants) or selectively (only after RDS develops)?

Multiple RCTs conducted before the era of routine CPAP

Outcome | Prophylactic vs Selective | Certainty
Neonatal mortality | RR 0.73 (favors prophylactic) | High
BPD (bronchopulmonary dysplasia) or death | RR 0.84 (favors prophylactic) | High
Recommendation: Give surfactant prophylactically
Guidelines worldwide adopted this approach

But the world of neonatal care was changing...

A new technology emerged: Continuous Positive Airway Pressure (CPAP)

Non-invasive support that could help preterm lungs without intubation.

Would the old evidence still apply?

New trials conducted in the CPAP era

Outcome | Old Trials | New Trials
BPD or death | RR 0.84 (favors prophylactic) | RR 1.12 (favors selective)
Need for mechanical ventilation | Lower with prophylactic | Higher with prophylactic!
Complete Reversal
In the CPAP era, prophylactic surfactant causes more harm
🔍

Investigation: Why Did Evidence Evolve?

You are a neonatologist. A colleague asks: "How can randomized trials contradict each other?"

Was the original evidence wrong?

1

Indirectness Changed

Old trials: No CPAP available. New trials: CPAP standard of care.

2

The Comparator Improved

Selective surfactant + CPAP is better than prophylactic intubation.

3

Context Matters

Evidence from one era may not apply to another.

This is why GRADE assesses Indirectness!

High-quality evidence can become inapplicable when context changes.

Grading of Recommendations, Assessment, Development and Evaluations

GRADE answers the question: How confident are we in this estimate?

⊕⊕⊕⊕ HIGH: Very confident. True effect is close to the estimate.

⊕⊕⊕◯ MODERATE: Moderately confident. True effect likely close, but may differ substantially.

⊕⊕◯◯ LOW: Limited confidence. True effect may differ substantially.

⊕◯◯◯ VERY LOW: Very little confidence. True effect likely substantially different.

RCT evidence starts at HIGH. It can be downgraded for:

1

Risk of Bias

Flawed randomization, lack of blinding, incomplete follow-up, selective reporting

2

Inconsistency

Unexplained heterogeneity across studies (large I², non-overlapping CIs)

3

Indirectness

Differences in population, intervention, comparator, or outcomes from the question

4

Imprecision

Wide confidence intervals, small sample size, few events

5

Publication Bias

Asymmetric funnel plot, missing registered trials, sponsor influence

Each factor can downgrade by one or two levels

High → Moderate → Low → Very Low

Example: A meta-analysis of RCTs (starts HIGH) with high risk of bias (↓1) and serious indirectness (↓1) would be rated LOW.
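To make that arithmetic concrete, here is a minimal sketch (illustrative only, not an official GRADE tool). The level names and the ±1 steps come from the framework above; the function name and inputs are our own.

```python
# Minimal sketch of GRADE level bookkeeping (illustrative; not an official GRADE tool).
LEVELS = ["Very Low", "Low", "Moderate", "High"]

def grade_certainty(study_design, downgrades, upgrades=0):
    """Combine a starting level with downgrade/upgrade steps.

    study_design: "RCT" starts at High; "observational" starts at Low.
    downgrades:   dict like {"risk of bias": 1, "indirectness": 1} (each 1 or 2).
    upgrades:     total +1 steps (large effect, dose-response, residual confounding).
    """
    start = 3 if study_design == "RCT" else 1
    score = start - sum(downgrades.values()) + upgrades
    return LEVELS[max(0, min(score, 3))]  # clamp to the four GRADE levels

# The worked example above: RCTs downgraded for risk of bias and indirectness.
print(grade_certainty("RCT", {"risk of bias": 1, "indirectness": 1}))  # -> Low
```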

📊

Interactive: Apply GRADE to Surfactant

Let's rate the certainty of evidence for prophylactic surfactant using old vs. new trials.

OLD TRIALS (Pre-CPAP)

Starting: HIGH (RCTs)

Risk of Bias: Low (−0)

Inconsistency: None (−0)

Indirectness: Serious (−1)

Different standard of care today

Final: ⊕⊕⊕◯ MODERATE

NEW TRIALS (CPAP Era)

Starting: HIGH (RCTs)

Risk of Bias: Low (−0)

Inconsistency: None (−0)

Indirectness: None (−0)

Matches current practice

Final: ⊕⊕⊕⊕ HIGH

Observational evidence starts at LOW. It can be upgraded for:

+1

Large Magnitude of Effect

RR >2 or <0.5 with no plausible confounding

+1

Dose-Response Gradient

Higher exposure = larger effect in a consistent pattern

+1

Residual Confounding

All plausible confounders would reduce the effect (strengthens causal inference)

GRADE requires transparent language about confidence:

HIGH: "Prophylactic surfactant reduces mortality..."

MODERATE: "Prophylactic surfactant probably reduces mortality..."

LOW: "Prophylactic surfactant may reduce mortality..."

VERY LOW: "We are uncertain whether prophylactic surfactant reduces mortality..."

This language ensures clinicians understand the strength of the evidence.

Can too much of a lifesaver become a killer?

REAL DATA

1940s-50s: High oxygen concentrations saved premature babies from respiratory failure. Then came an epidemic of blindness—retrolental fibroplasia (now called ROP). Doctors reduced oxygen dramatically. Blindness dropped. But then: increased deaths and brain damage from hypoxia. The optimal oxygen level required decades of trials to find. Recent SUPPORT/BOOST II trials finally defined the therapeutic window: SpO2 91-95%.

The Neonatologist's Dilemma: 1955
You are a neonatologist. Premature babies on high oxygen are going blind. What do you do?
PATH A: Dramatic Reduction
Drastically reduce oxygen to prevent blindness
Blindness rates drop. But some babies die or suffer brain damage from hypoxia.
OUTCOME: Trading one harm for another
PATH B: Systematic Study
Carefully titrate oxygen, study the dose-response relationship
Takes decades but eventually identifies the optimal range.
OUTCOME: Optimize both survival and vision
1940s: High O2 saves lives
1950s: Blindness epidemic
1960s-70s: Deaths from low O2
2010s: SUPPORT/BOOST define optimal range
THE REVELATION
Every intervention has a therapeutic window. Finding it requires measurement, not assumption. The pendulum swung for 60 years before evidence defined the balance.

1. Why did the surfactant recommendation reverse between 2003 and 2012?

A. The original trials were fraudulent
B. CPAP changed the comparator (indirectness)
C. Not enough patients in original trials
D. The outcome was measured differently

2. Which of the following is NOT a GRADE downgrade factor?

A. Risk of bias
B. Imprecision
C. Publication bias
D. Large magnitude of effect

3. What language should be used for LOW certainty evidence?

A. "The intervention reduces..."
B. "The intervention probably reduces..."
C. "The intervention may reduce..."
D. "We are uncertain whether..."

A number is not enough.

You must communicate how certain you are.

Certainty must be earned, not assumed.

Methods protect patients from our confidence.

Module 11: The Living Review

Methods protect patients from our confidence.

COVID-19 Hydroxychloroquine: 2020

When urgency met evidence.

🎯 Learning Objectives

  • Apply Trial Sequential Analysis to determine when evidence is sufficient
  • Design and maintain a living systematic review
  • Establish update triggers and futility/harm boundaries
  • Manage multiplicity and alpha-spending in sequential analyses
  • Explain how rapid evidence synthesis evolved during COVID-19

"The virus spreads faster than our understanding..."

COVID-19 was killing thousands. ICUs overflowed. There was no vaccine, no treatment. Then a glimmer of hope: hydroxychloroquine (HCQ)—an old malaria drug—showed antiviral activity in lab studies.

March 20
Gautret study (France)
36 pts
Non-randomized
Viral clearance improved

Within weeks of the Gautret study:

!

March 28: FDA issues Emergency Use Authorization for HCQ

!

April 4: India bans HCQ export (hoarding fears)

!

Global: Shortages affect lupus and rheumatoid arthritis patients

Millions received HCQ based on a 36-patient observational study

What could go wrong?

🔍

Investigation: The Gautret Study

You are an EBM expert asked to evaluate the French HCQ study. Examine the design...

Issue | Impact
Non-randomized | Selection bias—who got HCQ?
6 patients excluded | 3 went to ICU, 1 died, 1 withdrew, 1 had nausea
Surrogate outcome | Viral load, not clinical outcomes
Control from different hospital | Different care, different testing
No blinding | Expectation bias in lab testing

This study would score HIGH risk of bias on RoB 2.0

GRADE certainty: VERY LOW. Yet it changed global policy.

1

Immortal Time Bias

Patients must survive long enough to receive treatment, so the treated group is credited with "immortal" survival time that the comparison group never accrues.

2

Confounding by Indication

Sicker patients may get different treatments. Healthier patients received HCQ early.

3

Healthy User Effect

Patients who seek treatment tend to be healthier overall.

4

Outcome Reporting

Studies with positive results got published faster.

Large, rigorous trials completed at remarkable speed

Trial | N | Result
RECOVERY (UK) | 4,716 | No benefit on mortality (RR 1.09)
WHO SOLIDARITY | 954 | No benefit (RR 1.19)
ORCHID (US) | 479 | Stopped for futility
HCQ provided no benefit—and may have caused harm
June 15, 2020: FDA revokes Emergency Use Authorization
📊

Timeline: Observational vs. RCT Evidence

March-May 2020

Observational: ~20 studies

Suggest benefit

Pooled OR ~0.65

June-July 2020

RCTs: RECOVERY, SOLIDARITY

Show no benefit/harm

Pooled RR ~1.10

From "promising" to "ineffective" in 3 months

This is why we need randomization—and living reviews to track evolving evidence.

A new approach for rapidly evolving evidence:

1

Continuous Surveillance

Search literature weekly or even daily for new evidence

2

Cumulative Meta-Analysis

Update pooled estimates as each new trial reports (a worked sketch follows this list)

3

Trial Sequential Analysis (TSA)

Determine when sufficient information has accumulated to conclude

4

Transparent Versioning

Track every change, maintain full audit trail
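Component 2 can be made concrete in a few lines. This is a minimal sketch of cumulative fixed-effect (inverse-variance) pooling; the trial names, dates, and numbers are invented for illustration, not real HCQ data.

```python
# Minimal sketch of a cumulative fixed-effect (inverse-variance) meta-analysis.
# Hypothetical trials: (name, log risk ratio, standard error), sorted by report date.
import math

trials = [
    ("Trial A (Mar)", -0.43, 0.40),   # invented numbers, not real HCQ data
    ("Trial B (May)",  0.02, 0.15),
    ("Trial C (Jun)",  0.09, 0.06),
]

sum_w, sum_wy = 0.0, 0.0
for name, log_rr, se in trials:
    w = 1.0 / se**2                 # inverse-variance weight
    sum_w += w
    sum_wy += w * log_rr
    pooled = sum_wy / sum_w         # running pooled log RR after each trial
    pooled_se = math.sqrt(1.0 / sum_w)
    lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
    print(f"{name}: pooled RR = {math.exp(pooled):.2f} "
          f"(95% CI {math.exp(lo):.2f}-{math.exp(hi):.2f})")
```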

When have we learned enough?

TSA applies stopping boundaries to meta-analysis—similar to interim analysis in a single trial. It accounts for the required information size (RIS) needed to detect or exclude a clinically meaningful effect.

RIS
Required sample size
α-spending
Controls type I error
Boundaries
Benefit / Harm / Futility

For HCQ in COVID, TSA showed futility boundary was crossed by June 2020.
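A minimal sketch of the RIS idea, assuming the standard two-proportion sample-size formula; real TSA software additionally adjusts for heterogeneity (diversity D²) and applies alpha-spending boundaries, which this sketch omits. The accrued count below is a hypothetical placeholder.

```python
# Minimal required-information-size (RIS) sketch for a binary outcome.
from scipy.stats import norm

def required_information_size(p_control, rrr, alpha=0.05, power=0.90):
    """Total patients needed to detect a relative risk reduction `rrr`."""
    p_treat = p_control * (1 - rrr)
    p_bar = (p_control + p_treat) / 2
    delta = p_control - p_treat
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    # Standard two-proportion formula, pooled variance, both groups combined:
    return 4 * (z_a + z_b) ** 2 * p_bar * (1 - p_bar) / delta ** 2

# Illustrative: 25% control-group mortality, aiming to detect a 10% RRR.
ris = required_information_size(0.25, 0.10)
accrued = 6_149   # hypothetical number of patients pooled so far
print(f"RIS ~= {ris:,.0f} patients; accrued {accrued:,} "
      f"({100 * accrued / ris:.0f}% of required information)")
```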

1. Observational studies can mislead spectacularly when bias is prevalent. Even many studies pointing the same direction can be wrong.

2. RCTs can be conducted quickly when the will exists. RECOVERY enrolled 5,000+ patients in weeks.

3. Living reviews are essential for evolving topics. Fixed-point-in-time reviews become obsolete instantly.

4. Political pressure doesn't change biology. Rigorous methods protect patients even when under pressure.

What if the prevention IS the cause?

REAL DATA

For decades, pediatric guidelines recommended: avoid peanuts in infancy to prevent allergy. Meanwhile, peanut allergy rates tripled from 1997 to 2008. Then came LEAP (2015): 640 high-risk infants randomized to early peanut introduction vs. avoidance. Result: Early introduction reduced peanut allergy by 81% (3.2% vs 17.2% overall). The prevention strategy was causing the epidemic.

The Allergist's Crossroads: 2010
You are a pediatric allergist. Peanut allergies are rising despite avoidance guidelines. Do you question the dogma?
PATH A: Follow Guidelines
Continue recommending peanut avoidance in high-risk infants
Guidelines are "evidence-based." Safe to follow consensus.
OUTCOME: Peanut allergies continue to rise
PATH B: Question the Dogma
Design a trial to test if early introduction might be protective
LEAP trial reveals the truth. Guidelines reverse worldwide.
OUTCOME: Prevent an epidemic
2000: AAP recommends avoidance
2008: Allergy rates triple
2015: LEAP reverses the evidence
2017: Guidelines flip to early introduction
THE REVELATION
"First, do no harm" requires evidence. Assumptions, even well-intentioned ones, can cause harm at scale. The immune system needed exposure to develop tolerance—avoidance created sensitization.

1. What was the main flaw in the Gautret hydroxychloroquine study?

A. Too few patients
B. No blinding
C. Excluding patients who deteriorated
D. Too short follow-up

2. What does Trial Sequential Analysis help determine?

A. Which studies have high risk of bias
B. When enough evidence has accumulated
C. The degree of heterogeneity
D. Which treatment is best

3. Why did observational COVID studies show HCQ benefit while RCTs did not?

A. RCTs enrolled sicker patients
B. RCTs used different outcomes
C. Bias in observational studies
D. Observational studies had better data

Speed cannot replace rigor.

But rigor can be fast.

Living reviews balance both.

Not every signal is truth.

Module 12: Advanced Methods

Not every signal is truth.

Advanced Methods

Beyond pairwise meta-analysis.

🎯 Learning Objectives

  • Interpret network meta-analysis geometry and SUCRA rankings
  • Apply bivariate models for diagnostic test accuracy meta-analysis
  • Conduct dose-response meta-analysis with flexible splines
  • Understand when individual patient data (IPD) meta-analysis is needed
  • Recognize the assumptions and limitations of each advanced method

"Sometimes the question is more complex than A versus B..."

The methods you've learned form the foundation. But clinical reality often demands more: Which of 10 antidepressants is best? What's the optimal dose of statin? Does this test accurately diagnose early cancer?

This module introduces four advanced methods—each answering different complex questions.

When you have many treatments but few head-to-head trials

NMA combines direct evidence (A vs B) with indirect evidence (A vs C, B vs C → inferred A vs B) to compare multiple treatments simultaneously.

SUCRA
Ranking probabilities, not effect size
Consistency
Direct = Indirect?
Networks
Visualize evidence
🔍

NMA Example: Antidepressants

The landmark Cipriani 2018 NMA compared 21 antidepressants using 522 trials.

The Challenge

21 drugs, but not every pair tested head-to-head

Many vs. placebo, few vs. each other

The Solution

NMA combines direct and indirect evidence across the network

Produces comparative estimates and rankings, with uncertainty and assumptions made explicit

Result: Some drugs ranked higher for efficacy, others for acceptability

No single drug is universally "best"; interpret rankings with credible intervals, transitivity, and clinical trade-offs.

1

Transitivity

Effect modifiers should be similarly distributed across comparisons; otherwise indirect comparisons may be biased

2

Consistency

Direct and indirect estimates should agree within uncertainty; test globally and locally, then investigate discrepancies

3

Connected Network

All treatments linked through at least one common comparator

When assumptions fail, NMA can mislead

Always assess transitivity and test for inconsistency.

Finding the optimal dose

Dose-response meta-analysis uses the Greenland-Longnecker method with restricted cubic splines to model non-linear relationships between dose and effect.

1

Non-linear patterns

J-shaped (alcohol & mortality), U-shaped (vitamin D), threshold (aspirin)

2

Clinical relevance

Find dose with best benefit-harm balance, not just "more is better"

The gold standard for subgroup analysis

Instead of published summary data, obtain raw patient-level data from trialists. Enables precise subgroup analyses, time-to-event modeling, and standardized definitions.

One-Stage
Single hierarchical model (not mega-trial)
Two-Stage
Analyze, then pool
80%+ target
Data availability goal

The Early Breast Cancer Trialists' Collaborative Group pioneered IPD MA in the 1980s.

When the "intervention" is a test

DTA meta-analysis synthesizes sensitivity (true positive rate) and specificity (true negative rate)—two correlated outcomes requiring bivariate models.

1

Bivariate/HSROC Model

Accounts for correlation between sensitivity and specificity

2

SROC Curve

Summary ROC curve with 95% confidence and prediction regions

3

QUADAS-2

Quality Assessment of Diagnostic Accuracy Studies

Question | Method
Does A beat B? | Pairwise MA
Which of many treatments is best? | Network MA (NMA)
What is the optimal dose? | Dose-Response MA
Who benefits most? (subgroups) | IPD MA
How accurate is this test? | DTA MA
How does the effect evolve over time? | Survival/Time-to-Event MA

The method must match the question. Never force a question into the wrong method.

Three large trials. Three different answers. What do you believe?

REAL DATA

CORTICUS (2008): 499 patients. Hydrocortisone in septic shock. No mortality benefit. ADRENAL (2018): 3,658 patients. Hydrocortisone. No mortality benefit. APROCCHSS (2018): 1,241 patients. Hydrocortisone + fludrocortisone. Mortality reduced (43% vs 49.1%, p=0.03). Same class of intervention. Different protocols. Different results.

The Guidelines Writer's Challenge
You are writing sepsis guidelines. Three major trials disagree. How do you recommend?
PATH A: Simple Average
Pool all three trials. Overall effect uncertain. Conclude "evidence unclear."
Guidelines say steroids are optional. No strong recommendation.
OUTCOME: Clinicians left without clear guidance
PATH B: Investigate Heterogeneity
Analyze why APROCCHSS differed (fludrocortisone, longer duration, different population)
Identify that the effective protocol differs from the ineffective ones.
OUTCOME: Recommend the specific effective protocol
THE REVELATION
Conflicting trials are not failures. They are maps of where the treatment works and where it does not. The differences between trials—dose, duration, co-interventions, population—are the key to understanding.

1. What is the key advantage of Network Meta-Analysis over pairwise?

A. It doesn't require data extraction
B. It compares treatments not directly tested against each other
C. It eliminates the need for risk of bias assessment
D. It produces better forest plots

2. Why does DTA meta-analysis require bivariate models?

A. To handle more than two studies
B. To adjust for publication bias
C. Sensitivity and specificity are correlated
D. To generate forest plots

3. What does the "consistency" assumption in NMA require?

A. All studies must be high quality
B. Direct and indirect evidence must agree
C. Sample sizes must be similar
D. No missing studies
Methodologist

This course covers the full systematic review workflow. For deep dives, explore the companion courses:

DTA Course
Bivariate/HSROC, SROC curves, QUADAS-2
Risk of Bias Mastery
RoB 2, ROBINS-I/E, domain-level assessment
GRADE Certainty
Full SoF tables, GRADE-CERQual
IPD Meta-Analysis
One-stage/two-stage, mixed-effects models
Publication Bias Detective
Copas, PET-PEESE, p-curve, selection models
Umbrella Reviews
AMSTAR 2, ROBIS, overlap correction
Prognostic Reviews
CHARMS, PROBAST, c-statistic pooling
Living Reviews + Rapid Reviews
TSA, update triggers, abbreviated methods

Module 12 Complete

"The method must match the question. Advanced methods answer advanced questions—but the fundamentals never change."

You have mastered the core workflow. The next ten modules explore the frontier: Bayesian inference, network meta-analysis, individual patient data, dose–response modelling, robustness and fragility, equity, AI-assisted synthesis, qualitative evidence, multivariate methods, and reproducibility.

Not every signal is truth.

Module 13: The Bayesian Turn

Not every signal is truth.

Module 13: The Bayesian Turn

🎯 Learning Objectives

  • Explain the difference between frequentist and Bayesian inference
  • Interpret prior distributions, likelihoods, and posterior distributions
  • Distinguish credible intervals from confidence intervals
  • Understand when Bayesian meta-analysis offers advantages
  • Recognize how prior choice affects conclusions

In 2005, a trial began

that would never truly end.

The STAMPEDE trial for prostate cancer used a multi-arm, multi-stage (MAMS) platform design. Arms could be added or dropped as evidence accumulated. Though its statistics were frequentist, the adaptive philosophy embodied the Bayesian spirit: updating decisions as data accumulate.

In frequentist statistics, probability means long-run frequency. A 95% CI does NOT mean "95% probability the true effect is inside." It means: if we repeated the study infinitely, 95% of intervals would contain the truth.

p-value
P(data | H₀), not P(H₀ | data)
95% CI
Coverage property, not belief
Fixed
The true parameter is fixed

In Bayesian statistics, probability represents degree of belief. We start with a prior (what we believe before data), update with the likelihood (what data tells us), and get a posterior (updated belief).

1

Prior × Likelihood = Posterior

Bayes' theorem: P(θ|data) ∝ P(data|θ) × P(θ)

2

Credible Intervals

A 95% credible interval is probabilistically interpretable, conditional on the specified model and prior.

Researcher
1

Non-informative (Vague)

Very diffuse priors can reduce prior influence but may be unstable on some scales; weakly informative priors are often safer defaults.

2

Weakly Informative

Normal(0, 1) for log-OR. Regularizes extreme estimates while remaining flexible.

3

Informative

Based on previous evidence. Powerful but controversial. Must be pre-specified.

4

Half-Cauchy for τ

A commonly used weakly informative prior family for heterogeneity. Half-Cauchy(0, 0.5) allows large τ but concentrates near zero.

Researcher

Most Bayesian models cannot be solved analytically. We use Markov Chain Monte Carlo (MCMC) to draw samples from the posterior. Tools: JAGS, Stan, brms (R), PyMC (Python).

Chains
Multiple independent chains (typically 4)
R̂
Convergence: R̂ < 1.01 (strict; older texts use < 1.1)
ESS
Bulk-ESS > 400 for means; tail-ESS > 400 for CIs
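Putting the pieces together, here is a minimal random-effects meta-analysis sketch in PyMC (one of the tools named above). The log-ORs and standard errors are invented; the priors echo the ones discussed earlier.

```python
# Minimal sketch of a Bayesian random-effects meta-analysis in PyMC.
import numpy as np
import pymc as pm
import arviz as az

y = np.array([-0.37, -0.12, -0.55, 0.04, -0.28])   # invented study log odds ratios
se = np.array([0.22, 0.18, 0.30, 0.25, 0.15])      # invented standard errors

with pm.Model():
    mu = pm.Normal("mu", 0, 1)                # weakly informative prior on pooled log-OR
    tau = pm.HalfCauchy("tau", 0.5)           # Half-Cauchy(0, 0.5) prior on heterogeneity
    theta = pm.Normal("theta", mu, tau, shape=len(y))  # study-specific true effects
    pm.Normal("obs", theta, se, observed=y)   # observed effects given sampling error
    idata = pm.sample(2000, tune=1000, chains=4, random_seed=1)

# Check R-hat and effective sample sizes before trusting the posterior.
print(az.summary(idata, var_names=["mu", "tau"]))
```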
Methodologist

Instead of choosing between fixed-effect and random-effects models, Bayesian model averaging (BMA) weights each model by its posterior probability. This accounts for model uncertainty in the final estimate.

BF

Bayes Factors

Interpretation bands are conventions (not absolutes) and can change with prior choices; always report sensitivity to priors.

Adjust the prior strength to see how it affects the posterior. Watch how more data overwhelms the prior.
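If you cannot run the interactive version, this minimal conjugate (normal-normal) sketch shows the same behavior with invented numbers: as the data grow more precise, the likelihood dominates the prior.

```python
# Minimal conjugate (normal-normal) sketch of "prior x likelihood = posterior".
def posterior(prior_mean, prior_sd, data_mean, data_se):
    """Posterior of a normal mean with a known-variance normal likelihood."""
    w_prior, w_data = 1 / prior_sd**2, 1 / data_se**2   # precisions
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * data_mean)
    return post_mean, post_var**0.5

# Optimistic prior (log-OR -0.5, i.e. benefit) vs accumulating null data (log-OR 0).
for label, se in [("small trial", 0.40), ("medium trial", 0.15), ("mega-trial", 0.04)]:
    m, s = posterior(prior_mean=-0.5, prior_sd=0.3, data_mean=0.0, data_se=se)
    print(f"{label:12s}: posterior log-OR = {m:+.3f} (sd {s:.3f})")
# As the standard error shrinks, the posterior is pulled toward the data.
```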



The STAMPEDE Story

STAMPEDE launched in 2005 with 5 research arms comparing treatments for advanced prostate cancer. By 2016 it had added abiraterone and shown a 37% reduction in death (HR 0.63, 95% CI 0.52–0.76).

The platform design embodies Bayesian adaptive thinking: interim analyses guide arm selection, new arms can enter as treatments emerge, and futile arms drop early—saving patients from ineffective therapies.

STAMPEDE enrolled over 10,000 patients across 100+ centers and fundamentally changed prostate cancer care. The Bayesian mindset let evidence accumulate and inform decisions in real time.

Frequentist vs Bayesian Meta-Analysis
Choose Bayesian when: (1) you have genuine prior information, (2) you need probabilistic statements ("80% chance effect > 0"), (3) few studies make frequentist properties unreliable, or (4) you want to do model averaging.
Bayesian with weakly informative prior
A common practical default. Regularizes extreme estimates without forcing strong prior conclusions.
Bayesian with informative prior
Only when prior evidence is strong and pre-specified. Must do sensitivity analysis.
Stay frequentist
Simpler, well-understood. Preferred when k is large and no prior information.

Remember Module 1?

CAST Through a Bayesian Lens

Had a Bayesian analysis of CAST used an informative prior from basic science (antiarrhythmics suppress PVCs), the posterior would still have shifted strongly toward harm. With enough data, even a strong prior yields to the likelihood. The lesson: Bayesian methods don't protect against bad priors—but they make the assumptions transparent.

Q1. What does a 95% Bayesian credible interval mean?

A. 95% of repeated experiments would produce intervals containing the true value
B. Given the model and prior, there is a 95% posterior probability the parameter lies in this interval
C. The interval has a 95% chance of being correct
D. 95% of future data will fall in this range

Q2. Which is a commonly used weakly informative prior family for between-study heterogeneity (τ)?

A. Uniform(0, 100)
B. Normal(0, 1)
C. Half-Cauchy(0, 0.5)
D. Fixed at 0.5

Module 13 Complete

"The Bayesian turn is not about mathematics. It is about honesty—making our assumptions visible."

Not every signal is truth.

Module 14: The Network

Methods protect patients from our confidence.

Module 14: The Network

🎯 Learning Objectives

  • Explain why pairwise comparisons are insufficient when many treatments exist
  • Interpret network geometry (nodes, edges, thickness)
  • Understand transitivity, consistency, and the role of indirect evidence
  • Interpret SUCRA rankings and league tables
  • Recognize when NMA assumptions are violated

A clinician faces a patient

with depression. Which drug?

There are 21 commonly prescribed antidepressants. Most head-to-head trials compare only 2 or 3. Cipriani et al. (2018, Lancet) connected 522 trials and 116,477 patients into a single network.

1

Direct Evidence

Trials directly comparing A vs B give the most reliable estimate.

2

Indirect Evidence

If A vs C and B vs C exist, we can infer A vs B. This is the transitivity assumption (a worked sketch follows this list).

3

Mixed Evidence

NMA combines both, weighted by precision, to rank all treatments simultaneously.
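A minimal sketch of the indirect-comparison arithmetic behind item 2 (the Bucher method), with invented numbers. Full NMA generalizes this to whole networks, weighting direct and indirect evidence by precision.

```python
# Minimal Bucher indirect-comparison sketch: infer A vs B from A-vs-C and B-vs-C.
import math

def indirect(d_AC, se_AC, d_BC, se_BC):
    """Indirect A-vs-B estimate on the log scale (valid only under transitivity)."""
    d_AB = d_AC - d_BC                         # difference through common comparator C
    se_AB = math.sqrt(se_AC**2 + se_BC**2)     # variances add: indirect evidence is less precise
    return d_AB, se_AB

d, se = indirect(d_AC=-0.30, se_AC=0.12, d_BC=-0.10, se_BC=0.15)  # invented inputs
lo, hi = d - 1.96 * se, d + 1.96 * se
print(f"Indirect A vs B: OR = {math.exp(d):.2f} "
      f"(95% CI {math.exp(lo):.2f}-{math.exp(hi):.2f})")
```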

Each node is a treatment. Edge thickness represents the number of studies comparing those two treatments.

Researcher

Transitivity: Effect modifiers should be similarly distributed across comparisons, so indirect comparisons are clinically coherent.

Consistency: Agreement between direct and indirect estimates within uncertainty. Global (design-by-treatment interaction) and local (node-splitting) tests help identify potential inconsistency loops.

Researcher
SUCRA
Surface Under Cumulative Ranking. Higher values indicate better ranking probability, not guaranteed superiority.
P-score
Frequentist analogue to ranking probability summaries. Interpret with effect sizes and uncertainty.

Caution: Ranking is seductive but misleading when differences between treatments are small or uncertain. Always report credible/confidence intervals alongside ranks.
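A minimal sketch of how SUCRA is computed from a matrix of ranking probabilities; the matrix below is invented for illustration.

```python
# Minimal SUCRA sketch. Rows: treatments; columns: P(ranked 1st), P(2nd), ...
import numpy as np

rank_probs = np.array([
    [0.60, 0.25, 0.10, 0.05],   # treatment A (invented)
    [0.25, 0.40, 0.20, 0.15],   # treatment B
    [0.10, 0.25, 0.45, 0.20],   # treatment C
    [0.05, 0.10, 0.25, 0.60],   # treatment D
])

a = rank_probs.shape[1]                        # number of treatments
cum = np.cumsum(rank_probs, axis=1)[:, :-1]    # P(rank <= k) for k = 1..a-1
sucra = cum.sum(axis=1) / (a - 1)              # mean cumulative ranking probability

for name, s in zip("ABCD", sucra):
    print(f"Treatment {name}: SUCRA = {s:.2f}")
# High SUCRA means a high probability of ranking well -- not proof of clinically
# meaningful superiority. Read ranks next to the interval estimates.
```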

Methodologist

When interventions are complex (e.g., behavioral + pharmacological), component NMA decomposes multi-component treatments to estimate the individual contribution of each component. Uses additive models: effect(A+B) = effect(A) + effect(B) + interaction.

The Cipriani Network

The 2018 Lancet analysis found that all 21 antidepressants were more effective than placebo. Amitriptyline, mirtazapine, and venlafaxine ranked highest for efficacy. Agomelatine, fluoxetine, and escitalopram ranked highest for acceptability (fewest dropouts).

No single drug "won" on all outcomes. The network revealed trade-offs invisible to pairwise analysis.

NMA Feasibility Check
You have 15 RCTs comparing 6 different statins. Some pairs have direct evidence, others do not.
Check transitivity, then fit NMA
Verify that patient populations and study designs are sufficiently similar across comparisons.
Ignore indirect evidence
Loses statistical power and leaves gaps in the evidence base.
Pool all into one pairwise comparison
Violates the structure of the evidence. Statins are different drugs.

Q1. What assumption must hold for indirect evidence to be valid in NMA?

A. Transitivity — effect modifiers are balanced across comparisons
B. Homogeneity — I² must be below 25%
C. All studies must have similar sample sizes
D. All studies must be double-blinded

Module 14 Complete

"The network sees what pairwise comparisons cannot: the whole landscape of treatment choice."

Not every signal is truth.

Module 15: The Individual

What was hidden in plain sight?

Module 15: The Individual

🎯 Learning Objectives

  • Explain why aggregate data can mask treatment–covariate interactions
  • Distinguish one-stage from two-stage IPD models
  • Recognize ecological bias in aggregate meta-analysis
  • Understand the practical challenges of IPD collection
  • Interpret treatment–covariate interaction plots

For decades, breast cancer trials

published summaries. Not patients.

The Early Breast Cancer Trialists' Collaborative Group (EBCTCG) collected individual records from over 100,000 women across hundreds of trials. Their IPD meta-analyses showed that tamoxifen benefits depend heavily on estrogen receptor status—something invisible in aggregate data.

Every published trial of tamoxifen reported an overall result. Across hundreds of studies, tamoxifen appeared to offer a modest benefit. But “modest benefit” was an average that concealed a profound truth.

The Hidden Subgroup Split

RR 0.59
ER-positive subgroup: 41% reduction in recurrence
RR 0.97
ER-negative subgroup: essentially no benefit at all

The overall pooled effect—mixing responsive and non-responsive patients—was a statistical fiction. A "modest" average that understated the benefit for one group and implied benefit where none existed for the other.

AD
Aggregate: published effect + CI only
IPD
Individual: raw patient-level records

IPD allows: (1) consistent outcome definitions, (2) subgroup analysis by patient characteristics, (3) time-to-event modelling, (4) checking for ecological bias. It is the gold standard for exploring treatment effect modification.

Researcher
1

Two-Stage

Analyze each study separately, then combine estimates (like standard MA). Simple but loses information.

2

One-Stage

Fit a single mixed-effects model to all patient data simultaneously. More powerful for interactions and rare events.

Key: Both should account for study clustering. Never pool IPD as if from one mega-trial—this introduces confounding (Simpson's paradox).
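Here is a minimal two-stage sketch with simulated patient-level data (the column names and numbers are invented). Stage 1 respects clustering by design, because each study is analyzed on its own before pooling.

```python
# Minimal two-stage IPD sketch: per-study logistic regression, then pooling.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
rows = []
for study, base in enumerate([0.10, 0.20, 0.30]):       # differing baseline risks
    for treat in (0, 1):
        p = base * (0.7 if treat else 1.0)               # true OR < 1 in every study
        rows.append(pd.DataFrame({"study": study, "treat": treat,
                                  "outcome": rng.binomial(1, p, 300)}))
df = pd.concat(rows, ignore_index=True)

# Stage 1: logistic regression within each study -> log-OR and SE.
ests, ses = [], []
for _, g in df.groupby("study"):
    fit = sm.Logit(g["outcome"], sm.add_constant(g["treat"])).fit(disp=0)
    ests.append(fit.params["treat"])
    ses.append(fit.bse["treat"])

# Stage 2: inverse-variance pooling (fixed-effect, for brevity).
w = 1 / np.array(ses) ** 2
pooled = np.sum(w * np.array(ests)) / np.sum(w)
print(f"Pooled OR = {np.exp(pooled):.2f}")
```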

Methodologist

A meta-regression using study-level mean age might show older patients benefit more. But this could be ecological bias—the study-level association does not reflect the patient-level truth. Only IPD can separate within-study from between-study effects.

When the Whole Lies About Its Parts

Simpson’s paradox: a trend that appears in aggregate data reverses when the data are grouped by a confounding variable.

The Paradox in Practice

A mega-trial analysis found Treatment X beneficial overall. But within each study, it was harmful. How? Differences in baseline risk between studies created an illusion—sicker populations happened to receive more treatment, inflating the aggregate benefit.

Cates (2002, BMJ) showed that pooling across studies without accounting for clustering can reverse the apparent direction of effect.

This is why IPD one-stage models include study as a clustering variable—to prevent between-study confounding from masquerading as treatment effect.

The EBCTCG Legacy

The EBCTCG's IPD meta-analyses have defined breast cancer treatment for 40 years. Their 2005 analysis of tamoxifen vs no treatment showed a clear benefit in ER-positive tumors (RR 0.59) but no benefit in ER-negative tumors (RR 0.97).

Without IPD, the overall aggregate effect would have been pooled across both groups— diluting the benefit and potentially denying ER-positive patients the magnitude of their gain.

Do you suspect treatment–covariate interactions?
Yes →

Can you obtain IPD from >80% of trials?

Yes → One-stage IPD meta-analysis with interaction terms
No → Two-stage: request available IPD + aggregate for the rest
No →

Is ecological bias a concern?

Yes → IPD preferred even without interactions
No → Aggregate data meta-analysis may suffice

EBCTCG collected data from hundreds of trials over 40 years. Most IPD meta-analyses involve 5–20 trials. The decision depends on the question, not the ambition.

Methodologist

Remember Module 3? HRT appeared beneficial in observational studies but harmful in RCTs. The same aggregate masking occurred: overall benefit hid subgroup harm.

IPD analysis of the Women’s Health Initiative later showed that timing mattered—women starting HRT within 10 years of menopause had different outcomes than those starting later. The “timing hypothesis” was invisible in published aggregate summaries.

The lesson recurs: aggregate data can obscure critical treatment–covariate interactions. Whether it is ER status in breast cancer or timing in HRT, the individual-level data reveals what summaries conceal.

Q1. What is the primary advantage of IPD over aggregate-data meta-analysis?

A. It always includes more studies
B. It is cheaper and faster
C. It can explore treatment–covariate interactions without ecological bias
D. It eliminates the need for random-effects models

Module 15 Complete

"Behind every pooled estimate are individuals whose stories the aggregate cannot tell."

Heterogeneity is a message, not noise.

Module 16: The Dose

Heterogeneity is a message, not noise.

Module 16: The Dose

🎯 Learning Objectives

  • Explain why simple pairwise comparisons miss dose–response relationships
  • Distinguish linear, quadratic, and spline dose–response models
  • Interpret restricted cubic splines with knots
  • Identify threshold effects and J/U-shaped curves
  • Understand model comparison with AIC/BIC

For decades, moderate drinking

appeared to protect the heart.

The "J-shaped curve" showed non-drinkers had higher cardiovascular mortality than moderate drinkers. But Stockwell et al. (2016) demonstrated that the J-curve was an artefact of misclassifying former drinkers (who quit due to illness) as "abstainers."

By 2010, over 100 observational studies had confirmed the J-curve. Medical textbooks taught it. Cardiologists cited it. Wine industry lobbyists funded conferences around it.

100+
Observational studies confirming the J-curve
15–25%
Lower cardiovascular mortality in moderate drinkers vs abstainers

The evidence seemed overwhelming. But what if the comparison group—“abstainers”—was contaminated?

The Sick Quitter

A Hidden Confounder

The Problem

People who stop drinking often do so because they are already ill—liver disease, medication interactions, cancer diagnosis. These “former drinkers” were classified as “abstainers” in most studies.

The Effect: The reference group (abstainers) appeared less healthy—not because abstinence was harmful, but because sick people had joined it.

When Stockwell et al. (2016, J Stud Alcohol Drugs) removed former drinkers and applied appropriate study-quality corrections: the J-curve disappeared. The protective effect was a phantom.

Standard meta-analysis asks: "Does treatment X work?" Dose–response meta-analysis asks: "At what dose does treatment X work best?" It models the relationship between dose level and outcome across multiple studies.

Linear
Simplest: log(RR) = β × dose
Spline
Flexible: piecewise polynomials with knots
Fractional
Polynomial: dose^p1 + dose^p2
Researcher

RCS place knots at pre-specified dose points and fit smooth polynomials between them. Typically 3–5 knots at quantiles of the dose distribution. Linear beyond boundary knots. Tests for non-linearity compare the spline model against a simpler linear model.

AIC

Model Comparison

AIC/BIC compare linear vs spline fit. Lower = better. Also test departure from linearity (p-value for spline terms).

Compare linear vs quadratic vs spline fits. Watch how the model shape changes with different assumptions.
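In place of the interactive comparison, here is a minimal sketch that builds a restricted cubic spline basis by hand and compares linear vs spline fits by AIC. The dose-logRR data are invented, and a full Greenland-Longnecker analysis would also model within-study covariance, which this sketch omits.

```python
# Minimal restricted-cubic-spline dose-response sketch (summarized data, invented).
import numpy as np

def rcs_basis(x, knots):
    """Harrell's restricted cubic spline basis (linear beyond boundary knots)."""
    x = np.asarray(x, float)
    k = np.asarray(knots, float)
    scale = (k[-1] - k[0]) ** 2
    pos3 = lambda u: np.maximum(u, 0.0) ** 3
    cols = [x]
    for j in range(len(k) - 2):
        term = (pos3(x - k[j])
                - pos3(x - k[-2]) * (k[-1] - k[j]) / (k[-1] - k[-2])
                + pos3(x - k[-1]) * (k[-2] - k[j]) / (k[-1] - k[-2]))
        cols.append(term / scale)
    return np.column_stack(cols)

# Invented summarized data: dose, log relative risk vs dose 0, standard error.
dose   = np.array([0, 5, 10, 20, 30, 45, 60], float)
log_rr = np.array([0, -0.15, -0.18, -0.10, 0.02, 0.20, 0.42])
se     = np.array([0.01, 0.06, 0.06, 0.07, 0.08, 0.10, 0.14])

def wls_aic(X, y, w):
    """Weighted least squares; AIC from the weighted Gaussian likelihood (up to a constant)."""
    Xw, yw = X * np.sqrt(w)[:, None], y * np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    ll = -0.5 * np.sum((yw - Xw @ beta) ** 2)   # same dropped constant in both models
    return 2 * X.shape[1] - 2 * ll

w = 1 / se ** 2
print("AIC linear:", round(wls_aic(dose[:, None], log_rr, w), 1))
print("AIC spline:", round(wls_aic(rcs_basis(dose, [5, 20, 45]), log_rr, w), 1))  # lower = better
```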

The Alcohol J-Curve Debunked

Stockwell's 2016 re-analysis found that when former drinkers were correctly excluded from the "abstainer" reference group, the protective effect of moderate drinking disappeared. The J-curve was driven by sick-quitter bias.

Dose–response meta-analysis revealed the truth: the shape of the curve depends critically on how you define "zero dose." The wrong reference category created a phantom benefit.

The phantom J-curve influenced alcohol guidelines worldwide:

UK

NHS Guidance (until 2016)

“Moderate drinking may protect the heart” appeared in official guidance. After Stockwell’s correction, the UK revised limits to 14 units/week for all drinkers (previously 21 for men). No amount was declared “safe.”

US

Dietary Guidelines Advisory Committee

J-curve studies were cited through 2015. The 2020 committee recommended lowering limits to 1 drink/day for men, acknowledging the reference-group bias.

AU

Australian Guidelines

Safe drinking limits were delayed by industry-funded J-curve research promoting “cardioprotective” moderate intake.

Do you have ≥3 exposure levels (not just exposed vs unexposed)?
Yes →

Is the relationship plausibly non-linear?

Yes → Restricted cubic splines (3–5 knots). Compare AIC with linear model.
No → Linear dose-response meta-regression may suffice
No →

Standard pairwise meta-analysis (no dose-response possible with only two levels)

Warning: Always check—is your reference category clean? The J-curve lesson: a contaminated reference group creates phantom non-linearity.

Q1. What makes restricted cubic splines useful in dose–response meta-analysis?

A. They always produce a straight line
B. They flexibly capture non-linear dose–response curves
C. They reduce the number of studies needed
D. They simplify the model to fewer parameters

Module 16 Complete

"The dose makes the poison. And the shape of the curve reveals whether the poison is real."

Absence of evidence is not evidence of absence.

Module 17: The Fragility

Absence of evidence is not evidence of absence.

Module 17: The Fragility

🎯 Learning Objectives

  • Calculate and interpret the fragility index
  • Use GOSH plots to identify influential studies and subset effects
  • Interpret contour-enhanced funnel plots
  • Apply Copas selection models and PET-PEESE for publication bias
  • Understand how sensitivity analyses strengthen meta-analytic conclusions

Governments stockpiled billions

based on evidence they could not see.

After H1N1, governments spent billions on oseltamivir (Tamiflu) stockpiles. The Cochrane team (Jefferson et al. 2014) fought for years to access unpublished data. When they finally did, the evidence for preventing complications evaporated.

The fragility index asks: "How many patients would need to change outcome to flip a statistically significant result to non-significant?" It iteratively adds events (converts non-events to events) in the group with fewer events until p > 0.05.

FI = 1
Extremely fragile. One patient flip changes conclusion.
FI > 8
Reasonably robust. Less sensitive to individual outcomes.

Enter a 2×2 table to calculate the fragility index. Watch events shift until significance flips.

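In place of the interactive calculator, a minimal Walsh-style sketch: flip non-events to events in the arm with fewer events until Fisher's exact p crosses 0.05. The example uses the 2×2 numbers from the quiz scenario at the end of this module.

```python
# Minimal fragility-index sketch (Walsh-style).
from scipy.stats import fisher_exact

def fragility_index(e1, n1, e2, n2, alpha=0.05):
    """e1/n1 = events/total in treatment, e2/n2 in control. Returns FI, or None
    if the result is not significant to begin with."""
    p = fisher_exact([[e1, n1 - e1], [e2, n2 - e2]])[1]
    if p >= alpha:
        return None
    flips = 0
    while p < alpha:
        if e1 <= e2:              # add an event to the arm with fewer events
            e1 += 1
        else:
            e2 += 1
        flips += 1
        p = fisher_exact([[e1, n1 - e1], [e2, n2 - e2]])[1]
    return flips

print(fragility_index(12, 200, 25, 200))   # a single-digit FI signals fragility
```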
Researcher

Graphical Display of Study Heterogeneity (GOSH) fits meta-analysis models to all possible subsets of studies. Each dot plots the pooled effect vs I² for one subset. Clusters suggest distinct subgroups; outlier clouds suggest one study driving heterogeneity.

For k studies, there are 2^k − 1 subsets. For k > 15, random sampling is used.

Researcher

Standard funnel plots show effect size vs standard error. Contour-enhanced versions add shaded regions for p < 0.01, p < 0.05, and p < 0.10. If missing studies fall in non-significant regions, publication bias becomes more plausible (not proven). If they fall in significant regions, other causes (e.g., study quality) may explain asymmetry.

Methodologist
1

Copas Selection Model

Models the probability of a study being published as a function of its SE and effect size. Jointly estimates the true effect and the selection mechanism.

2

PET-PEESE

Precision-Effect Test (PET): regress effect sizes on their standard errors. If the intercept does not differ from zero, there is no evidence of a true effect beyond small-study bias. PEESE regresses on SE² instead, which performs better when a true effect exists.
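A minimal PET-PEESE sketch using statsmodels, on simulated data built to contain small-study bias and no true effect. The 0.10 intercept threshold for switching to PEESE is one common convention, not a fixed rule.

```python
# Minimal PET-PEESE sketch with weighted least squares.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
se = rng.uniform(0.05, 0.4, 30)                  # invented standard errors
effect = 1.2 * se + rng.normal(0, se)            # pure small-study effect, no true effect

def pet_peese(effect, se):
    w = 1 / se**2
    pet = sm.WLS(effect, sm.add_constant(se), weights=w).fit()        # effect ~ SE
    if pet.pvalues[0] < 0.10:                    # intercept "significant": effect plausible
        peese = sm.WLS(effect, sm.add_constant(se**2), weights=w).fit()  # effect ~ SE^2
        return peese.params[0], "PEESE estimate"
    return pet.params[0], "PET estimate (no evidence of a true effect)"

est, which = pet_peese(effect, se)
print(f"{which}: bias-corrected effect = {est:+.3f}")
```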

The Oseltamivir Saga

The original Roche-funded meta-analysis (Kaiser 2003) showed oseltamivir reduced influenza complications by 67%. But 8 of 10 trials had never been published. After Cochrane obtained the clinical study reports, the benefit for complications fell to a non-significant 11%.

The fragility was not just statistical—it was informational. The evidence base itself was missing most of the data.

You calculated the Fragility Index. What does the number mean?
FI ≤ 3

Highly fragile. A handful of different events would reverse the conclusion. Interpret with extreme caution.

FI 4–8

Moderately fragile. Sensitive to small perturbations. Are there unpublished trials that might shift this?

FI > 8

Relatively robust. But remember: fragility is only one dimension. Publication bias can undermine even robust results.

Walsh et al. (2014, J Clin Epidemiol) found that in 399 RCTs published in top journals, the median fragility index was just 8. Over 25% had FI ≤ 3. Landmark trials influencing clinical practice were often hanging by a statistical thread.

Methodologist

The oseltamivir saga revealed three types of fragility—and the Fragility Index captures only the first.

1

Statistical Fragility (FI)

How many events flip the p-value? This is what the Fragility Index measures. It quantifies sensitivity to individual patient outcomes.

2

Informational Fragility

How much of the evidence is hidden? Eight of ten Roche oseltamivir trials were unpublished. The evidence base was structurally incomplete.

3

Analytical Fragility

How many researcher degrees of freedom could change the conclusion? Different outcome definitions, analysis populations, or statistical methods.

Callback to Module 9 (Paroxetine): Re-analysis with different outcome definitions reversed the conclusion entirely. That was analytical fragility—the FI was never calculated because the endpoint itself was disputed. A complete robustness assessment examines all three dimensions.

Q1. A trial has 200 patients per arm, 12 events in treatment, 25 in control (p=0.03). The fragility index is 3. What does this mean?

A. The effect size is exactly 3
B. Changing just 3 patient outcomes would flip the result to non-significant
C. The result is very robust with 3 confirmatory studies
D. At least 3 patients are needed for the study

Module 17 Complete

"The number that survives every attempt to break it is the number worth trusting."

Not every signal is truth.

Module 18: The Equity

Certainty must be earned, not assumed.

Module 18: The Equity

🎯 Learning Objectives

  • Identify how trial exclusion criteria create evidence gaps
  • Apply the PROGRESS-Plus framework to assess equity in evidence
  • Use PRISMA-Equity reporting guidelines
  • Understand transportability: when trial findings fail in practice
  • Design equity-sensitive search and synthesis strategies

SPRINT proved tight blood pressure control

saves lives. But whose lives?

The landmark SPRINT trial excluded patients with diabetes, prior stroke, and heart failure. Over 75% of US hypertensive patients would not have qualified. The evidence was strong but the applicability was narrow.

SPRINT enrolled 9,361 patients and proved that intensive blood pressure control (target <120 mmHg) reduced cardiovascular events by 25% (HR 0.75, 95% CI 0.64–0.89). But the inclusion criteria told a different story.

Who was excluded:

  • Diabetes — 35% of US adults with hypertension
  • Prior stroke — 8% of the hypertensive population
  • Symptomatic heart failure — 6% of hypertensive adults
  • Expected survival <3 years — the frailest patients
  • Nursing home residents — excluded entirely
  • GFR <20 mL/min — advanced kidney disease

Result: Over 75% of US adults with hypertension would NOT have qualified. The evidence was strong. But for whom?

Where the Evidence Comes From

78%

of cardiovascular mega-trial participants came from high-income countries (2000–2020).

6%

from sub-Saharan Africa — where cardiovascular disease is rising fastest.

Polypill trials: 4 of 5 were conducted in populations with mean BMI <25. The US mean BMI is 30. Drug metabolism, comorbidity patterns, healthcare access, and genetic variation all differ across populations. Efficacy in one population does not guarantee effectiveness in another.

Reference: Multinational trials and the PROGRESS-Plus gap

P
Place of residence
R
Race / ethnicity
O
Occupation
G
Gender / sex
R
Religion
E
Education
S
SES (socioeconomic)
S
Social capital

Plus: Age, disability, sexual orientation, other vulnerable groups.

Researcher

PRISMA-Equity extends PRISMA to require reporting of how equity was addressed in the review: population characteristics, subgroup analyses by disadvantage, and assessment of applicability to underserved populations.

Transportability: Trial efficacy does not equal real-world effectiveness. Methods exist to re-weight trial data to match the target population distribution.

Researcher

Transportability = Can results from trial population X be applied to target population Y? This is not a philosophical question—it has formal methods.

1

Inverse Probability of Participation Weighting (IPPW)

Re-weights trial participants so they resemble the target population on key covariates (a worked sketch follows this list).

2

Generalizability Index

Quantifies how similar the trial sample is to the target population on observed characteristics.
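A minimal IPPW sketch with simulated covariates; "age" and "diabetes" are invented stand-ins, and a real analysis needs richer covariates plus overlap diagnostics.

```python
# Minimal transportability sketch: inverse probability of participation weighting.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
# Trial sample (S=1) is younger and healthier than the target population (S=0).
X_trial  = np.column_stack([rng.normal(62, 8, 2000), rng.binomial(1, 0.15, 2000)])
X_target = np.column_stack([rng.normal(68, 11, 5000), rng.binomial(1, 0.35, 5000)])

X = np.vstack([X_trial, X_target])
S = np.r_[np.ones(len(X_trial)), np.zeros(len(X_target))]

model = LogisticRegression().fit(X, S)        # P(in trial | covariates)
p = model.predict_proba(X_trial)[:, 1]
weights = (1 - p) / p                         # odds of NOT being in the trial

# Applying `weights` when averaging patient-level effects makes the trial
# sample resemble the target population on the measured covariates.
print("Mean transport weight:", weights.mean().round(2))
```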

Stuart et al. (2015, Stat Med): When SPRINT results were re-weighted to match the US hypertensive population, the estimated benefit was attenuated — HR 0.82 (vs 0.75 in-trial). The treatment still works. But the magnitude changes when the population changes.

SPRINT and the Missing Majority

SPRINT was a well-designed trial of 9,361 patients. Its finding (HR 0.75 for intensive vs standard BP control) changed guidelines worldwide. But subsequent analyses showed the benefit was strongest in the subgroup most like the trial population—and uncertain for excluded groups.

Equity in evidence synthesis means asking not just "Does it work?" but "For whom does it work?"

ROOT: Does your review's evidence come from populations similar to your target?

YES → Good. But check: Are subgroups (age, sex, ethnicity, SES) reported separately?

  • Yes: Use subgroup effects for population-specific recommendations
  • No: Flag as limitation — equity gap in reporting

NO → Does PROGRESS-Plus analysis reveal differential effects?

  • Yes: Population-specific recommendations needed. Consider transportability re-weighting.
  • No: Cautious generalization with explicit equity statement in discussion
Methodologist

Callback: The HRT Lesson Revisited

Remember Module 3? The HRT story showed that healthy-user bias made a harmful treatment appear beneficial. SPRINT may have the opposite problem — the “healthy volunteer” effect may make an effective treatment appear more effective than it would be in the real world.

Every meta-analysis should ask: Who was included? Who was excluded? And does that matter?

Q1. What does the PROGRESS-Plus framework help reviewers assess?

A. Statistical heterogeneity
B. Equity and applicability across disadvantaged populations
C. Internal validity of included studies
D. Overall certainty of evidence

Module 18 Complete

"Evidence that excludes the vulnerable cannot claim to serve them."

Not every signal is truth.

Module 19: The Machine

The number without provenance is not a number.

Module 19: The Machine

🎯 Learning Objectives

  • Describe how AI/ML is used in systematic review screening
  • Explain active learning and human-in-the-loop workflows
  • Assess automation validation: recall, workload savings, and risk
  • Recognize the limitations and biases of algorithmic screening
  • Apply frameworks for responsible AI use in evidence synthesis

When COVID-19 hit,

papers arrived faster than humans could read.

By 2021, over 300,000 COVID papers existed. Cochrane used machine learning classifiers to triage studies for their rapid reviews—reducing screening workload by up to 70% while maintaining >95% recall.

By April 2020, 4,000 COVID preprints appeared every week.

PubMed indexed 500 new COVID articles per day.

Cochrane's screening queue hit 10,000 unreviewed titles.

🔍 The Mathematics of Impossibility

A pair of reviewers screens ~200 titles per day.

At 500 new articles/day, they fell further behind with every hour.

The living review was dying before it could live.

The idea was not new. Cohen et al. (2006, JAMIA) first showed that machine learning could reduce screening workload by 50%—with less than 5% loss in recall.

📅
2006: Cohen et al. — SVM classifiers for drug class reviews. Proof of concept.
📅
2016: RobotReviewer (Marshall et al.) — ML for risk of bias assessment. Inter-rater reliability comparable to human reviewers.
📅
2021: ASReview (van de Schoot et al., Nature Machine Intelligence) — active learning that achieved up to 95% workload reduction in simulation studies.

But simulation is not reality. COVID would be the first true test at scale.

1

Screening Prioritization

Active learning ranks citations by relevance. Reviewers screen the most likely relevant first.

2

Data Extraction Assist

NLP extracts PICO elements, outcomes, and results. Always requires human verification.

3

Risk of Bias Assessment

ML classifiers predict RoB domains. Experimental—human judgment remains gold standard.

Researcher
Recall
>95% required. Missing 1 study can change conclusions.
WSS@95%
Work Saved over Sampling at 95% recall.
Stopping
When to stop screening? Consecutive irrelevant threshold.

The fundamental tension: Automation saves time but introduces a new source of error. Always report the tool, version, training data, and stopping criteria.
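A minimal sketch of how recall and WSS (work saved over sampling) are computed on a dual-screened holdout sample; the arrays below are invented.

```python
# Minimal screening-automation metrics sketch.
import numpy as np

def screening_metrics(relevant, flagged):
    """relevant: human gold standard; flagged: machine says 'review this'."""
    tp = np.sum(relevant & flagged)
    fn = np.sum(relevant & ~flagged)
    tn = np.sum(~relevant & ~flagged)
    recall = tp / (tp + fn)
    # Work Saved over Sampling at recall R: fraction of records the humans
    # never review, minus the (1 - R) that random sampling would save anyway.
    wss = (tn + fn) / len(relevant) - (1 - recall)
    return recall, wss

relevant = np.zeros(1000, bool); relevant[:40] = True   # 40 truly relevant records
flagged = relevant.copy()
flagged[38:40] = False                                   # machine misses 2 of them
flagged[40:240] = True                                   # plus 200 false alarms
recall, wss = screening_metrics(relevant, flagged)
print(f"recall = {recall:.2%}, WSS = {wss:.2%}")         # 95% recall -> WSS@95%
```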

🔍 The Paradox of Validation

To know if the machine missed a relevant study, you need a human to screen everything.

But if humans screen everything, why use the machine?

The solution: prospective holdout validation.

  • Random 10% sample screened by both human and machine
  • Compare: did the machine miss what the human found?
  • If recall drops below 95%, retrain and expand human screening

Trust, but verify. The machine earns its role—it does not inherit it.

Cochrane's COVID Response

Cochrane built the COVID-19 Study Register using machine learning classifiers trained on millions of records. The system achieved 99% sensitivity while reducing manual screening from weeks to days.

But the machine was a tool, not a replacement. Every included study was still verified by human reviewers. The lesson: AI augments the reviewer, not replaces them.

In June 2020, the RECOVERY trial published its dexamethasone results—the first treatment proven to reduce COVID mortality (28-day mortality: 22.9% vs 25.7%, RR 0.83).

The preprint appeared on medRxiv with a non-standard title. Scenarios like this occurred repeatedly during the pandemic: ML classifiers, trained on existing terminology, ranked unfamiliar framings low.

In several living reviews, human reviewers scanning flagged titles recognized key drug names and escalated studies that classifiers had deprioritized.

Without those humans, landmark treatment findings might have waited weeks to enter the living review.

The machine reads faster. The human reads deeper. Neither is sufficient alone.

Your review will screen more than 5,000 titles?
Yes → Consider AI-assisted screening

Active learning prioritization. Dual-screen random 10% holdout. Stop when 3 consecutive batches yield 0 relevant studies.

Report: classifier type, training data, recall on holdout, stopping rule.

No → Manual screening is feasible

For <5,000 titles, dual human screening remains gold standard. AI adds complexity without proportionate benefit.

Is this a living or rapid review?

If yes → AI is especially valuable. Continuous classifier retraining on new evidence. But: never let the machine make the final inclusion decision.

Methodologist

Remember Module 6? Poldermans fabricated DECREASE data that guided perioperative beta-blocker guidelines for a decade.

AI can now detect statistical anomalies automatically:

  • GRIM test: Are reported means consistent with integer sample sizes?
  • SPRITE: Can the reported summary statistics be reconstructed from plausible individual data?
  • Statcheck: Do reported p-values match the test statistics?

These tools found anomalies in hundreds of published papers—faster than any human auditor.

But the machine flags. The human judges. The decision to retract remains profoundly human.
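To show how mechanical the first of those checks is, here is a minimal GRIM-style sketch, a simplified version of the published test using invented means: with n integer-valued responses, only certain means are arithmetically possible.

```python
# Minimal GRIM-style consistency check (simplified, illustrative).
def grim_consistent(mean, n, decimals=2):
    """True if some integer total / n rounds to the reported mean."""
    target = round(mean, decimals)
    for total in (int(mean * n), int(mean * n) + 1):   # bracket mean * n
        if round(total / n, decimals) == target:
            return True
    return False

print(grim_consistent(5.19, 28))   # False: 145/28 = 5.18, 146/28 = 5.21
print(grim_consistent(5.18, 28))   # True:  145/28 rounds to 5.18
```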

Q1. What is the minimum acceptable recall for AI-assisted screening in systematic reviews?

A. 80%
B. 90%
C. >95%
D. 100%

Module 19 Complete

"The machine reads faster. The human reads deeper. Together, they read truth."

Not every signal is truth.

Module 20: The Qualitative

Methods protect patients from our confidence.

Module 20: The Qualitative

🎯 Learning Objectives

  • Explain why some questions require qualitative evidence synthesis
  • Describe meta-ethnography (Noblit & Hare) and thematic synthesis
  • Apply the CERQual framework to assess confidence in qualitative findings
  • Understand mixed-methods synthesis approaches
  • Recognize when qualitative evidence changes practice

The WHO asked a question

no RCT could answer.

Why do women worldwide experience disrespect and abuse during childbirth? Bohren et al. (2015) synthesized 65 qualitative studies from 34 countries into a framework of seven domains of mistreatment.

In 2014, the WHO convened a panel to address a global crisis: women were being physically abused, verbally humiliated, and denied care during childbirth. This was not a rare event — reports came from 34 countries.

They needed to understand WHY. What drives disrespect and abuse in maternity care?

No RCT could answer this. You cannot randomize women to abusive vs respectful care. You cannot blind birth attendants. You cannot measure “dignity” on a Likert scale. The evidence had to be qualitative.

Developed by Noblit & Hare (1988), meta-ethnography translates concepts across studies rather than aggregating numbers. It produces new interpretive frameworks (third-order constructs) from first-order (participant quotes) and second-order (author interpretations) data.

Reciprocal
Studies confirm each other
Refutational
Studies contradict each other
Line of argument
Studies build a new theory

What Bohren Found: A Taxonomy of Mistreatment

1. Physical abuse

Hitting, pinching, slapping during labor

2. Sexual abuse

Inappropriate touching, non-consensual procedures

3. Verbal abuse

Shouting, threats, judgmental comments

4. Stigma & discrimination

Based on HIV status, ethnicity, age, poverty

5. Professional standards failure

Neglect, lack of informed consent

6. Poor rapport

Poor communication, dismissiveness

7. Health system conditions

Overcrowding, understaffing, lack of supplies

65 studies. 34 countries. The same patterns repeated across languages, cultures, and systems. This was not anecdote. This was synthesized evidence.

Researcher

CERQual assesses confidence in qualitative review findings across four components:

1

Methodological Limitations

Quality of contributing studies.

2

Coherence

How well data supports the finding.

3

Adequacy

Richness of data (not just number of studies).

4

Relevance

Applicability to the review question context.

Bohren's synthesis informed the WHO's 2018 Recommendations on Intrapartum Care for a Positive Childbirth Experience. Specific changes grounded in qualitative evidence:

Rec. 15
Companionship during labor
Rec. 1
Respectful maternity care
Rec. 3
Effective communication
Rec. 12
Emotional support

These recommendations — grounded in qualitative evidence — now guide maternity care in 194 WHO member states. No forest plot could have produced them. No I² statistic could have revealed them.

Bohren's Framework of Mistreatment

The 2015 qualitative synthesis identified seven domains: physical abuse, sexual abuse, verbal abuse, stigma and discrimination, failure to meet professional standards, poor rapport, and health system conditions. This framework informed the WHO Recommendations on intrapartum care (2018).

No p-value could capture the experience of being slapped during labor. Qualitative synthesis gave voice to what numbers could not.

ROOT: Is your research question about experiences, perceptions, barriers, or facilitators?

YES → Is your question about the HOW or WHY, not just the WHETHER?

  • Yes: Qualitative evidence synthesis (meta-ethnography, thematic synthesis, or framework synthesis)
  • No: Consider mixed-methods: quantitative for effect + qualitative for mechanism

NO → Is your question about effectiveness/efficacy?

  • Yes: Quantitative meta-analysis
  • But: Complement with qualitative review of implementation barriers (CERQual-assessed)

Key insight: The strongest systematic reviews answer BOTH: Does it work? (quantitative) AND Why does it work or fail? (qualitative)

Q1. What distinguishes meta-ethnography from quantitative meta-analysis?

A. It only includes 3–5 studies
B. It translates concepts across studies rather than pooling numbers
C. It does not require a systematic search
D. It is less rigorous than quantitative synthesis

Module 20 Complete

"Not everything that counts can be counted. Not everything counted counts."

Heterogeneity is a message, not noise.

Module 21: The Multivariate

Heterogeneity is a message, not noise.

Module 21: The Multivariate

🎯 Learning Objectives

  • Recognize when outcomes within a study are correlated
  • Explain multivariate random-effects models
  • Apply robust variance estimation (RVE) for dependent effect sizes
  • Understand three-level models for nested data
  • Choose between multivariate approaches based on data structure

Cardiovascular trials report

mortality, MI, stroke, and more.

These outcomes are correlated within patients. A patient who dies cannot have an MI endpoint. Standard meta-analysis treats each outcome independently—ignoring the dependence and potentially double-counting evidence.

Open any standard meta-analysis textbook. The models assume each study contributes one independent effect size. But reality is different.

A single cardiovascular trial reports mortality, myocardial infarction, stroke, and revascularization. A single psychotherapy study reports depression, anxiety, and quality of life at 3, 6, and 12 months.

30 trials
× 4 outcomes
= 120 effect sizes

Most analysts either: (a) treat all 120 as independent (overstating precision, by up to a factor of √4 = 2 if the outcomes were perfectly correlated), or (b) pick one outcome and discard the rest. Both approaches are wrong.

In standard pairwise meta-analysis, each study contributes one effect size. But many studies report multiple outcomes, subgroups, timepoints, or arms—creating dependent effect sizes. Ignoring this inflates precision and distorts inference.

RVE
Robust Variance Estimation. Sandwich estimator handles unknown correlation.
3-Level
Study → Outcome nesting modeled explicitly.
Researcher

RVE (Hedges, Tipton & Johnson, 2010) uses a sandwich-type estimator that provides valid standard errors regardless of the true correlation between dependent effects. No need to know or estimate within-study correlation. Best for ≥20 studies.

Small-sample correction: Tipton & Pustejovsky (2015) developed small-sample corrections (CR2) for RVE, using Satterthwaite degrees of freedom when the number of clusters is small.
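To make the mechanics concrete, here is a minimal numpy sketch of a cluster-robust (sandwich) variance for an intercept-only model, using made-up effect sizes. The m/(m − 1) factor below is a simple CR1-style adjustment, not the full CR2 correction; a real analysis would rely on an established implementation (e.g., the robumeta or clubSandwich packages in R).

```python
# Hedged sketch: cluster-robust (sandwich) variance for an
# intercept-only meta-analysis with multiple effects per study.
# All numbers below are made up for illustration.
import numpy as np

yi = np.array([0.30, 0.25, 0.41, 0.12, 0.18, 0.22, 0.35, 0.29])  # effect sizes
vi = np.array([0.04, 0.05, 0.04, 0.06, 0.05, 0.04, 0.03, 0.05])  # sampling variances
study = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # cluster IDs: 2 effects per study

w = 1.0 / vi                           # inverse-variance weights
beta = np.sum(w * yi) / np.sum(w)      # pooled estimate
resid = yi - beta

# Sandwich variance: square the *cluster-level* weighted residual sums,
# so correlated effects within a study are not counted as independent.
clusters = np.unique(study)
m = len(clusters)
scores = np.array([np.sum((w * resid)[study == g]) for g in clusters])
var_robust = (m / (m - 1)) * np.sum(scores**2) / np.sum(w)**2  # CR1-style

print(f"beta = {beta:.3f}")
print(f"naive SE  (8 'independent' effects) = {np.sqrt(1 / np.sum(w)):.3f}")
print(f"robust SE (4 clusters)              = {np.sqrt(var_robust):.3f}")
```

The robust SE is typically larger than the naive SE here, because the naive calculation pretends the eight effects carry eight studies' worth of information.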

Researcher

What Dependence Does to Your Confidence Intervals

When m effects from the same study are treated as independent, the standard error is understated by a factor of √(1 + (m − 1)ρ), where ρ is the within-study correlation. If 4 outcomes share ρ = 0.5:

Treating as independent

CI width = X

Accounting for dependence

CI width = √(1 + 3 × 0.5) × X ≈ 1.58X

Your confidence interval should be about 58% wider. Every meta-analysis that ignored this published falsely precise results.
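The widening factor follows directly from the variance of a mean of correlated estimates. A one-line sketch, using the values from the panel above:

```python
# Sketch: how much wider the CI should be when m effects per study
# share a common within-study correlation rho.
import math

def widening(m: int, rho: float) -> float:
    # SE(dependent) / SE(assumed independent) = sqrt(1 + (m - 1) * rho)
    return math.sqrt(1 + (m - 1) * rho)

print(f"{widening(4, 0.5):.2f}x")  # 1.58x: the example above
print(f"{widening(4, 1.0):.2f}x")  # 2.00x: the sqrt(4) worst case
```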


Researcher
1

Level 1: Sampling Variance

Measurement error within each effect size estimate.

2

Level 2: Within-Study Variance

Outcomes and timepoints vary within a single study.

3

Level 3: Between-Study Variance

Studies differ from each other in populations, settings, and methods.

Example: In a meta-analysis of psychotherapy for depression (k=50 studies, 180 effect sizes), 35% of variance was within-study (different outcomes) and 65% was between-study (different therapies, populations). This decomposition reveals how much heterogeneity is within vs between studies.

Methodologist

When effects are nested (e.g., multiple outcomes within studies, or studies within research groups), a three-level model partitions variance into: (1) sampling variance (level 1), (2) within-study variance (level 2), and (3) between-study variance (level 3). This maintains correct inference while borrowing strength across levels.
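As a sketch of that decomposition, with hypothetical variance components chosen to mirror the 35%/65% psychotherapy example above:

```python
# Sketch: partitioning heterogeneity in a three-level model.
# tau2 values are hypothetical, chosen to match the 35%/65% example.
tau2_within = 0.035    # level 2: outcomes/timepoints within studies
tau2_between = 0.065   # level 3: differences between studies
total = tau2_within + tau2_between

print(f"within-study share:  {tau2_within / total:.0%}")   # 35%
print(f"between-study share: {tau2_between / total:.0%}")  # 65%
```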

The Cardiovascular Challenge

A meta-analysis of statins might include 30 trials, each reporting mortality, MI, stroke, and revascularization. That is 120 effect sizes from 30 clusters. Treating them as 120 independent estimates inflates precision by a factor related to the within-study correlation.

RVE or multivariate models handle this correctly—producing wider, honest confidence intervals.

ROOT: Does your meta-analysis have multiple effects per study?

YES → Do you know (or can estimate) the within-study correlations?

  • Yes: Multivariate random-effects model (most efficient)
  • No: RVE with small-sample correction (robust to unknown correlations)

NO → Standard univariate random-effects model

Sub-question: Are your multiple effects from different outcomes, timepoints, or subgroups?

  • Different outcomes → Three-level model or RVE with clustering
  • Different timepoints → Network of timepoints with temporal correlation
  • Different subgroups → Consider if subgroups are meaningful or should be averaged

Q1. What problem does Robust Variance Estimation (RVE) solve?

A. Publication bias
B. Dependency between multiple effect sizes from the same study
C. Between-study heterogeneity
D. Small-study effects

Module 21 Complete

"When outcomes are entangled, pretending they are independent is a lie of convenience."

The number without provenance is not a number.

Module 22: The Proof

🎯 Learning Objectives

  • Understand how computational errors propagate through policy
  • Define reproducibility and distinguish from replicability
  • Apply evidence hashing and proof-carrying numbers
  • Use reproducibility checklists for meta-analysis
  • Recognize the role of pre-registration and open data

A graduate student opened a spreadsheet

and found the austerity era was built on an error.

In 2010, Reinhart and Rogoff claimed countries with >90% debt-to-GDP ratios had negative growth. This influenced austerity policies across Europe. In 2013, Thomas Herndon found an Excel error that excluded 5 countries from the average. The corrected result: modest positive growth, not collapse.

Reproducible
Same data + same code = same result
Replicable
New data + same methods = consistent result

Reproducibility is the minimum standard. If others cannot reproduce your pooled estimate from your reported data, the analysis cannot be verified. Meta-analyses should share: extracted data, analysis scripts, software versions, and random seeds.
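A minimal reproducibility header along those lines, assuming a numpy-based analysis script; the seed value is arbitrary:

```python
# Sketch: record the environment and fix randomness at the top of
# every analysis script, so others can reproduce the pooled estimate.
import platform
import random
import sys

import numpy as np

SEED = 42  # arbitrary; fix once and report it
random.seed(SEED)
rng = np.random.default_rng(SEED)  # pass rng into any resampling steps

print(f"python {sys.version.split()[0]} on {platform.platform()}")
print(f"numpy  {np.__version__}")
print(f"seed   {SEED}")
```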

Researcher

Every number in a meta-analysis should carry its provenance: where it came from, how it was transformed, and what code produced it. Evidence hashing creates a cryptographic fingerprint of inputs so any change (accidental or deliberate) is detectable.

SHA

Input Hash

SHA-256 hash of extracted data. If one cell changes, the hash changes. Provenance chain: data → code → result → hash.
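A minimal sketch of such a fingerprint using Python's standard library; extracted_data.csv is a hypothetical file name:

```python
# Sketch: SHA-256 fingerprint of the extracted-data file. If a single
# cell changes, the hexdigest changes, making tampering detectable.
import hashlib
from pathlib import Path

def sha256_of(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

data_hash = sha256_of("extracted_data.csv")  # hypothetical file
print(f"input hash: {data_hash}")

# Later, before reusing or re-publishing the result:
assert sha256_of("extracted_data.csv") == data_hash, "data changed!"
```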

Assess a meta-analysis against each item of a reproducibility checklist: shared extracted data, analysis scripts, documented software versions, and fixed random seeds. How does your review score?

The Excel Error That Changed Economies

Reinhart-Rogoff's "Growth in a Time of Debt" was cited in Congressional testimony, European Commission reports, and IMF policy briefs. The Excel error (rows 30–34 were excluded from an AVERAGE formula) meant five countries—Australia, Austria, Belgium, Canada, and Denmark—were simply missing.

The corrected average went from −0.1% to +2.2%. Austerity policies affected millions. Reproducibility is not academic perfectionism—it is a safeguard against catastrophe.
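To see how an averaging range can flip a conclusion, here is a toy example with entirely made-up growth rates (not the Reinhart-Rogoff data):

```python
# Toy illustration, made-up numbers: a range that starts five rows too
# late excludes the five highest-growth countries and flips the sign.
growth = [8.0, 6.5, 6.0, 5.5, 5.0,     # the five rows the bad range skips
          3.0, 2.0, 1.5, 1.0, 0.5,
          0.0, -0.5, -1.0, -1.5, -2.0,
          -2.5, -3.0, -3.5, -4.0, -5.0]

bad = sum(growth[5:]) / len(growth[5:])  # like an AVERAGE over a short range
good = sum(growth) / len(growth)

print(f"bad average:  {bad:+.1f}%")   # -1.0%: "high debt kills growth"
print(f"good average: {good:+.1f}%")  # +0.8%: the conclusion reverses
```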

Remember Module 5?

DECREASE Through the Lens of Reproducibility

The DECREASE trials by Don Poldermans were retracted for fabricated data. If proof-carrying numbers had existed—hashed inputs, provenance chains, verified computations—the fabrication would have been detectable before the evidence entered meta-analyses and changed surgical guidelines.

Q1. What was the Reinhart-Rogoff error?

A. They used too small a sample
B. An Excel formula excluded 5 countries, reversing the conclusion
C. They studied the wrong time period
D. They used the wrong statistical test

Module 22 Complete

"The number without provenance is not a number. The analysis without reproducibility is not evidence."

Certainty must be earned, not assumed.

Module 23: Your First Meta-Sprint

🎯 Learning Objectives

  • Understand the 40-day systematic review workflow
  • Map the Seven Principles to real practice phases
  • Recognize Definition-of-Done (DoD) gates as quality checkpoints
  • Appreciate why structure prevents the failures you've studied
  • Graduate ready to conduct (not just understand) meta-analysis

You have learned the stories.

Now you must walk the path.

Every evidence reversal you studied happened because teams knew the methods but didn't follow them systematically.

A 40-day structured workflow with 5 phase gates. Each gate is a Definition-of-Done (DoD) checkpoint that prevents you from moving forward until quality is assured.

40
Days to Completion
5
DoD Phase Gates
Day 34
Hard Freeze

Why 40 days? Long enough for rigor, short enough to prevent scope creep. The rosiglitazone cardiac signals were buried for years because there was no deadline forcing transparency.

A

DoD-A: Protocol Lock (Days 1-3)

PICOS defined, timepoint rules set, model choices pre-specified. No moving target.

B

DoD-B: Search Lock (Days 6-10)

All databases searched, grey literature checked, PRESS validated. No hidden studies.

C

DoD-C: Extraction Lock (Days 10-28)

Dual extraction, provenance linked, RoB assessed. No fabricated numbers.

D

DoD-D: Analysis Lock (Days 21-33)

Forest plots generated, sensitivity analyses run, heterogeneity explored. No cherry-picking.

E

DoD-E: Submission Lock (Days 33-40)

GRADE certainty rated, clinical summary written, manuscript finalized. No overconfidence.

Day 34 Freeze: No new studies can be added after Day 34. This prevents the "weaponized scope creep" that plagued the BMP spine surgery meta-analyses, where industry kept "finding" favorable studies.

Every principle you learned maps to a specific phase gate:

DoD-A "Not every signal is truth" — Pre-specify what counts as evidence
DoD-B "What was hidden in plain sight?" — Search comprehensively
DoD-C "The number without provenance is not a number" — Link every data point
DoD-D "Heterogeneity is a message, not noise" — Investigate, don't ignore
DoD-E "Certainty must be earned, not assumed" — GRADE everything

The Red-Team Principle

Your own team tries to break your work.

Every day, two rotating team members spend 12 minutes checking data quality as adversaries. This is how Boldt's fraud was caught—not by friendly review, but by skeptical checking that noticed impossible recruitment rates.

What happens when you discover a critical problem mid-sprint?

CondGO = Conditional Go

A bounded rescue protocol. You have exactly 72 hours to fix the problem using only allowed actions. If you can't fix it, you must stop the review.

📖 The Avandia lesson: GSK saw cardiovascular signals in 2000 but had no forced deadline. They "watched and waited" for 7 years. Tens of thousands were harmed. CondGO exists because "we'll deal with it eventually" kills people.

You began this course with stories.

You end it ready for practice.

The META-SPRINT workflow takes everything you've learned and structures it into a 40-day system that prevents the failures you've studied.

When you're ready to conduct a real systematic review, open the META-SPRINT application. The stories you've learned here will guide you—appearing as reminders at every step.

What does it look like when every principle is followed?

REAL DATA

The Cholesterol Treatment Trialists' (CTT) Collaboration is the gold standard of meta-analysis. They obtained individual patient data from 170,000+ participants across 26 statin trials. Pre-specified protocol. IPD from all major trials. Standardized outcomes. Result: statins reduce major vascular events by 21% per mmol/L of LDL reduction (RR 0.79, 95% CI 0.77–0.81), regardless of baseline risk. Replicated across 5 meta-analyses over 15 years, this finding is estimated to have prevented millions of heart attacks and strokes worldwide.

The Seven Principles Applied

The CTT story shows what happens when every principle from this course is followed. Consider the alternative:

PATH A: Without the Principles

No protocol. Published data only. No RoB. No heterogeneity investigation. No GRADE. Conflicting small trials. Statin controversy persists. Millions untreated.

OUTCOME: Preventable cardiovascular deaths continue

PATH B: The CTT Way

Pre-registered protocol. IPD from all trials. Standardized outcomes. Transparent methods. GRADE High certainty. Definitive answer. Global guidelines change. Statins prescribed to those who benefit.

OUTCOME: Millions of lives saved by rigorous evidence synthesis

THE REVELATION

Every principle in this course exists because its absence caused harm. The CTT Collaboration proves that when methods are rigorous, when data has provenance, when bias is assessed and certainty is earned, meta-analysis becomes the most powerful tool in medicine. You now carry these principles. Use them.

1. What is the purpose of the Day 34 "hard freeze" in META-SPRINT?

A. To allow time for peer review
B. To prevent late-added studies from manipulating results
C. To speed up publication
D. To coordinate with journal deadlines

2. The CondGO protocol gives teams how long to fix critical problems?

A. 24 hours
B. 48 hours
C. 72 hours
D. 1 week

3. Red-team adversarial QA caught Joachim Boldt's fraud by noticing:

A. Impossible patient recruitment rates
B. p-hacking in statistical tests
C. Inconsistent effect sizes
D. Whistleblower testimony

The stories you've learned are not history.

They are warnings that guard your future work.

When you conduct your first meta-analysis,
remember CAST before you trust a signal,
remember Poldermans before you skip provenance,
remember Reboxetine before you ignore the funnel.

You are now ready. Go with structure. Go with humility. Go with the Seven Principles.

Not every signal is truth.

Certainty must be earned, not assumed.

Module 24: Final Examination

Test your mastery of meta-analysis principles. Each question addresses a core concept from the course.

Q1. A researcher wants to study "the effects of exercise on health." What is the PRIMARY problem with this research question?

A. It lacks randomization
B. Sample size is too small
C. It is not answerable—lacks specific PICO elements
D. It lacks ethical approval

Q2. A funnel plot shows pronounced asymmetry with missing studies in the lower-left region. What does this suggest?

A. Large studies have more precise estimates
B. Small negative studies are likely unpublished
C. The true effect is stronger than estimated
D. Random sampling error

Q3. A meta-analysis reports I² = 85% and τ² = 0.42. What is the MOST appropriate interpretation?

A. There is an 85% chance of a true effect
B. The effect size is very large
C. Substantial between-study variance exists; investigate sources
D. The results are clinically important

Q4. In GRADE, what is the starting certainty for a body of evidence from randomized controlled trials?

A. High
B. Moderate
C. Low
D. Very low

Q5. In RoB 2.0, which domain assesses whether outcome assessors knew the treatment allocation?

A. D1: Randomization process
B. D2: Deviations from intended interventions
C. D3: Missing outcome data
D. D4: Measurement of the outcome

Q6. The CAST trial showed that antiarrhythmic drugs increased mortality despite suppressing arrhythmias. This is an example of:

A. Random sampling error
B. Surrogate outcome failure
C. Confounding by indication
D. Reverse causation

Q7. When should a random-effects model be preferred over a fixed-effect model?

A. When sample sizes are large
B. When outcomes are binary
C. When between-study heterogeneity is expected
D. When publication bias is suspected

Q8. According to ICEMAN criteria, which makes a subgroup analysis MORE credible?

A. Hypothesis specified a priori
B. Large number of subgroups tested
C. No biological rationale
D. Inconsistent effects across trials within subgroup

Q9. What assumption must be checked in network meta-analysis to ensure valid indirect comparisons?

A. All studies have equal sample sizes
B. All studies measure the same outcome
C. Transitivity (consistency of effect modifiers)
D. Double-blinding in all trials

Q10. In Trial Sequential Analysis (TSA), what does crossing the futility boundary indicate?

A. The treatment causes harm
B. Further studies are unlikely to show a meaningful effect
C. The evidence is conclusive for benefit
D. The meta-analysis is underpowered

Part 1 Complete — continue to Part 2 (Advanced Modules)

Questions 11–20 cover Modules 13–22 (Bayesian, NMA, IPD, Dose-Response, Fragility, Equity, AI, Qualitative, Multivariate, Reproducibility).

Q11. In Bayesian meta-analysis, what happens when you use a vague prior with many studies?

A. The posterior closely matches the frequentist result
B. The prior dominates the posterior
C. The credible interval becomes infinitely wide
D. The model fails to converge

Q12. In Cipriani's antidepressant NMA, why was no single drug declared "the winner"?

A. Too few studies
B. Different drugs ranked best on different outcomes
C. No indirect evidence was available
D. SUCRA could not be calculated

Q13. Why should you never pool IPD as if from one mega-trial?

A. IPD always has fewer studies than aggregate
B. It ignores study clustering and introduces confounding
C. It cannot handle time-to-event data
D. Binary outcomes cannot be pooled

Q14. What caused the alcohol "J-curve" to disappear in Stockwell's reanalysis?

A. New studies were added that showed no benefit
B. Former drinkers were correctly removed from the abstainer reference group
C. The sample size was increased
D. Better adjustment for confounders

Q15. In the oseltamivir saga, what did Cochrane discover when accessing unpublished clinical study reports?

A. The drug was completely ineffective
B. The effect was larger than originally thought
C. The benefit for complications largely disappeared
D. Side effects were more common than reported

Q16. What percentage of US hypertensive patients would NOT have qualified for the SPRINT trial?

A. About 25%
B. About 50%
C. Over 75%
D. Nearly 100%

Q17. Why is AI considered an "augmenter" rather than a "replacer" in systematic reviews?

A. AI is slower than human reviewers
B. AI has perfect recall
C. AI screens fast but cannot make human-level contextual judgments
D. AI is too expensive for most reviews

Q18. What does the "adequacy" component of CERQual assess?

A. The number of studies only
B. The richness and quantity of data supporting the finding
C. Consistency of findings across studies
D. Generalizability to other populations

Q19. A meta-analysis includes 30 statin trials, each reporting 4 correlated outcomes (120 effect sizes). Which approach is correct?

A. Treat all 120 as independent effect sizes
B. Use RVE with small-sample correction
C. Pick only one outcome per study
D. Average the 4 outcomes within each study

Q20. In the Reinhart-Rogoff error, what was the corrected average growth rate for high-debt countries?

A. −0.1% (same as claimed)
B. +2.2%
C. 0%
D. +5%

Passing Score: 15/20 across both parts

Review any missed questions by returning to the relevant module. Each question tests a core concept.

Not every signal is truth.

Methods protect patients from our confidence.

Congratulations

You have completed Evidence Reversal: A Meta-Analysis Course.

May your synthesis be guided by truth, your pooling by wisdom,
and your conclusions by humility.

The Seven Principles:

"Not every signal is truth."

"Methods protect patients from our confidence."

"What was hidden in plain sight?"

"The number without provenance is not a number."

"Heterogeneity is a message, not noise."

"Absence of evidence is not evidence of absence."

"Certainty must be earned, not assumed."

"Guide us to the Straight Path..."