Have you not heard of the machine that reads
ten thousand abstracts in an hour,
that extracts data while you sleep,
that promises to free you from drudgery?
The AI Revolution in Evidence Synthesis
67% workload reduction with AI screening
95% recall achievable with active learning
10x faster screening than manual
THE PROMISE
AI can screen abstracts, extract data, assess risk of bias, and monitor for new evidence—if used correctly.
When AI Fails in Healthcare
IBM WATSON ONCOLOGY, MD ANDERSON, 2013-2017
In 2013, MD Anderson Cancer Center partnered with IBM Watson to revolutionize cancer treatment recommendations. The project cost $62 million.

By 2017, the project was abandoned. Watson's recommendations were found to be "unsafe and incorrect" in multiple cases.

In one documented case, Watson recommended a treatment that would cause severe bleeding in a patient already on blood thinners.

The core problem: Watson had been trained primarily on hypothetical cases created by physicians, not real patient data. The AI learned to mimic expert opinions rather than learning from actual outcomes.
STAT News, 2017; IEEE Spectrum, 2019
THE LESSON
AI trained on synthetic or hypothetical data fails on real patients. The gap between training data and reality can be lethal.
The Hallucination Problem
LAWYERS SANCTIONED, NEW YORK, 2023
Attorneys used ChatGPT to research case law for a federal court brief.

The AI cited six cases with full citations, quotes, and page numbers.

None of the cases existed.

The judge found the citations were "gibberish" and sanctioned the lawyers.

This is not a bug. It is how large language models work—they predict plausible text, not verified truth.
Mata v. Avianca, Inc., 22-cv-1461 (S.D.N.Y. 2023)
The Core Question

When to Trust AI in Meta-Analysis

AI Tool Output
Task Type?
Ranking/Prioritization
Lower risk: human reviews top-ranked
Binary Decision
Medium risk: needs validation
Text Generation
High risk: hallucination possible
What AI Can and Cannot Do

Honest Assessment

Screening prioritization ✓ Excellent
Duplicate detection ✓ Excellent
Data extraction (structured) ⚠ Needs verification
Risk of bias assessment ⚠ Preliminary only
Writing protocol/methods ⚠ Draft only
Statistical analysis ✗ Human required
Clinical interpretation ✗ Human required
"The machine reads fast but does not understand.
It predicts the next word, not the truth.
Use it to accelerate, not to replace.
The judgment must remain yours."
Have you not seen the reviewer
who screened ten thousand titles by hand,
whose eyes grew tired, whose attention wandered,
who missed the one study that mattered?
The Screening Tools
ASReview
Active learning
Open source
Free
Rayyan
AI recommendations
Collaboration
Freemium
Abstrackr
Semi-automated
Web-based
Free
EPPI-Reviewer
Priority screening
Full workflow
Subscription
How Active Learning Works

ASReview Workflow

Import References
Screen seed papers: 10-20 known relevant
AI learns patterns: updates with each decision
Prioritizes likely relevant: most promising first
Stopping rule?
Consecutive irrelevant: e.g., 100-200 in a row
% screened: e.g., 50% with recall check
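To make the loop concrete, here is a minimal Python sketch of certainty-based active learning using scikit-learn. It is illustrative only: ASReview's real pipeline adds feature-extraction choices, balancing strategies, and safeguards, and the ask_human callback is a hypothetical stand-in for the reviewer's decision.

# Minimal active-learning screening loop (sketch, not ASReview internals)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def screen(abstracts, seed_labels, ask_human, stop_after=100):
    """seed_labels: dict {index: 0/1}; must contain both classes."""
    X = TfidfVectorizer(max_features=5000).fit_transform(abstracts)
    labels = dict(seed_labels)
    pool = [i for i in range(len(abstracts)) if i not in labels]
    streak = 0                      # consecutive irrelevant decisions
    while pool and streak < stop_after:
        idx = sorted(labels)
        model = LogisticRegression(max_iter=1000).fit(
            X[idx], [labels[i] for i in idx])
        # Re-rank the remaining pool; show the most promising record first
        probs = model.predict_proba(X[pool])[:, 1]
        best = pool.pop(int(probs.argmax()))
        labels[best] = ask_human(abstracts[best])   # 1 = relevant, 0 = not
        streak = 0 if labels[best] else streak + 1
    return labels                   # pool now holds AI-deprioritized records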
Real Performance Data
VAN DE SCHOOT ET AL., 2021
Systematic evaluation of ASReview across 4 datasets:

PTSD dataset: 95% recall after screening 40% of records
Software fault prediction: 95% recall after 20%
Virus metagenomics: 95% recall after 10%

Workload reduction: 67-95%, depending on dataset prevalence.

But: Performance varies by topic and prevalence. Low-prevalence topics show greater efficiency gains.
Van de Schoot R et al. Nat Mach Intell. 2021;3:125-133
When AI-Assisted Screening Works
ASREVIEW AND COCHRANE COVID-19 RESPONSE, 2020
During the COVID-19 pandemic, Cochrane needed to screen 50,000+ citations weekly to keep reviews current.

ASReview's active learning system was deployed with rigorous human oversight:

• Reduced human screening workload by 75%
• Missed fewer than 1% of relevant studies
• Validated at every stage by human reviewers

The key to success: human-in-the-loop validation at every stage. The AI prioritized, but humans made final decisions and checked samples of AI-excluded records.
Cochrane COVID-NMA consortium, 2020-2021
THE LESSON
AI augments human judgment; it does not replace it. Success comes from partnership, not automation.
When Internal Validation Fails
EPIC SEPSIS MODEL, JAMA INTERNAL MEDICINE, 2021
Epic Systems deployed a sepsis prediction algorithm to hundreds of hospitals across the United States.

Epic's internal validation showed excellent performance. Hospitals trusted it.

Then came the external validation study in JAMA Internal Medicine:

• The model missed 67% of sepsis cases
• It triggered thousands of false alarms
• Nurses developed severe "alert fatigue"

The model had been validated on historical data from the same system—it had never been tested in the real clinical environment where it would be deployed.
Wong A et al. JAMA Intern Med. 2021;181(8):1065-1070
THE LESSON
Internal validation is not external validation. A model that works in development may fail in deployment. Always validate in the real-world context.
The Stopping Problem
THE HIDDEN DANGER
When do you stop screening with active learning?

If you stop too early: You miss relevant studies
If you stop too late: You lose efficiency gains

The algorithm cannot tell you when you've found everything. It only ranks what remains.

There is no perfect stopping rule. Every rule trades recall for efficiency.
CRITICAL POINT
You must validate your stopping rule by manually checking a random sample of unscreened records.
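In Python, that check might look like the sketch below; human_screen is a hypothetical stand-in for manual review, and the rule of three gives the approximate 95% upper bound when the sample contains no relevant records.

# Validate a stopping rule by hand-screening a random sample (sketch)
import random

def validate_stopping(unscreened_ids, human_screen, n=300, seed=2024):
    random.seed(seed)               # record the seed: it is your audit trail
    sample = random.sample(list(unscreened_ids), k=min(n, len(unscreened_ids)))
    missed = sum(human_screen(rid) for rid in sample)
    if missed:
        print(f"{missed}/{len(sample)} relevant found: screen all remaining records")
    else:
        # Rule of three: with 0 events in n trials, the 95% upper bound
        # on the prevalence of missed studies is approximately 3/n.
        print(f"0/{len(sample)} relevant; missed prevalence likely below {3/len(sample):.1%}")
    return missed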
AI Screening Decision Tree

Should You Use AI Screening?

Large Reference Set?
<500 refs
Manual OK: AI overhead not worth it
500-2000 refs
AI helpful: moderate efficiency gain
>2000 refs
AI essential: major time savings
Always validate with a random sample. Report the methodology in your paper.
"The machine finds the needles faster,
but it cannot guarantee none remain in the haystack.
Trust the ranking, verify the stopping,
and always report what you did."
Have you not dreamed of the assistant
who reads every paper and fills every cell,
who never tires, never errs,
who extracts perfectly?

That assistant does not exist.
The Extraction Accuracy Problem
GPT-4 DATA EXTRACTION STUDY, 2024
Researchers tested GPT-4 for extracting data from 100 RCT papers.

Results:
• Sample sizes: 89% accurate
• Effect estimates: 76% accurate
• Confidence intervals: 71% accurate
• Risk of bias judgments: 62% agreement with humans

A 24% error rate in effect estimates means roughly 1 in 4 studies would have wrong data in your meta-analysis.
Guo Y et al. J Clin Epidemiol. 2024;165:111203
The Fabrication Problem
GPT-4 HALLUCINATIONS IN SYSTEMATIC REVIEWS, 2023
Researchers tested GPT-4 for data extraction from systematic review papers. The model was given PDFs and asked to extract sample sizes, p-values, and effect estimates.

GPT-4 confidently provided all requested numbers with precise formatting.

But 23% of extractions were "hallucinations"—numbers with no basis in the source text.

In one case, the model fabricated a statistically significant result (p=0.003) from a study that actually found no significant effect (p=0.42).

The model's confidence was indistinguishable between real and fabricated data.
Systematic review AI validation studies, 2023
THE LESSON
LLMs require 100% human verification for quantitative data. There is no shortcut. Every number must be checked against the source.
LLM Data Extraction Workflow

Safe LLM Extraction Protocol

PDF/Full Text
LLM extracts data: structured prompt
Human verifies 100%: NOT sampling
Discrepancy?
Yes
Human value used: document the error
No
Proceed: log the verification
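A minimal Python sketch of that reconciliation step: the human-verified value always wins, and every discrepancy is logged so a per-task error rate can be reported (field names and the log path are illustrative).

# Reconcile LLM-extracted values against 100% human verification (sketch)
import csv

def reconcile(llm_values, human_values, log_path="verification_log.csv"):
    final, errors = {}, 0
    with open(log_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["field", "llm_value", "human_value", "match"])
        for field, human in human_values.items():
            llm = llm_values.get(field)
            errors += llm != human
            final[field] = human            # the human value is always used
            writer.writerow([field, llm, human, llm == human])
    return final, errors / max(len(human_values), 1)   # final data, error rate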
Prompt Engineering for Extraction
# Example extraction prompt

Extract the following from this RCT:

1. Sample size (intervention arm): [number]
2. Sample size (control arm): [number]
3. Primary outcome definition: [text]
4. Effect estimate: [number with unit]
5. 95% CI: [lower, upper]
6. p-value: [number]

If not reported, write "NR"
If unclear, write "UNCLEAR: [reason]"

# Provide exact quotes for verification
When LLMs Help vs. Hurt

LLM Extraction Value Assessment

Standardized fields (author, year) ✓ High accuracy
Simple numeric (sample size) ✓ Usually reliable
Complex numeric (adjusted OR) ⚠ Often pulls the wrong model's estimate
Composite outcomes ⚠ Misses components
Intention-to-treat vs per-protocol ✗ Frequently confused
Subgroup data ✗ High error rate
"The LLM extracts plausible numbers,
not necessarily correct numbers.
It is a fast first draft, not a final answer.
Every cell must be verified by human eyes."
Have you not wished for a judge
who reads every methods section,
who assesses bias without bias,
who never disagrees with themselves?
RobotReviewer
MARSHALL ET AL., NATURE MACHINE INTELLIGENCE, 2019
RobotReviewer uses machine learning to assess risk of bias in RCTs.

Validation against Cochrane assessments:
• Random sequence generation: 71% agreement
• Allocation concealment: 65% agreement
• Blinding of participants: 69% agreement
• Blinding of outcome assessment: 62% agreement

Human inter-rater agreement is typically 70-80%.

RobotReviewer approaches but does not exceed human performance.
Marshall IJ et al. Nat Mach Intell. 2019;1:115-117
RoB Automation Decision Tree

When to Use Automated RoB

Risk of Bias Assessment
Review Type?
Rapid review
Automated OK: acknowledge the limitation
Scoping review
Automated OK: if RoB is included
Full systematic review
Preliminary only: human verification required
Cochrane review
Human required: draft support only
Limitations of Automated RoB

What Machines Cannot Assess

Outcome-specific bias (RoB 2 domain 4)
Selective reporting based on protocol comparison
Contextual judgment (Is this design appropriate?)
Cross-paper inconsistencies (multiple reports)
Funding influence on results interpretation
THE FUNDAMENTAL LIMIT
AI reads what is written. Bias assessment often requires judging what is not written.
Hybrid Workflow for RoB

Best Practice Protocol

Full Text PDFs
RobotReviewer screening: flags potential issues
Reviewer 1 assesses: using AI output as reference
Reviewer 2 independently: blinded to AI output
Consensus meeting
Final assessment: human decision documented
"The robot reads the methods section
but cannot read between the lines.
Use it to flag, not to judge.
The verdict must be human."
Have you not wished for the writer
who drafts your protocol in minutes,
who knows every PRISMA item,
who writes in perfect academic prose?
LLMs for Protocol Drafting
Structure generation
Boilerplate text
PICO formulation
Search strategy
THE VALUE PROPOSITION
LLMs can draft the structure and standard language. You must provide the scientific decisions.
The Search Strategy Danger
TESTED ACROSS MULTIPLE LLMs, 2023-2024
Researchers asked GPT-4 and Claude to generate MEDLINE search strategies.

Common errors:
• Invented MeSH terms that don't exist
• Wrong field codes (e.g., [tiab] vs [tw])
• Missing key concepts from the research question
• Overly narrow strategies missing relevant studies
• Syntax errors that wouldn't execute

An information specialist must write or validate all search strategies.
Multiple validation studies 2023-2024
Protocol Writing Decision Tree

LLM Use in Protocol Development

Protocol Section
Background/Rationale
LLM helpful: draft, then fact-check
Methods structure
LLM helpful: template generation
PICO criteria
Human decides: LLM refines wording
Search strategy
Human/specialist: AI too unreliable
Safe LLM Protocol Workflow

Quality Assurance Steps

1 Define PICO yourself (human scientific decision)
2 Ask LLM to draft protocol sections
3 Verify all cited guidelines exist (PRISMA, Cochrane)
4 Write search strategy with information specialist
5 Check all methodological decisions are defensible
6 Disclose AI assistance in protocol
7 Register the human-verified version
"The machine can write the words,
but it cannot make the decisions.
You define the question. You choose the methods.
The protocol is yours—AI is the typist."
Have you not seen the systematic review
that was out of date before it was published,
while new trials accumulated in the literature,
unsynthesized, unknown?
The Living Review Problem
COVID-19 EVIDENCE TSUNAMI, 2020
In the first year of the pandemic:

100,000+ COVID papers published
• Traditional reviews obsolete within weeks
• Clinicians made decisions on incomplete evidence

The COVID-NMA consortium used AI-assisted surveillance to monitor new trials daily and update meta-analyses weekly.

This required: automated search monitoring, AI screening prioritization, rapid data extraction workflows, and continuous statistical updates.
Cochrane Living Reviews guidance
AI Components for Living Reviews

Automated Surveillance Stack

Living Review System
Auto-search: daily/weekly runs
AI triage: priority screening
Rapid extraction: LLM-assisted
Auto-update: cumulative MA
Human oversight at each stage: editorial review before publication
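As a sketch of the auto-search component, NCBI's public E-utilities API can be polled on a schedule. The search term below is a placeholder; a real strategy must come from an information specialist.

# Weekly PubMed surveillance via NCBI E-utilities (sketch)
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    "term": "(placeholder search strategy)",   # from your specialist
    "datetype": "edat",     # Entrez date: when the record was added
    "reldate": 7,           # only records from the last 7 days
    "retmax": 500,
    "retmode": "json",
}
pmids = requests.get(ESEARCH, params=params, timeout=30).json()["esearchresult"]["idlist"]
print(f"{len(pmids)} new records to triage this week")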
Tools for Continuous Monitoring
PubMed Alerts
Free email alerts
Saved searches
Basic
Epistemonikos
Systematic review database
AI-curated
Covidence
Auto-import
Living mode
Subscription
DistillerSR
AI screening + monitoring
Enterprise
Living Review Decision Framework

When to Make a Review "Living"

Should This Be Living?
Criteria Check
Priority question: clinical importance
Evidence evolving: active trial pipeline
Resources secured: funding for 2+ years
All three required for living status
"The machine watches the literature
while you sleep.
But someone must wake to judge
whether the new evidence changes the truth."
If you use the machine without verification,
you do not know what errors you have made.

If you verify everything the machine produces,
what time have you saved?

The answer lies in strategic verification.
The Verification Paradox
THE DILEMMA
Full verification = No time savings
No verification = Unknown error rate
Strategic verification = Validated efficiency

Verification Strategy by Risk

High-risk tasks
100% human review: data extraction, RoB
Medium-risk tasks
Sample validation: screening decisions
Low-risk tasks
Spot checks: deduplication
When Oversight Catches Bias
COCHRANE MACHINE LEARNING PILOT, 2022
Cochrane tested ML-assisted risk of bias assessment to accelerate systematic reviews.

The algorithm achieved 85% agreement with human reviewers—seemingly impressive.

But the QA team analyzed the 15% disagreements and found a pattern:

The AI was systematically biased toward rating industry-funded trials as low risk.

The training data contained more "low risk" labels for pharmaceutical company trials—the algorithm learned this correlation without understanding the underlying methodological concerns.

Human oversight caught the pattern before any biased reviews were published.
Cochrane Methods Group pilot study, 2022
THE LESSON
Disagreement analysis reveals systematic bias. High overall accuracy can hide dangerous patterns. Always analyze where and how the AI fails, not just how often.
QA Framework for AI-Assisted Reviews

Minimum Quality Standards

1 Pre-specify AI use in protocol (which tools, which tasks)
2 Document AI settings (model version, prompts, parameters)
3 Validate screening with random sample (calculate recall estimate)
4 Verify all extracted data against source documents
5 Human RoB assessment (AI as preliminary only)
6 Track error rates per AI task
7 Report transparently in methods section
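One way to satisfy steps 2 and 6 is to freeze settings and error tallies in a machine-readable audit file kept with the review. A minimal sketch follows; the field names are illustrative, not a formal standard.

# Machine-readable audit record of AI use (sketch; fields are illustrative)
import datetime
import json

audit = {
    "date": datetime.date.today().isoformat(),
    "tool": "ASReview LAB",
    "version": "1.2",                     # record the exact version you ran
    "task": "title/abstract screening",
    "stopping_rule": "150 consecutive irrelevant",
    "validation": {"sample_n": 300, "relevant_found": 0},
}
with open("ai_audit_log.json", "w") as f:
    json.dump(audit, f, indent=2)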
Reporting AI Use (PRISMA-S)
WHAT TO REPORT IN YOUR PAPER
Which AI tools were used (name, version, date)
Which tasks were AI-assisted
What validation was performed
What error rates were observed
What human oversight was maintained
Any deviations from protocol due to AI limitations
EMERGING STANDARD
Journals increasingly require AI use statements. PRISMA-S extension for search reporting includes automation.
The Complete AI-MA Workflow

Integrated Human-AI Process

Protocol (Human + LLM draft)
Search (Human/Specialist)
Screening (AI prioritize + Human decide)
Extraction (LLM draft + Human verify 100%)
RoB (AI flag + Human assess)
Analysis (Human)
Interpretation (Human)
"The machine is neither colleague nor replacement.
It is a tool—powerful, fast, and fallible.
Document what you used. Validate what it produced.
The responsibility remains yours."
Have you not considered
whose labor trained the model,
whose data it consumed without consent,
whose jobs it may displace?
The Hidden Labor
KENYAN DATA LABELERS, TIME MAGAZINE 2023
ChatGPT was made "safe" through a process called RLHF—Reinforcement Learning from Human Feedback.

The humans providing that feedback were workers in Kenya, paid less than $2 per hour to read and label toxic, violent, and disturbing content.

They developed psychological trauma from the work.

Every AI tool you use rests on human labor—often invisible, often underpaid, often harmed.
Perrigo B. Time Magazine. 2023 Jan 18.
Automating Inequality
UK A-LEVEL ALGORITHM SCANDAL, 2020
When COVID-19 cancelled A-Level exams in the UK, the government used an algorithm to predict student grades based on historical school performance.

The results:

• Students from disadvantaged schools were systematically downgraded
• Students from private schools were upgraded
• The algorithm overrode teacher predictions that students would succeed

After massive public outcry, 40% of grades were revised.

The algorithm had encoded historical inequality as prediction. Schools that historically sent fewer students to university were penalized, regardless of individual student ability.
UK Office of Qualifications and Examinations Regulation, 2020
THE LESSON
AI can automate bias at scale. When historical data reflects systemic inequality, algorithms trained on that data perpetuate and amplify it.
Ethical Framework for AI in Research

Questions to Ask

1 Transparency: Can I fully disclose how AI was used?
2 Accountability: Who is responsible for AI errors?
3 Equity: Does AI access create research inequities?
4 Labor: Whose work enabled this tool?
5 Environment: What is the carbon cost of model training?
6 Reproducibility: Can others replicate my AI-assisted work?
Authorship and AI
ICMJE POSITION
AI tools cannot be listed as authors.

Authors must take responsibility for AI-generated content.

AI use must be disclosed in methods or acknowledgments.
YOUR RESPONSIBILITY
If the AI hallucinates and you publish it, you bear the responsibility—not OpenAI, not Anthropic, not the tool.
"The machine has no conscience.
It does not care if the data is true.
It does not know who was harmed to train it.
You must be the conscience it lacks."
The Road Ahead
Where AI in Evidence Synthesis is Going
Emerging Capabilities
Multimodal AI
Extract from figures/tables
2024-2025
Agent Systems
Multi-step workflows
Emerging
RAG Systems
Retrieval-augmented generation
Active research
Fine-tuned Models
MA-specific training
In development
What Will NOT Change

Enduring Human Requirements

Defining the research question (clinical judgment)
Interpreting clinical significance (domain expertise)
Assessing applicability (contextual knowledge)
Making recommendations (value judgments)
Taking responsibility (ethical accountability)
THE CONSTANT
AI will accelerate the mechanics.
The science remains human.
Preparing for the Future

Skills to Develop

Future-Ready Researcher
Prompt engineering: getting good AI outputs
Validation methods: knowing when AI errs
Core methods: what AI cannot replace
The best AI users are the best methodologists: understanding enables oversight
"The machine grows stronger each year.
But the question remains the same:
What is true? What helps patients?
AI can assist the search.
Only you can provide the answer."
Test Your Knowledge
What is the main limitation of using LLMs for data extraction?
They are too slow
They can generate plausible but incorrect data (hallucinations)
They cannot read PDFs
They are too expensive
When using AI screening (e.g., ASReview), what must you always do?
Trust the AI completely after training
Screen only the top 10% of ranked records
Validate the stopping rule with a random sample
Use multiple AI tools simultaneously
For which task should AI NEVER be the final decision-maker?
Deduplication
Screening prioritization
Clinical interpretation of results
Reference formatting
References

Key Sources

  1. Van de Schoot R et al. Nat Mach Intell. 2021;3:125-133. [ASReview]
  2. Marshall IJ et al. Nat Mach Intell. 2019;1:115-117. [RobotReviewer]
  3. Guo Y et al. J Clin Epidemiol. 2024;165:111203. [GPT-4 extraction]
  4. Mata v. Avianca, 22-cv-1461 (S.D.N.Y. 2023). [Hallucination case]
  5. Perrigo B. Time Magazine. 2023 Jan 18. [AI labor ethics]
  6. Elliott JH et al. J Clin Epidemiol. 2017;91:23-30. [Living reviews]
  7. Cochrane Handbook 2023. Chapter on automation.
  8. ICMJE. Recommendations on AI authorship. 2023.
  9. Rethlefsen ML et al. J Med Libr Assoc. 2021. [PRISMA-S]
  10. Wang S et al. Syst Rev. 2023;12:178. [AI screening validation]
Course Complete
"You now know the Silicon Scribe—
its powers and its limits.
Use it to accelerate, not to replace.
Validate what it produces.
Document what you did.
And remember always:
The machine predicts the next word.
You must judge if that word is true."
ASReview: Step-by-Step Tutorial
From installation to stopping decision
Step 1: Installation
# Option A: Python pip (recommended)
pip install asreview

# Option B: Download desktop app
# https://asreview.nl/download/

# Launch ASReview LAB
asreview lab
REQUIREMENTS
• Python 3.8+ (for pip install)
• OR: Windows/Mac desktop app (no Python needed)
• Your references in RIS, CSV, or EndNote XML format
Step 2: Create Project & Import

Project Setup Workflow

New Project
Name your project: descriptive, include the date
Import references: RIS/CSV/XML file
ASReview deduplicates: check the count matches expectations
Ready for prior knowledge
Step 3: Add Prior Knowledge
CRITICAL STEP
The model learns from your initial decisions.
You need both relevant AND irrelevant examples.

Prior Knowledge Strategy

1 Add 5-10 known relevant studies (from scoping search)
2 Search for clearly irrelevant topics (random sample)
3 Mark 10-20 irrelevant as negative examples
4 Aim for ~1:2 ratio (relevant:irrelevant) to start
WARNING
Poor prior knowledge = poor model performance.
Garbage in, garbage out.
Step 4: Screen with Active Learning

Screening Loop

ASReview presents record
Your decision
Relevant: include for full-text review
Irrelevant: exclude
Model updates: re-ranks remaining records
Next most likely relevant: repeat until the stopping rule
Step 5: Stopping Decision

Stopping Rules Compared

Consecutive irrelevant (50-200) Common, but no recall guarantee
% of total screened (e.g., 50%) Predictable effort, variable recall
All records screened 100% recall, no time savings
Statistical stopping (Busfelder) Evidence-based, requires plugin
VALIDATION REQUIREMENT
After stopping: manually screen random sample of unscreened records.
Report estimated recall with confidence interval.
"The tool is simple. The decisions are not.
Feed it good examples. Check when you stop.
Export your project file—it is your audit trail."
Prompt Engineering Library
Validated prompts for meta-analysis tasks
Prompt Principles

For Reliable LLM Outputs

1 Be specific: Define exact fields and formats
2 Provide examples: Show expected output format
3 Request uncertainty: Ask for "NR" or "UNCLEAR" flags
4 Demand quotes: Require source text for verification
5 Limit scope: One task per prompt, not everything at once
Prompt 1: RCT Data Extraction
Extract the following from this RCT. For each field, provide:
- The value
- The exact quote from the paper (in quotes)
- "NR" if not reported, "UNCLEAR" if ambiguous

FIELDS:
1. Intervention group sample size (ITT): [n]
2. Control group sample size (ITT): [n]
3. Primary outcome definition: [text]
4. Primary outcome: intervention events/total: [x/n]
5. Primary outcome: control events/total: [x/n]
6. Risk ratio (95% CI): [RR (lower, upper)]
7. Follow-up duration: [weeks/months]

OUTPUT FORMAT: JSON with "value" and "quote" for each field
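Assuming the model returns the JSON shape requested above, a short script can pre-flag any field whose supporting quote does not appear verbatim in the source text, a common hallucination signature. This is a first filter only; every field still requires human verification.

# Flag extractions whose "quote" is not verbatim in the source (sketch)
import json

def flag_unsupported(llm_json, source_text):
    flagged = {}
    for field, item in json.loads(llm_json).items():
        quote = (item.get("quote") or "").strip()
        if item.get("value") != "NR" and quote not in source_text:
            flagged[field] = item     # no verbatim support: check by hand
    return flagged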
Prompt 2: Study Characteristics
Extract study characteristics. Provide exact quotes for verification.

FIELDS:
1. Study design: [RCT / Cluster RCT / Crossover / Other]
2. Country/countries: [list]
3. Setting: [hospital / primary care / community / other]
4. Recruitment period: [start date - end date]
5. Funding source: [text]
6. Trial registration: [ID number or "NR"]
7. Conflicts of interest declared: [Yes/No/NR]

If information is in supplementary materials, note "See Supplement".
If truly not reported anywhere, mark "NR".
Prompt 3: Population Characteristics
Extract baseline population characteristics.
Report for INTERVENTION and CONTROL groups separately.

FIELDS (per group):
1. N randomized: [n]
2. N analyzed: [n]
3. Age: [mean (SD) or median (IQR)]
4. Sex (% female): [%]
5. Key inclusion criteria: [text]
6. Key exclusion criteria: [text]
7. Disease severity at baseline: [measure and value]

NOTE: If data are reported for combined groups only, extract the combined values and flag this.
Prompt 4: Risk of Bias Screening
NOTE: This is for PRELIMINARY flagging only.
Human assessment required for final judgment.

For each RoB 2 domain, identify relevant text:

D1 Randomization:
- Sequence generation method: [quote or NR]
- Allocation concealment method: [quote or NR]

D2 Deviations:
- Blinding of participants: [quote or NR]
- Blinding of personnel: [quote or NR]

D3 Missing data:
- Attrition rates: [intervention: x%, control: y%]
- Handling of missing data: [quote or NR]

DO NOT make judgments. Only extract quotes.
"The prompt is your contract with the machine.
Be precise in what you ask.
Demand evidence for every answer.
Verify every output against the source."
You may never write a systematic review.
But you will read them.

How do you know if the AI assistance
was done well or poorly?
The IBM Watson Oncology Failure
MD ANDERSON CANCER CENTER, 2017
IBM Watson for Oncology was trained to recommend cancer treatments.

After spending $62 million, MD Anderson cancelled the project.

Internal documents showed Watson made "unsafe and incorrect" treatment recommendations. It was trained on synthetic cases, not real patient data.

The AI looked confident. The recommendations were dangerous.

Lesson: AI confidence ≠ AI correctness
STAT News investigation, 2017; IEEE Spectrum 2019
Questions for AI-Assisted Reviews

What to Look for in Methods

1 Did they name the AI tools used? (version, date)
2 Did they specify which tasks were AI-assisted?
3 Did they validate AI outputs? How?
4 For AI screening: What stopping rule? What estimated recall?
5 For AI extraction: Was 100% human verified?
6 Was there human oversight of all AI decisions?
Red Flags in AI-Assisted Reviews

Warning Signs

"AI screened all titles" No human involvement?
"GPT extracted data" No verification mentioned?
"Stopped after 500 consecutive irrelevant" No recall estimate?
"AI-generated protocol" Human decisions unclear?
No AI tools mentioned but clearly AI-written Hidden AI use
For Patients and Clinicians
WHAT YOU NEED TO KNOW
Good AI use: Speeds up the work, human verifies
Bad AI use: Replaces human judgment, no validation

An AI-assisted review can be trustworthy—if done right.

Simple Questions to Ask

? "Was AI used in this review?"
? "Were the AI results checked by humans?"
? "Could AI have missed important studies?"
"AI assistance is not a flaw—it is often an advantage.
But only if validated, only if disclosed.
Ask: Was the machine checked?
If the answer is unclear, so is the review."
Have you not considered the researcher
with unstable internet, limited compute,
no institutional subscription,
who still needs to synthesize evidence?
Free & Offline-Capable Tools
ASReview
Desktop app
Works offline
FREE
Abstrackr
Web-based
Free accounts
FREE
Rayyan
Free tier
Limited AI
FREEMIUM
RevMan
Cochrane tool
Full MA software
FREE
Offline Workflow

When Internet is Unreliable

Search Phase
Library/cafe: batch-download all PDFs while connected
Screening Phase
ASReview desktop: works fully offline
Extraction Phase
Spreadsheet + local PDFs: no AI needed
Low-Cost LLM Alternatives
WHEN API COSTS ARE PROHIBITIVE
Claude/ChatGPT free tiers: Limited but functional
Ollama + local models: Free, runs on laptop (requires download)
Hugging Face inference: Free tier available
Manual extraction: Still gold standard, just slower
HONEST ASSESSMENT
AI is a convenience, not a necessity.
For decades, Cochrane reviews were produced without AI.
Quality comes from methods, not from tools.
Resource-Limited Decision Tree

Choosing Your Approach

Your Resources
Internet reliability?
Stable
Web tools OK: Rayyan, Covidence
Unreliable
Desktop tools: ASReview offline
None
Manual + spreadsheets: still valid
"The evidence belongs to everyone,
not just those with fast internet and paid subscriptions.
The tools may differ. The methods remain.
Quality synthesis is possible anywhere."
Validation Calculations
Sample sizes for AI verification
Estimating Recall After AI Screening
THE PROBLEM
You stopped screening at 1000 of 5000 records.
How confident are you that you found all relevant studies?

Validation Sampling

Unscreened records (n=4000)
Random sample (n=400): 10% or at least 200
Manual screening
0 relevant found: recall ≈ 95-100%
Relevant found: screen all remaining records
Sample Size Formula
FOR 95% CONFIDENCE IN RECALL
n = ln(1 - confidence) / ln(1 - prevalence)

Example:
If prevalence of relevant = 1% (0.01)
For 95% confidence (0.95):

n = ln(1 - 0.95) / ln(1 - 0.01)
n = ln(0.05) / ln(0.99)
n ≈ 299 records to sample
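The same calculation as a small Python function, using only the standard library. Rounding up ensures the confidence target is met; the printed values match the quick reference table below.

# Validation sample size: n = ln(1 - confidence) / ln(1 - prevalence)
import math

def validation_sample_size(confidence, prevalence):
    return math.ceil(math.log(1 - confidence) / math.log(1 - prevalence))

print(validation_sample_size(0.95, 0.01))    # 299
print(validation_sample_size(0.95, 0.005))   # 598
print(validation_sample_size(0.95, 0.05))    # 59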
Quick Reference Table

Sample Sizes for Validation

Prevalence 0.5%, 95% conf 598 records
Prevalence 1%, 95% conf 299 records
Prevalence 2%, 95% conf 149 records
Prevalence 5%, 95% conf 59 records
Practical minimum 200 records (conservative)
Reporting Your Validation
Example Methods Text:

"We used ASReview LAB (v1.2) for title/abstract screening with
active learning. Screening ceased after 150 consecutive
irrelevant records, having screened 1,247 of 4,892 records
(25%). To validate recall, we manually screened a random
sample of 300 unscreened records. No additional relevant
studies were identified, suggesting estimated recall ≥95%
(binomial 95% CI: 91-100%)."
"Validation is not optional—it is the price of efficiency.
Calculate your sample. Screen it manually.
Report what you found. Admit what you might have missed."