==================== MODULE 1: THE PROMISE AND THE PERIL ====================
Have you not heard of the machine that reads
ten thousand abstracts in an hour,
that extracts data while you sleep,
that promises to free you from drudgery?
The AI Revolution in Evidence Synthesis
67%
Workload reduction
with AI screening
95%
Recall achievable
with active learning
10x
Faster screening
than manual
THE PROMISE
AI can screen abstracts, extract data, assess risk of bias, and monitor new evidence, if used correctly.
When AI Fails in Healthcare
IBM WATSON ONCOLOGY, MD ANDERSON, 2013-2017
In 2013, MD Anderson Cancer Center partnered with IBM Watson to revolutionize cancer treatment recommendations. The project cost $62 million.

By 2017, the project was abandoned. Watson's recommendations were found to be "unsafe and incorrect" in multiple cases.

In one documented case, Watson recommended a treatment that would cause severe bleeding in a patient already on blood thinners.

The core problem: Watson had been trained primarily on hypothetical cases created by physicians, not real patient data. The AI learned to mimic expert opinion rather than learn from actual outcomes.
Stat News, 2017; IEEE Spectrum, 2019
THE LESSON
AI trained on synthetic or hypothetical data failed on real patients. The gap between training data and reality can be fatal.
The Hallucination Problem
LAWYERS SANCTIONED, NEW YORK, 2023
Attorneys used ChatGPT to research case law for a federal court brief.

The AI cited six cases, complete with full citations, quotations, and page numbers.

None of the cases existed.

The judge found the citations to be "gibberish" and sanctioned the attorneys.

This was not a bug. This is how large language models work: they predict plausible text, not verified facts.
Mata v. Avianca, Inc., 22-cv-1461 (S.D.N.Y. 2023)
The Core Question

When to Trust AI in Meta-Analysis

AI Tool Output
Task Type?
Ranking/Prioritization
Lower risk: human reviews top-ranked
Binary Decision
Medium risk: needs validation
Text Generation
High risk: hallucination possible
What AI Can and Cannot Do

Honest Assessment

Screening prioritization ✓ Excellent
Duplicate detection ✓ Excellent
Data extraction (structured) ⚠ Needs verification
Risk of bias assessment ⚠ Preliminary only
Protocol/methods writing ⚠ Draft only
Statistical analysis ✗ Human required
Clinical interpretation ✗ Human required
"The machine reads fast, but it does not understand.
It predicts the next word, not the truth.
Use it to accelerate, not to replace.
The judgment must remain yours."
==================== MODULE 2: AI-ASSISTED SCREENING ====================
Have you not seen the reviewer
who screened ten thousand titles by hand,
whose eyes grew tired, whose attention wandered,
who missed a crucial study?
Screening Tools
ASReview
Active learning
Open source
Free
Rayyan
AI recommendations
Collaboration
Freemium
Abstrackr
Semi-automated
Web-based
Free
EPPI-Reviewer
Priority screening
Full workflow
Subscription
How Active Learning Works

ASReview Workflow

Import References
Screen seed papers: 10-20 known relevant
AI learns patterns: updates with each decision
Prioritizes likely relevant: most promising first
Stopping rule?
Consecutive irrelevant: e.g., 100-200 in a row
% screened: e.g., 50% with a recall check
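The consecutive-irrelevant stopping rule above can be sketched in a few lines. This is a minimal illustration, not part of ASReview itself; the function name and the threshold of 150 are hypothetical choices:

```python
def stopped(decisions, threshold=150):
    """Return the 1-based position where the consecutive-irrelevant
    stopping rule fires, or None if screening should continue.

    decisions: iterable of booleans (True = marked relevant).
    """
    streak = 0
    for i, relevant in enumerate(decisions, start=1):
        # A relevant record resets the streak; an irrelevant one extends it.
        streak = 0 if relevant else streak + 1
        if streak >= threshold:
            return i
    return None

# Example: 5 relevant records, then 200 irrelevant in a row.
labels = [True] * 5 + [False] * 200
print(stopped(labels))  # rule fires at record 155
```

Note what the rule cannot do: it says when to stop screening, not whether anything relevant remains among the unscreened records. That is why the validation sample below is mandatory.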
Real Performance Data
VAN DE SCHOOT ET AL., 2021
Systematic evaluation of ASReview across 4 datasets:

PTSD dataset: 95% recall after screening 40% of records
Software fault prediction: 95% recall after 20%
Virus metagenomics: 95% recall after 10%

Average workload reduction: 67-95% depending on prevalence.

But: Performance varies by topic and prevalence. Low-prevalence topics show greater efficiency gains.
Van de Schoot R et al. Nat Mach Intell. 2021;3:125-133
When AI-Assisted Screening Works
ASREVIEW AND COCHRANE COVID-19 RESPONSE, 2020
During the COVID-19 pandemic, Cochrane needed to screen 50,000+ citations weekly to keep reviews current.

ASReview's active learning system was deployed under strict human oversight:

• Reduced human screening workload by 75%
• Missed fewer than 1% of relevant studies
• Validated at every stage by human reviewers

The key to success: human-in-the-loop validation at every stage. AI prioritized, but humans made the final decisions and checked samples of AI-excluded records.
Cochrane COVID-NMA consortium, 2020-2021
THE LESSON
AI augments human judgment; it does not replace it. Success comes from collaboration, not automation.
When Internal Validation Fails
EPIC SEPSIS MODEL, JAMA INTERNAL MEDICINE, 2021
Epic Systems deployed a sepsis prediction algorithm to hundreds of hospitals across the United States.

Epic's internal validation showed excellent performance. Hospitals trusted it.

Then came the external validation study in JAMA Internal Medicine:

• The model missed 67% of sepsis cases
• It triggered thousands of false alarms
• Nurses developed severe "alert fatigue"

The model had been validated on historical data from the same system; it was never tested in a real clinical environment before deployment.
Wong A et al. JAMA Intern Med. 2021;181(8):1065-1070
THE LESSON
Internal validation is not external validation. A model that works in development can fail in deployment. Always validate in the real world.
The Stopping Problem
The Hidden Danger
When do you stop active-learning screening?

If you stop too early: you miss relevant studies.
If you stop too late: you lose the efficiency gains.

The algorithm cannot tell you when you have found everything. It only ranks what remains.

There is no perfect stopping rule. Every rule trades recall for efficiency.
CRITICAL POINT
You must validate your stopping rule by manually checking a random sample of unscreened records.
AI Screening Decision Tree

Should You Use AI Screening?

Large Reference Set?
<500 refs
Manual OK: AI overhead not worth it
500-2000 refs
AI helpful: moderate efficiency gain
>2000 refs
AI essential: major time savings
Always validate with a random sample. Report the methodology in your paper.
"The machine finds the needles faster,
but it cannot guarantee none remain in the haystack.
Trust the ranking, validate the stopping,
and always report what you did."
==================== MODULE 3: LLMs FOR DATA EXTRACTION ====================
Have you not dreamed of the assistant
who reads every paper and fills every cell,
who never tires, never errs,
who extracts perfectly?

That assistant does not exist.
The Extraction Accuracy Problem
GPT-4 DATA EXTRACTION STUDY, 2024
Researchers tested GPT-4's ability to extract data from 100 RCT papers.

Results:
• Sample sizes: 89% accurate
• Effect estimates: 76% accurate
• Confidence intervals: 71% accurate
• Risk of bias judgments: 62% agreement with humans

A 24% error rate in effect estimates means roughly 1 in 4 studies would contribute erroneous data to the meta-analysis.
Guo Y et al. J Clin Epidemiol. 2024;165:111203
The Fabrication Problem
GPT-4 HALLUCINATIONS IN SYSTEMATIC REVIEWS, 2023
Researchers tested GPT-4's ability to extract data from systematic review papers. The model was given PDFs and asked to extract sample sizes, p-values, and effect estimates.

GPT-4 confidently provided all requested numbers with precise formatting.

But 23% of the extractions were hallucinations: numbers with no basis in the source text.

In one case, the model fabricated a statistically significant result (p=0.003) from a study that actually found no significant effect (p=0.42).

The model's confidence gave no way to distinguish real data from fabricated data.
Systematic review AI validation studies, 2023
THE LESSON
LLMs require 100% human verification of quantitative data. There is no shortcut. Every number must be checked against the source.
LLM Data Extraction Workflow

Safe LLM Extraction Protocol

PDF/Full Text
LLM extracts data: structured prompt
Human verifies 100%: NOT sampling
Discrepancy?
Yes
Human value used: document the error
No
Proceed: log verification
Prompt Engineering for Extraction
# Example extraction prompt

Extract the following from this RCT:

1. Sample size (intervention arm): [number]
2. Sample size (control arm): [number]
3. Primary outcome definition: [text]
4. Effect estimate: [number with unit]
5. 95% CI: [lower, upper]
6. p-value: [number]

If not reported, write "NR"
If unclear, write "UNCLEAR: [reason]"

# Provide exact quotes for verification
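The prompt's demand for exact quotes enables a cheap first-pass hallucination check before human verification: any quote that does not appear verbatim in the source text is flagged for scrutiny. A minimal sketch, assuming the LLM returns JSON with a "value" and a "quote" per field (the function name is illustrative; this supplements, never replaces, 100% human verification):

```python
import json

def flag_unsupported(extraction_json, source_text):
    """Flag fields whose supporting 'quote' does not appear verbatim
    in the source text. A missing quote is treated the same as an
    unsupported one: both go to a human for verification."""
    flagged = []
    for field, entry in json.loads(extraction_json).items():
        if entry.get("value") in (None, "NR"):
            continue  # nothing extracted, nothing to support
        quote = (entry.get("quote") or "").strip()
        if not quote or quote not in source_text:
            flagged.append(field)
    return flagged

source = "A total of 120 patients were randomised (60 per arm)."
output = json.dumps({
    "sample_size": {"value": 120,
                    "quote": "A total of 120 patients were randomised"},
    "p_value": {"value": 0.003,
                "quote": "p = 0.003 for the primary outcome"},
})
print(flag_unsupported(output, source))  # ['p_value'] -> needs human check
```

A quote can match verbatim and still be misread by the model, so passing this check clears nothing; it only surfaces the most blatant fabrications early.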
When LLMs Help vs. Hurt

LLM Extraction Value Assessment

Standardized fields (author, year) ✓ High accuracy
Simple numeric (sample size) ✓ Usually reliable
Complex numeric (adjusted OR) ⚠ Often wrong model chosen
Composite outcomes ⚠ Misses components
Intention-to-treat vs per-protocol ✗ Frequently confused
Subgroup data ✗ High error rate
"The LLM extracts plausible numbers,
not necessarily the correct numbers.
It is a fast first draft, not the final answer.
Every cell must be verified by human eyes."
==================== MODULE 4: AUTOMATED RISK OF BIAS ====================
Have you not wished for a judge who never tires,
who reads every methods section,
who assesses bias without bias,
who is perfectly systematic themselves?
RobotReviewer
MARSHALL ET AL., NATURE MACHINE INTELLIGENCE, 2019
RobotReviewer uses machine learning to assess risk of bias in RCTs.

Validation against Cochrane assessments:
• Random sequence generation: 71% agreement
• Allocation concealment: 65% agreement
• Blinding of participants: 69% agreement
• Blinding of outcome assessment: 62% agreement

Human inter-rater agreement is typically 70-80%.

RobotReviewer approaches but does not exceed human performance.
Marshall IJ et al. Nat Mach Intell. 2019;1:115-117
RoB Automation Decision Tree

When to Use Automated RoB

Risk of Bias Assessment
Review Type?
Rapid review
Automated OK: acknowledge limitation
Scoping review
Automated OK: if RoB included
Systematic review
Preliminary only: human verification required
Cochrane review
Human required: draft support only
Limitations of Automated RoB

What Machines Cannot Assess

Outcome-specific bias (RoB 2 domain 4)
Selective reporting based on protocol comparison
Contextual judgment (Is this design appropriate?)
Cross-paper inconsistencies (multiple reports)
Funding influence on result interpretation
FUNDAMENTAL LIMITATION
AI reads what is written. Bias assessment often requires judging what is not written.
Hybrid Workflow for RoB

Best Practice Protocol

Full Text PDFs
RobotReviewer screening: flags potential issues
Reviewer 1 assesses: using AI output as reference
Reviewer 2 independently: blinded to AI output
Consensus meeting
Final assessment: human decision documented
"The robot reads the methods section
but cannot read between the lines.
Use it to flag, not to judge.
The verdict must be human."
==================== MODULE 5: GPT FOR PROTOCOL WRITING ====================
Have you not wished for a writer
who drafts your protocol in minutes,
who knows every PRISMA item,
who writes in perfect academic prose?
LLMs for Protocol Drafting
Structure
generation
Boilerplate
text
PICO
formulation
Search
strategy
THE VALUE PROPOSITION
LLMs can draft structure and standard language. You must supply the scientific decisions.
The Search Strategy Danger
TESTED ACROSS MULTIPLE LLMs, 2023-2024
Researchers asked GPT-4 and Claude to generate MEDLINE search strategies.

Common errors:
• Invented MeSH terms that don't exist
• Wrong field codes (e.g., [tiab] vs [tw])
• Missing key concepts from the research question
• Overly narrow strategies missing relevant studies
• Syntax errors that wouldn't execute

An information specialist must write or validate all search strategies.
Multiple validation studies, 2023-2024
Protocol Writing Decision Tree

LLM Use in Protocol Development

Protocol Section
Background/Rationale
LLM helpful: draft + fact-check
Methods structure
LLM helpful: template generation
PICO criteria
Human decides: LLM refines wording
Search strategy
Human/specialist: AI too unreliable
Safe LLM Protocol Workflow

Quality Assurance Steps

1 Define PICO yourself (human scientific decision)
2 Ask LLM to draft protocol sections
3 Verify all cited guidelines exist (PRISMA, Cochrane)
4 Write search strategy with information specialist
5 Check all methodological decisions are defensible
6 Disclose AI assistance in protocol
7 Register the human-verified version
"The machine can write the words,
but it cannot make the decisions.
You define the question. You choose the methods.
The protocol is yours; the AI is the typist."
==================== MODULE 6: LIVING REVIEWS + AI ====================
Have you not seen the systematic review
that was outdated before it was published,
while new trials accumulated in the literature,
unsynthesized, unknown?
The Living Review Problem
THE COVID-19 EVIDENCE TSUNAMI, 2020
In the first year of the pandemic:

100,000+ COVID papers published
• Traditional reviews obsolete within weeks
• Clinicians made decisions on incomplete evidence

The COVID-NMA consortium used AI-assisted surveillance to monitor new trials daily and update meta-analyses weekly.

This required: automated search monitoring, AI screening prioritization, rapid data extraction workflows, and continuous statistical updating.
Defined in Cochrane Living Reviews guidance
AI Components of a Living Review

Automated Surveillance Stack

Living Review System
Auto-search: daily/weekly runs
AI triage: priority screening
Rapid extraction: LLM-assisted
Auto-update: cumulative MA
Human oversight at each stage: editorial review before publication
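The auto-search step is often just a scheduled query against PubMed's NCBI E-utilities `esearch` endpoint, restricted to records added since the last run. A minimal sketch that only builds the request URL (the helper name and the weekly window are illustrative assumptions; check the current E-utilities documentation before relying on parameter details):

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_surveillance_url(query, days_back=7, max_records=200):
    """Build a PubMed E-utilities search URL limited to records added
    in the last `days_back` days, for a weekly auto-search run."""
    params = {
        "db": "pubmed",
        "term": query,
        "reldate": days_back,   # relative date window in days
        "datetype": "edat",     # filter on the Entrez record-added date
        "retmax": max_records,
        "retmode": "json",
    }
    return f"{EUTILS}?{urlencode(params)}"

url = build_surveillance_url('"heart failure"[tiab] AND randomized')
print(url)
```

The returned PMIDs would then feed the AI triage step; the query string itself should come from the information specialist, as Module 5 insists.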
Continuous Monitoring Tools
PubMed Alerts
Free email alerts
Saved searches
Basic
Epistemonikos
Systematic review
database
AI-curated
Covidence
Auto-import
Living mode
Subscription
DistillerSR
AI screening
+ monitoring
Enterprise
Living Review Decision Framework

When to Make a Review "Living"

Should this review be living?
Criteria Check
Priority question: clinical importance
Evidence evolving: active trial pipeline
Resources secured: funding for 2+ years
All three required for living status
"The machine watches the literature while you sleep.

But someone must wake to judge
whether the new evidence changes the truth."
==================== MODULE 7: QUALITY ASSURANCE FRAMEWORK ====================
If you use the machine without validation,
you do not know what errors you have made.

If you validate everything the machine produces,
what time have you saved?

The answer lies in strategic verification.
The Verification Paradox
THE DILEMMA
Full verification = No time savings
No verification = Unknown error rate
Strategic verification = Validated efficiency

Verification Strategy by Risk

High-risk tasks
100% human review: data extraction, RoB
Medium-risk tasks
Sample validation: screening decisions
Low-risk tasks
Spot checks: deduplication
When Oversight Catches Bias
COCHRANE MACHINE LEARNING PILOT, 2022
Cochrane tested ML-assisted risk of bias assessment to accelerate systematic reviews.

The algorithm achieved 85% agreement with human reviewers, seemingly impressive.

But the QA team analyzed the 15% of disagreements and found a pattern:

The AI was systematically biased toward rating industry-funded trials as low risk.

The training data contained more "low risk" labels for pharmaceutical company trials; the algorithm learned this correlation without understanding the underlying methodological issues.

Human oversight caught the pattern before any biased reviews were published.
Cochrane Methods Group pilot study, 2022
THE LESSON
Disagreement analysis reveals systematic bias. High overall accuracy can hide dangerous patterns. Always analyze where and how the AI fails, not just how often.
QA Framework for AI-Assisted Reviews

Minimum Quality Standards

1 Pre-specify AI use in protocol (which tools, which tasks)
2 Document AI settings (model version, prompts, parameters)
3 Validate screening with random sample (calculate recall estimate)
4 Verify all extracted data against source documents
5 Human RoB assessment (AI as preliminary only)
6 Track error rates per AI task
7 Report transparently in methods section
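Standard 6 (track error rates per AI task) can be as simple as a tally kept during verification. A minimal sketch with illustrative names; the point is that per-task rates, not one pooled number, are what reveal where the AI is weak:

```python
from collections import defaultdict

class ErrorLog:
    """Tally verification outcomes per AI task.

    Each call to record() logs one human-verified item and whether
    the AI output for it was found to be in error."""

    def __init__(self):
        self.checked = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, task, is_error):
        self.checked[task] += 1
        if is_error:
            self.errors[task] += 1

    def rates(self):
        # Error rate per task, over items actually verified.
        return {t: self.errors[t] / self.checked[t] for t in self.checked}

log = ErrorLog()
for _ in range(95):
    log.record("extraction", False)   # verified, correct
for _ in range(5):
    log.record("extraction", True)    # verified, AI was wrong
print(log.rates())  # {'extraction': 0.05}
```

These per-task rates are exactly what the transparent reporting in standard 7 should cite.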
Reporting AI Use (PRISMA-S)
What to Report in Your Paper About AI Use
Which AI tools were used (name, version, date)
Which tasks were AI-assisted
What validation was performed
What error rates were observed
What human oversight was maintained
Any deviations or limitations
EMERGING STANDARD
Journals increasingly require AI use statements. PRISMA-S extension for search reporting includes automation.
The Complete AI-MA Workflow

Integrated Human-AI Process

Protocol (Human + LLM draft)
Search (Human/Specialist)
Screening (AI prioritize + Human decide)
Extraction (LLM draft + Human verify 100%)
RoB (AI flag + Human assess)
Analysis (Human)
Interpretation (Human)
"The machine is neither colleague nor replacement.
It is a powerful, fast, error-prone tool.
Document what you used. Validate what it produced.
The responsibility remains yours."
==================== MODULE 8: ETHICAL CONSIDERATIONS ====================
Have you considered
whose labor trained the model,
whose data it consumed without consent,
whose jobs it may displace?
The Hidden Labor
KENYAN DATA LABELERS, TIME MAGAZINE 2023
ChatGPT became "safe" through a process called RLHF: reinforcement learning from human feedback.

The people providing that feedback were workers in Kenya, paid less than $2 per hour to read and label toxic, violent, and disturbing content.

They suffered psychological trauma from the work.

Every AI tool you use depends on human labor: often invisible, often underpaid, often harmed.
Perrigo B. Time Magazine. 2023 Jan 18.
Automating Inequality
UK A-LEVEL ALGORITHM SCANDAL, 2020
When COVID-19 cancelled A-level exams in the UK, the government used an algorithm to predict student grades based on historical school performance.

The results:

• Students from disadvantaged schools were systematically downgraded
• Students from private schools were upgraded
• The algorithm overrode teachers' predictions that students would succeed

After massive public outcry, 40% of grades were revised.

The algorithm encoded historical inequality as prediction: schools that had historically sent fewer students to university were penalized, regardless of individual student ability.
UK Office of Qualifications and Examinations Regulation, 2020
THE LESSON
AI can automate bias at scale. When historical data reflects systemic inequality, algorithms trained on that data perpetuate and amplify it.
An Ethical Framework for AI in Research

Questions to Ask

1 Transparency: Can I fully disclose how AI was used?
2 Accountability: Who is responsible for AI errors?
3 Equity: Does AI access create research inequities?
4 Labor: Whose work enabled this tool?
5 Environment: What is the carbon cost of model training?
6 Reproducibility: Can others replicate my AI-assisted work?
Authorship and AI
ICMJE POSITION
AI tools cannot be listed as authors.

Authors must take responsibility for AI-generated content.

AI use must be disclosed in methods or acknowledgments.
YOUR RESPONSIBILITY
If the AI hallucinates and you publish it, you bear the responsibility: not OpenAI, not Anthropic, not the tool.
"The machine has no conscience.
It does not care whether the data is real.
It does not know who was harmed to train it.
You must be the conscience it lacks."
==================== MODULE 9: FUTURE DIRECTIONS ====================
The Road Ahead
Where AI in evidence synthesis is heading
Emerging Capabilities
Multimodal AI
Extract from
figures/tables
2024-2025
Agent Systems
Multi-step
workflows
Emerging
RAG Systems
Retrieval-augmented
generation
Active research
Fine-tuned Models
MA-specific
training
In development
What Will Not Change

Enduring Human Requirements

Defining research questions (clinical judgment)
Interpreting clinical significance (domain expertise)
Assessing applicability (contextual knowledge)
Making recommendations (value judgments)
Taking responsibility (ethical accountability)
THE CONSTANT
AI will accelerate the mechanics.
The science remains human.
Preparing for the Future

Skills to Develop

Future-Ready Researcher
Prompt engineering: getting good AI outputs
Validation methods: knowing when AI errs
Core methods: what AI cannot replace
The best AI users are the best methodologists: understanding enables oversight
"The machine grows stronger each year.
But the questions remain the same:
What is true? What helps patients?
AI can assist the search.
Only you can supply the answers."
==================== MODULE 10: QUIZ AND REFERENCES ====================
Test Your Knowledge
What is the main limitation of using LLMs for data extraction?
They are too slow
They can generate plausible but incorrect data (hallucinations)
They cannot read PDFs
They are too expensive
When using AI screening (e.g., ASReview), what must you always do?
Trust the AI completely after training
Screen only the top 10% of ranked records
Validate the stopping rule with a random sample
Use multiple AI tools simultaneously
For which task should AI never be the final decision-maker?
Deduplication
Screening prioritization
Clinical interpretation of results
Reference formatting
References

Key Sources

  1. Van de Schoot R et al. Nat Mach Intell. 2021;3:125-133. [ASReview]
  2. Marshall IJ et al. Nat Mach Intell. 2019;1:115-117. [RobotReviewer]
  3. Guo Y et al. J Clin Epidemiol. 2024;165:111203. [GPT-4 extraction]
  4. Mata v. Avianca, 22-cv-1461 (S.D.N.Y. 2023). [Hallucination case]
  5. Perrigo B. Time Magazine. 2023 Jan 18. [AI labor ethics]
  6. Elliott JH et al. J Clin Epidemiol. 2017;91:23-30. [Living reviews]
  7. Cochrane Handbook 2023. Chapter on automation.
  8. ICMJE. Recommendations on AI authorship. 2023.
  9. Rethlefsen ML et al. J Med Libr Assoc. 2021. [PRISMA-S]
  10. Wang S et al. Syst Rev. 2023;12:178. [AI screening validation]
Course Complete
"You now know the silicon scribe:
its powers and its limits.
Use it to accelerate, not to replace.
Validate what it produces.
Document what you did.
And remember always:
The machine predicts the next word.
You must judge whether that word is true."
==================== MODULE 11: ASREVIEW STEP-BY-STEP ====================
ASReview: Step-by-Step Tutorial
From installation to the stopping decision
Step 1: Installation
# Option A: Python pip (recommended)
pip install asreview

# Option B: Download the desktop app
# https://asreview.nl/download/

# Launch ASReview LAB
asreview lab
REQUIREMENTS
• Python 3.8+ (for pip install)
• OR: Windows/Mac desktop app (no Python needed)
• Your references in RIS, CSV, or EndNote XML format
Step 2: Create Project & Import

Project Setup Workflow

New Project
Name the project: descriptive, include date
Import referencesRIS/CSV/XML file
ASReview deduplicates: check count matches expected
Ready for prior knowledge
Step 3: Add Prior Knowledge
CRITICAL STEP
The model learns from your initial decisions.
You need both relevant and irrelevant examples.

Prior Knowledge Strategy

1 Add 5-10 known relevant studies (from a scoping search)
2 Search for clearly irrelevant topics (random sample)
3 Mark 10-20 irrelevant as negative examples
4 Aim for ~1:2 ratio (relevant:irrelevant) to start
WARNING
Poor prior knowledge = poor model performance.
Garbage in, garbage out.
Step 4: Screen with Active Learning

Screening Loop

ASReview presents record
Your decision
Relevant: include for full text
Irrelevant: exclude
Model updates: re-ranks remaining
Next most likely relevant: repeat until stopping rule
Step 5: Stopping Decision

Stopping Rules Compared

Consecutive irrelevant (50-200) Common, but no recall guarantee
% of total screened (e.g., 50%) Predictable effort, variable recall
All records screened 100% recall, no time savings
Statistical stopping (Busfelder) Evidence-based, requires plugin
VALIDATION REQUIREMENT
After stopping: manually screen random sample of unscreened records.
Report estimated recall with confidence interval.
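When the post-stopping validation sample contains zero relevant records, the classical "rule of three" gives a quick approximate 95% upper confidence bound on the prevalence of missed relevant records: about 3/n. A minimal sketch (the function name is illustrative; use a proper binomial CI for the value you actually report):

```python
def missed_prevalence_upper(n_sampled, n_found=0):
    """Approximate 95% upper bound on the prevalence of relevant
    records among the unscreened pool, via the rule of three.
    Valid only when n_found == 0 and n_sampled is reasonably large."""
    if n_found > 0:
        raise ValueError("Rule of three applies only when zero are found; "
                         "if any were found, screen all remaining records.")
    return 3.0 / n_sampled

# 0 relevant records in a random sample of 300 unscreened records:
print(f"{missed_prevalence_upper(300):.1%}")  # at most ~1.0% missed
```

So a clean sample of 300 supports a claim like "with 95% confidence, fewer than 1% of unscreened records are relevant", which is the kind of statement the audit trail should contain.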
"The tool is simple. The decisions are not.
Feed it good examples. Check when you stop.
Export your project file; it is your audit trail."
==================== MODULE 12: PROMPT ENGINEERING LIBRARY ====================
Prompt Engineering Library
Validated prompts for meta-analysis tasks
Prompt Principles

For Reliable LLM Outputs

1 Be specific: Define exact fields and formats
2 Provide examples: Show expected output format
3 Request uncertainty: ask for "NR" or "UNCLEAR" flags
4 Demand quotes: Require source text for verification
5 Limit scope: One task per prompt, not everything at once
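The five principles can be baked into a small prompt builder, so every extraction request uses the same exact fields, uncertainty flags, and quote requirement. A sketch with illustrative names; the field spec is whatever your extraction form defines:

```python
def build_extraction_prompt(fields):
    """Assemble an extraction prompt following principles 1-5:
    exact fields with declared formats, uncertainty flags, and a
    demand for supporting quotes. `fields` maps a field name to its
    expected format string, e.g. "[number]"."""
    lines = [
        "Extract the following from this RCT.",
        "For each field, provide the value and an exact supporting quote.",
        'If not reported, write "NR". If unclear, write "UNCLEAR: [reason]".',
        "",
    ]
    for i, (name, fmt) in enumerate(fields.items(), start=1):
        lines.append(f"{i}. {name}: {fmt}")
    lines.append("")
    lines.append('OUTPUT FORMAT: JSON with "value" and "quote" per field.')
    return "\n".join(lines)

prompt = build_extraction_prompt({
    "Sample size (intervention arm)": "[number]",
    "Effect estimate": "[number with unit]",
})
print(prompt)
```

Keeping the builder in code (principle 5: one task per prompt) also gives you a documented, versionable record of exactly which prompt each extraction used, which the QA framework in Module 7 requires.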
Prompt 1: RCT Data Extraction
Extract the following from this RCT. For each field, provide:
- The value
- The exact quote from the paper (in quotation marks)
- "NR" if not reported; "UNCLEAR" if ambiguous

FIELDS:
1. Intervention group sample size (ITT): [n]
2. Control group sample size (ITT): [n]
3. Primary outcome definition: [text]
4. Primary outcome: intervention events/total: [x/n]
5. Primary outcome: control events/total: [x/n]
6. Risk ratio (95% CI): [RR (lower, upper)]
7. Follow-up duration: [weeks/months]

OUTPUT FORMAT: JSON with "value" and "quote" for each field
Prompt 2: Study Characteristics
Extract study characteristics. Provide exact quotes for verification.

FIELDS:
1. Study design: [RCT / Cluster RCT / Crossover / Other]
2. Country/countries: [list]
3. Setting: [hospital / primary care / community / other]
4. Recruitment period: [start date - end date]
5. Funding source: [text]
6. Trial registration: [ID number or "NR"]
7. Conflicts of interest declared: [Yes/No/NR]

If information is in supplementary materials, note "See Supplement".
If truly not reported anywhere, mark "NR".
Prompt 3: Population Characteristics
Extract baseline population characteristics.
Report intervention and control groups separately.

FIELDS (per group):
1. N randomized: [n]
2. N analyzed: [n]
3. Age: [mean (SD) or median (IQR)]
4. Sex (% female): [%]
5. Key inclusion criteria: [text]
6. Key exclusion criteria: [text]
7. Disease severity at baseline: [measure and value]

NOTE: If only combined-group data are reported, report the combined values and note this.
Prompt 4: Risk of Bias Screening
NOTE: This is for preliminary flagging only.
Human assessment required for final judgment.

For each RoB 2 domain, identify relevant text:

D1 Randomization:
- Sequence generation method: [quote or NR]
- Allocation concealment method: [quote or NR]

D2 Deviations:
- Blinding of participants: [quote or NR]
- Blinding of personnel: [quote or NR]

D3 Missing data:
- Attrition rates: [intervention: x%, control: y%]
- Handling of missing data: [quote or NR]

DO NOT make judgments. Only extract quotes.
"The prompt is your contract with the machine.
Be exact about what you demand.
Demand evidence for every claim.
Verify every output against the source."
==================== MODULE 13: READING AI-ASSISTED REVIEWS ====================
You may never write a systematic review.
But you will read them.

How do you know whether AI assistance
was done well or poorly?
The IBM Watson Oncology Failure
MD ANDERSON CANCER CENTER, 2017
IBM Watson for Oncology was trained to make cancer treatment recommendations.

After spending $62 million, MD Anderson cancelled the project.

Internal documents showed Watson made "unsafe and incorrect" treatment recommendations because it was trained on synthetic cases, not real patient data.

The AI's recommendations looked confident. They were dangerous.

Lesson: AI confidence ≠ AI correctness
STAT News investigation, 2017; IEEE Spectrum 2019
Questions for AI-Assisted Reviews

What to Look For in the Methods

1 Did they name the AI tools used? (version, date)
2 Did they specify which tasks were AI-assisted?
3 Did they validate AI outputs? How?
4 For AI screening: what stopping rule? What estimated recall?
5 For AI extraction: was it 100% human verified?
6 Was there human oversight of all AI decisions?
Red Flags in AI-Assisted Reviews

Warning Signs

"AI screened all titles" No human involvement?
"GPT extracted the data" No verification mentioned?
"Stopped after 500 consecutive irrelevant" No recall estimate?
"AI-generated protocol" Human decisions unclear?
No AI tools mentioned but clearly AI-written Hidden AI use
For Patients and Clinicians
What You Need to Know
Good AI use: Speeds up the work, human verifies
Bad AI use: Replaces human judgment, no validation

An AI-assisted review can be trustworthy—if done right.

Simple Questions to Ask

? "Was AI used in this review?"
? "Were the AI outputs checked by humans?"
? "Could AI have missed important studies?"
"AI assistance is not a flaw; it is often an advantage.
But only if validated, only if disclosed.
Ask: did a human check the machine?
If the answer is unclear, so is the review."
==================== MODULE 14: RESOURCE-LIMITED SETTINGS ====================
Have you not considered the researcher
with unstable internet, limited compute,
no institutional subscription,
who still needs to synthesize evidence?
Free and Offline-Capable Tools
ASReview
Desktop app
Works offline
FREE
Abstrackr
Web-based
Free accounts
FREE
Rayyan
Free tier
Limited AI
FREEMIUM
RevMan
Cochrane tool
Full MA software
FREE
Offline Workflow

When Internet is Unreliable

Search Phase
Library/café: batch-download all PDFs while connected
Screening Phase
ASReview desktop: works fully offline
Extraction Phase
Spreadsheet + local PDFs: no AI needed
Low-Cost LLM Alternatives
WHEN API COSTS ARE PROHIBITIVE
Claude/ChatGPT free tiers: Limited but functional
Ollama + local models: Free, runs on laptop (requires download)
Hugging Face inference: Free tier available
Manual extraction: Still gold standard, just slower
HONEST ASSESSMENT
AI is a convenience, not a necessity.
For decades, Cochrane reviews were done without AI.
Quality comes from methods, not from tools.
Resource-Limited Decision Tree

Choose Your Approach

Your Resources
Internet reliability?
Stable
Web tools OK: Rayyan, Covidence
Unreliable
Desktop tools: ASReview offline
None
Manual + spreadsheets: still valid
"Evidence belongs to everyone,
not only those with fast internet and paid subscriptions.
The tools may differ. The methods remain.
Quality synthesis is possible anywhere."
==================== MODULE 15: VALIDATION CALCULATIONS ====================
Validation Calculations
Sample sizes for AI validation
Estimating Recall After AI Screening
THE PROBLEM
You stopped screening at 1,000 of 5,000 records.
How confident are you that you found all relevant studies?

Validation Sampling

Unscreened records (n=4000)
Random sample (n=400): 10% or at least 200
Manual screening
0 relevant found: recall ≈ 95-100%
Relevant found: screen all remaining
Sample Size Formula
For 95% confidence in recall
n = ln(1 - confidence) / ln(1 - prevalence)

Example:
If prevalence of relevant = 1% (0.01)
For 95% confidence (0.95):

n = ln(1 - 0.95) / ln(1 - 0.01)
n = ln(0.05) / ln(0.99)
n ≈ 299 records to sample
Quick Reference Table

Sample Sizes for Validation

Prevalence 0.5%, 95% conf 598 records
Prevalence 1%, 95% conf 299 records
Prevalence 2%, 95% conf 149 records
Prevalence 5%, 95% conf 59 records
Practical minimum 200 records (conservative)
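The formula and the quick reference table above can be computed directly. A small sketch (`validation_sample_size` is an illustrative helper; the ceiling gives the table's conservative rounding):

```python
import math

def validation_sample_size(prevalence, confidence=0.95):
    """Number of unscreened records to sample so that, with the given
    confidence, at least one relevant record would appear in the sample
    if relevant records occur at `prevalence` in the unscreened pool.

    Implements n = ln(1 - confidence) / ln(1 - prevalence), rounded up.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - prevalence))

for p in (0.005, 0.01, 0.02, 0.05):
    print(f"prevalence {p:.1%}: n = {validation_sample_size(p)}")
# prevalence 0.5%: n = 598
# prevalence 1.0%: n = 299
# prevalence 2.0%: n = 149
# prevalence 5.0%: n = 59
```

The prevalence itself must be estimated (for example from the relevant fraction among already-screened records), so when in doubt use the smaller plausible prevalence, which yields the larger, safer sample.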
Reporting Your Validation
Example methods text:

"We used ASReview LAB (v1.2) for title/abstract screening with
active learning. Screening ceased after 150 consecutive
irrelevant records, having screened 1,247 of 4,892 records
(25%). To validate recall, we manually screened a random
sample of 300 unscreened records. No additional relevant
studies were found, indicating an estimated recall ≥95%
(binomial 95% CI: 91-100%)."
"Validation is not optional; it is the price of efficiency.
Calculate your sample. Screen it manually.
Report your findings. Acknowledge what you may have missed."