The Silicon Scribe: AI in Meta-Analysis

Have you not heard of the machine that reads
ten thousand abstracts in an hour,
that extracts data while you sleep,
that promises to free you from drudgery?

The AI Revolution in Evidence Synthesis

67%: Workload reduction with AI screening

95%: Recall achievable with active learning

10x: Faster screening than manual

THE PROMISE

AI can screen abstracts, extract data, assess risk of bias, and monitor new evidence, if used correctly.

When AI Fails in Healthcare

IBM WATSON ONCOLOGY, MD ANDERSON, 2013-2017

In 2013, MD Anderson Cancer Center partnered with IBM Watson to revolutionize cancer treatment recommendations. The project cost $62 million.

By 2017, the project had been shelved. Watson's recommendations had been found "unsafe and incorrect" in multiple cases.

In one documented case, Watson recommended a treatment that would cause severe bleeding in a patient already on blood thinners.

The core problem: Watson had been trained primarily on hypothetical cases created by physicians, not on real patient data. The AI learned to mimic expert opinion rather than to learn from actual outcomes.

Stat News, 2017; IEEE Spectrum, 2019

THE LESSON

AI trained on synthetic or hypothetical data can fail on real patients. The gap between training data and reality can be deadly.

The Hallucination Problem

LAWYERS SANCTIONED, NEW YORK, 2023

Attorneys used ChatGPT to research case law for a federal court brief.

The AI cited six cases, complete with full citations, quotations, and page numbers.

None of the cases existed.

The judge found the citations to be "gibberish" and sanctioned the lawyers.

This is not a bug. It is how large language models work: they predict plausible text, not verified truth.

Mata v. Avianca, Inc., 22-cv-1461 (S.D.N.Y. 2023)

The Key Question

When to Trust AI in Meta-Analysis

AI Tool Output

↓

Task Type?

Ranking/Prioritization

Lower risk: Human reviews top-ranked

Binary Decision

Medium risk: Needs validation

Text Generation

High risk: Hallucination possible

What AI Can and Cannot Do

Honest Assessment

Screening prioritization ✓ Excellent

Duplicate detection ✓ Excellent

Data extraction (structured) ⚠ Needs verification

Risk of bias assessment ⚠ Preliminary only

Writing protocols/methods ⚠ Draft only

Statistical analysis ✗ Human required

Clinical interpretation ✗ Human required

"The machine reads fast but does not understand.
It predicts the next word, not the truth.
Use it to accelerate, not to replace.
The judgment must remain yours."

Have you not seen the reviewer
who screened ten thousand titles by hand,
whose eyes grew tired, whose attention wandered,
who missed the crucial study?

Screening Tools

ASReview: Active learning, open source (Free)

Rayyan: AI recommendations, collaboration (Freemium)

Abstrackr: Semi-automated, web-based (Free)

EPPI-Reviewer: Priority screening, full workflow (Subscription)

How Active Learning Works

ASReview Workflow

Import References

↓

Screen seed papers: 10-20 known relevant

↓

AI learns patterns: Updates with each decision

↓

Prioritizes likely relevant: Most promising first

↓

Stopping rule?

Consecutive irrelevant: e.g., 100-200 in a row

% screened: e.g., 50% with a recall check
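The consecutive-irrelevant stopping rule from the workflow can be sketched in a few lines. This is a minimal illustration, not ASReview's implementation; the function name `should_stop` and the threshold of 100 are assumptions chosen for the example.

```python
from typing import Sequence


def should_stop(decisions: Sequence[bool], n_consecutive: int = 100) -> bool:
    """Return True once the last `n_consecutive` screened records were all irrelevant.

    `decisions` holds the reviewer's labels in screening order
    (True = relevant, False = irrelevant). Hypothetical helper for illustration.
    """
    if n_consecutive <= 0:
        raise ValueError("n_consecutive must be positive")
    if len(decisions) < n_consecutive:
        return False
    return not any(decisions[-n_consecutive:])


# Illustrative labels: a few relevant records early, then 100 irrelevant in a row.
labels = [True, False, True, True] + [False] * 100
stop_now = should_stop(labels, n_consecutive=100)
```

The rule only says that the ranking has gone quiet. It says nothing about how many relevant records remain unscreened, which is why the stopping decision still needs a separate recall check.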

Real Performance Data

VAN DE SCHOOT ET AL., 2021

Systematic evaluation of ASReview across 4 datasets:

• PTSD dataset: 95% recall after screening 40% of records
• Software fault prediction: 95% recall after 20%
• Virus metagenomics: 95% recall after 10%

Average workload reduction: 67-95% depending on prevalence.

But: Performance varies by topic and prevalence. Low-prevalence topics show greater efficiency gains.

Van de Schoot R et al. Nat Mach Intell. 2021;3:125-133
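The figures above can be turned into comparable numbers with a short calculation. The sketch below computes the share of records left unscreened and "work saved over sampling" (WSS), a metric commonly used in the screening-automation literature. WSS is not reported on this slide; it is added here as an assumption about how one might summarize the results.

```python
def work_saved_over_sampling(fraction_screened: float, recall: float) -> float:
    """WSS at a given recall: share of records left unscreened minus the recall shortfall."""
    for name, value in (("fraction_screened", fraction_screened), ("recall", recall)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return (1.0 - fraction_screened) - (1.0 - recall)


# Figures quoted on the slide: 95% recall after screening 40%, 20%, and 10% of records.
slide_results = [
    ("PTSD", 0.40),
    ("Software fault prediction", 0.20),
    ("Virus metagenomics", 0.10),
]
for dataset, screened in slide_results:
    wss = work_saved_over_sampling(screened, recall=0.95)
    print(f"{dataset}: unscreened {1 - screened:.0%}, WSS@95 {wss:.0%}")
```

WSS subtracts the 5% of relevant records that were given up, so it is always a little lower than the raw share of records skipped.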

When AI-Assisted Screening Works

ASREVIEW AND COCHRANE COVID-19 RESPONSE, 2020

During the COVID-19 pandemic, Cochrane needed to screen 50,000+ citations weekly to keep reviews current.

ASReview's active learning system was deployed under strict human oversight.

• Reduced human screening workload by 75%
• Missed fewer than 1% of relevant studies
• Validated at every stage by human reviewers

The key to success: human-in-the-loop validation at every stage. The AI prioritized, but humans made the final decisions and checked a sample of the records the AI had excluded.

Cochrane COVID-NMA consortium, 2020-2021

THE LESSON

AI augments human judgment. It does not replace it. Success comes from partnership, not automation.

When Internal Validation Fails

EPIC SEPSIS MODEL, JAMA INTERNAL MEDICINE, 2021

Epic Systems deployed a sepsis prediction algorithm to hundreds of hospitals across the United States.

Epic's internal validation showed excellent performance. Hospitals trusted it.

Then JAMA Internal Medicine published an external validation study.

• The model missed 67% of sepsis cases
• It triggered thousands of false alarms
• Nurses developed severe "alert fatigue"

The model had been validated on historical data from the same system and had never been tested in a real clinical setting.

Wong A et al. JAMA Intern Med. 2021;181(8):1065-1070

THE LESSON

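Internal validation is not external validation. A model that works in development can fail in deployment. Always validate in real-world conditions.

The Stopping Problem

The Hidden Risk

When do you stop active learning screening?

If you stop too early: you miss relevant studies
If you stop too late: you lose efficiency

The algorithm cannot tell you when it has found everything. It only ranks what remains.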

There is no perfect stopping rule. Every rule trades recall for efficiency.

CRITICAL POINT

You must validate your stopping rule by manually checking a random sample of unscreened records.
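Drawing that random sample reproducibly might look like the sketch below. The helper names, the fixed seed, and the record identifiers are assumptions for illustration. The sizing rule (10% of unscreened records, or at least 200) is the rule of thumb given in this course's validation section.

```python
import random
from typing import List, Sequence


def validation_sample_size(n_unscreened: int, percent: int = 10, minimum: int = 200) -> int:
    """10% of the unscreened records, but never fewer than `minimum` (capped at the total)."""
    ten_percent = -(-n_unscreened * percent // 100)  # integer ceiling division
    return min(n_unscreened, max(minimum, ten_percent))


def draw_validation_sample(unscreened_ids: Sequence[str], seed: int = 2024) -> List[str]:
    """Return a reproducible simple random sample of record ids to screen by hand."""
    k = validation_sample_size(len(unscreened_ids))
    return sorted(random.Random(seed).sample(list(unscreened_ids), k))


# Hypothetical record ids for 4,000 unscreened references.
unscreened = [f"rec{i:04d}" for i in range(4000)]
sample = draw_validation_sample(unscreened)
```

Recording the seed alongside the sample lets another reviewer regenerate exactly the same list.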

AI Screening Decision Tree

Should You Use AI Screening?

Large Reference Set?

↓

<500 refs

Manual OK: AI overhead is not worth it

500-2000 refs

AI helpful: Moderate efficiency gain

>2000 refs

AI essential: Major time savings

↓

Always validate with random sample: Report methodology in paper

"The machine finds the needles faster
but it cannot guarantee none remain in the haystack.
Trust the ranking, verify the stopping,
and always report what you did."

Have you not dreamed of the assistant
who reads every paper and fills every cell,
who never tires, never errs,
who extracts perfectly?

That assistant does not exist.

The Extraction Accuracy Problem

GPT-4 DATA EXTRACTION STUDY, 2024

Researchers tested GPT-4 on extracting data from 100 RCT papers.

Results:
• Sample sizes: 89% accurate
• Effect estimates: 76% accurate
• Confidence intervals: 71% accurate
• Risk of bias judgments: 62% agreement with humans

A 24% error rate in effect estimates means that roughly 1 in 4 studies would enter the meta-analysis with incorrect data.

Guo Y et al. J Clin Epidemiol. 2024;165:111203
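The arithmetic behind that warning can be made explicit. The sketch below assumes extraction errors are independent across studies and that the 76% effect-estimate accuracy applies uniformly; both are simplifying assumptions, and the function names are invented for this example.

```python
def expected_errors(n_studies: int, accuracy: float) -> float:
    """Expected number of studies with a wrong effect estimate."""
    return n_studies * (1.0 - accuracy)


def prob_at_least_one_error(n_studies: int, accuracy: float) -> float:
    """Probability that at least one study carries a wrong effect estimate."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return 1.0 - accuracy ** n_studies


# A hypothetical meta-analysis of 10 studies at 76% per-study accuracy.
p_any_error = prob_at_least_one_error(10, 0.76)
print(f"Expected wrong estimates: {expected_errors(10, 0.76):.1f}")
print(f"Chance of at least one wrong estimate: {p_any_error:.1%}")
```

Even a small review is therefore almost certain to contain at least one wrong number if the extraction is left unverified.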

The Fabrication Problem

GPT-4 HALLUCINATIONS IN SYSTEMATIC REVIEWS, 2023

Researchers tested GPT-4 for data extraction from systematic review papers. The model was given PDFs and asked to extract sample sizes, p-values, and effect estimates.

GPT-4 confidently provided all requested numbers with precise formatting.

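But 23% of the extractions were "hallucinated": numbers with no basis in the source text.

In one case, the model fabricated a statistically significant result (p=0.003) for a study that actually found no significant effect (p=0.42).

The model's stated confidence did not distinguish real data from fabricated data.

Systematic review AI validation study, 2023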

THE LESSON

LLMs require 100% human verification for quantitative data. There are no shortcuts. Every number must be checked against the source.

LLM Data Extraction Workflow

Safe LLM Extraction Protocol

PDF/Full Text

↓

LLM extracts data: Structured prompt

↓

Human verifies 100%: NOT sampling

↓

Discrepancy?

Yes

Human value used: Document error

No

Proceed: Log verification
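The verification step in this protocol can be logged mechanically. The sketch below compares an LLM extraction with the human-verified values field by field, always keeps the human value, and records every discrepancy. The class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VerificationEntry:
    field: str
    llm_value: Optional[str]
    human_value: Optional[str]
    discrepancy: bool
    final_value: Optional[str]  # always the human-verified value


def verify_extraction(
    llm: Dict[str, Optional[str]], human: Dict[str, Optional[str]]
) -> List[VerificationEntry]:
    """Build a verification log covering every field seen by either extractor."""
    log = []
    for name in sorted(set(llm) | set(human)):
        llm_value, human_value = llm.get(name), human.get(name)
        log.append(
            VerificationEntry(name, llm_value, human_value, llm_value != human_value, human_value)
        )
    return log


# Illustrative values only: the LLM reports a p-value the human reviewer does not find.
llm_output = {"n_intervention": "120", "p_value": "0.003"}
human_check = {"n_intervention": "120", "p_value": "0.42"}
log = verify_extraction(llm_output, human_check)
errors = [entry.field for entry in log if entry.discrepancy]
```

Keeping the log, rather than only the corrected values, is what makes the per-task error rate reportable later.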

Prompt Engineering for Extraction

# Example extraction prompt

Extract the following from this RCT:

1. Sample size (intervention arm): [number]
2. Sample size (control arm): [number]
3. Primary outcome definition: [text]
4. Effect estimate: [number with unit]
5. 95% CI: [lower, upper]
6. p-value: [number]

If not reported, write "NR"
If unclear, write "UNCLEAR: [reason]"

# Provide exact quotes for verification

When LLMs Help vs. Hurt

LLM Extraction Value Assessment

Standardized fields (author, year) ✓ High accuracy

Simple numeric (sample size) ✓ Usually reliable

Complex numeric (adjusted OR) ⚠ Often wrong model

Composite outcomes ⚠ Misses components

Intention-to-treat vs per-protocol ✗ Frequently confused

Subgroup data ✗ High error rate

"The LLM extracts plausible numbers,
not necessarily the correct numbers.
It is a fast draft, not a final answer.
Every cell must be verified by human eyes."

Have you not wished for a judge
who reads every methods section,
who assesses bias without bias,
who never disagrees with themselves?

RobotReviewer

MARSHALL ET AL., NATURE MACHINE INTELLIGENCE, 2019

RobotReviewer uses machine learning to assess risk of bias in RCTs.

Validation against Cochrane assessments:
• Random sequence generation: 71% agreement
• Allocation concealment: 65% agreement
• Blinding of participants: 69% agreement
• Blinding of outcome assessment: 62% agreement

Human inter-rater agreement is typically 70-80%.

RobotReviewer approaches but does not exceed human performance.

Marshall IJ et al. Nat Mach Intell. 2019;1:115-117
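Percent agreement, as quoted above, does not correct for agreement expected by chance. The sketch below computes both raw agreement and Cohen's kappa for two sets of risk-of-bias judgments. Kappa is not reported on this slide; it is shown as one common way to put machine-human and human-human agreement on the same footing. The labels are invented for illustration.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of items on which the two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters."""
    observed = percent_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments for four trials on one domain.
tool = ["low", "low", "high", "high"]
human = ["low", "high", "high", "high"]
agreement = percent_agreement(tool, human)
kappa = cohens_kappa(tool, human)
```

In this toy case raw agreement is 75% while kappa is 0.5, which illustrates why the two figures should not be compared directly.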

RoB Automation Decision Tree

When to Use Automated RoB

Risk of Bias Assessment

↓

Review Type?

Rapid review

Automated OK: Acknowledge limitation

Scoping review

Automated OK: If RoB included

Full systematic review

Preliminary only: Human verification required

Cochrane review

Human required: Draft support only

Limitations of Automated RoB

What Machines Cannot Assess

✗ Outcome-specific bias (RoB 2 domain 4)

✗ Selective reporting based on protocol comparison

✗ Contextual judgment (Is this design appropriate?)

✗ Cross-paper inconsistencies (multiple reports)

✗ Funding influence on the interpretation of results

THE FUNDAMENTAL LIMITATION

AI reads what is written. Bias assessment often requires judging what is not written.

Hybrid Workflow for RoB

Best Practice Protocol

Full Text PDFs

↓

RobotReviewer screening: Flags potential issues

↓

Reviewer 1 assesses: Using AI output as reference

↓

Reviewer 2 independently: Blinded to AI output

↓

Consensus meeting

↓

Final assessment: Human decision documented

"The robot reads the methods section
but cannot read between the lines.
Use it to flag, not to judge.
The verdict must be human."

Have you not wished for a writer
who drafts your protocol in minutes,
who knows every PRISMA item,
who writes in perfect academic prose?

LLMs for Protocol Drafting

Structure generation ✓

Boilerplate text ✓

PICO formulation ⚠

Search strategy ✗

THE VALUE PROPOSITION

LLMs can draft structure and standard language. They cannot make the scientific decisions.

The Search Strategy Risk

TESTED ACROSS MULTIPLE LLMs, 2023-2024

Researchers asked GPT-4 and Claude to generate MEDLINE search strategies.

Common errors:
• Invented MeSH terms that don't exist
• Wrong field codes (e.g., [tiab] vs [tw])
• Key concepts from the research question omitted
• Overly narrow strategies missing relevant studies
• Syntax errors that wouldn't execute

An information specialist must write or validate all search strategies.

Multiple validation studies, 2023-2024

Protocol Writing Decision Tree

LLM Use in Protocol Development

Protocol Section

↓

Background/Rationale

LLM helpful: Draft + fact-check

Methods structure

LLM helpful: Template generation

PICO criteria

Human decides: LLM refines wording

Search strategy

Human/Specialist: AI too unreliable

Safe LLM Protocol Workflow

Quality Assurance Steps

1 Define PICO yourself (human scientific decision)

2 Ask LLM to draft protocol sections

3 Verify all cited guidelines exist (PRISMA, Cochrane)

4 Write search strategy with information specialist

5 Check all methodological decisions are defensible

6 Disclose AI assistance in protocol

7 Register the human-verified version

"The machine can write the words
but it cannot make the decisions.
You define the question. You choose the methods.
The protocol is yours. The AI is the typist."

Have you not seen the systematic review
that was outdated before it was published,
while new trials accumulated in the literature,
unsynthesized, unknown?

The Living Review Problem

COVID-19 EVIDENCE TSUNAMI, 2020

In the first year of the pandemic:

• 100,000+ COVID papers published
• Traditional reviews obsolete within weeks
• Clinicians made decisions on incomplete evidence

The COVID-NMA consortium used AI-assisted surveillance to monitor new trials daily and update meta-analyses weekly.

This required: automated search monitoring, AI screening prioritization, rapid data extraction workflows, and continuous statistical updating.

Defined in Cochrane Living Reviews guidance

AI Components for Living Reviews

Automated Surveillance Stack

Living Review System

↓

Auto-search: Daily/weekly runs

AI triage: Priority screening

Rapid extraction: LLM-assisted

Auto-update: Cumulative MA

↓

Human oversight at each stage: Editorial review before publication

Tools for Continuous Monitoring

PubMed Alerts: Free email alerts, saved searches (Basic)

Epistemonikos: Systematic review database (AI-curated)

Covidence: Auto-import, living mode (Subscription)

DistillerSR: AI screening + monitoring (Enterprise)

Living Review Decision Framework

When to Make a Review "Living"

Should This Be Living?

↓

Criteria Check

Priority question: Clinical importance

Evidence evolving: Active trial pipeline

Resources secured: 2+ years of funding

↓

All three required for living status

"The machine watches the literature
while you sleep.
But someone must wake to judge
whether the new evidence changes the truth."

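If you use the machine without verification,
you cannot know what errors it made.

If you verify everything the machine produces,
what time have you saved?

The answer lies in strategic verification.

The Verification Paradox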

THE DILEMMA

Full verification = No time savings
No verification = Unknown error rate
Strategic verification = Validated efficiency

Verification Strategy by Risk

High-risk tasks

100% human review: Data extraction, RoB

Medium-risk tasks

Sample validation: Screening decisions

Low-risk tasks

Spot checks: Deduplication
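One way to encode this risk-tiered strategy is a small lookup that decides how many AI outputs go to a human for each task. The 100% tier comes from the slide. The 10% and 2% tiers, the task keys, and the function name are illustrative assumptions, not recommended values.

```python
import random
from typing import Dict, List, Sequence

# Percent of AI outputs sent for human checking, by risk tier.
# Only the 100% figure is taken from the slide; the others are placeholders.
VERIFICATION_PERCENT: Dict[str, int] = {"high": 100, "medium": 10, "low": 2}

TASK_RISK: Dict[str, str] = {
    "data_extraction": "high",
    "risk_of_bias": "high",
    "screening": "medium",
    "deduplication": "low",
}


def select_for_verification(task: str, item_ids: Sequence[str], seed: int = 0) -> List[str]:
    """Return the items a human should check for the given task."""
    percent = VERIFICATION_PERCENT[TASK_RISK[task]]
    if percent >= 100:
        return list(item_ids)
    k = min(len(item_ids), -(-len(item_ids) * percent // 100))  # integer ceiling
    return random.Random(seed).sample(list(item_ids), k)


ids = [f"item{i}" for i in range(1000)]
to_check_extraction = select_for_verification("data_extraction", ids)
to_check_screening = select_for_verification("screening", ids)
to_check_dedup = select_for_verification("deduplication", ids)
```

Whatever fractions are chosen, they belong in the protocol before the review starts, not after the error rates are known.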

When Oversight Catches Bias

COCHRANE MACHINE LEARNING PILOT, 2022

Cochrane tested ML-assisted risk of bias assessment to accelerate systematic reviews.

The algorithm agreed with human reviewers 85% of the time, which seemed impressive.

But when the QA team analyzed the 15% of disagreements, they found a pattern:

The AI was systematically biased toward rating industry-funded trials as low risk.

The training data contained more "low risk" labels for pharmaceutical company trials. The algorithm learned this correlation without understanding the underlying methodological concerns.

Human oversight caught the pattern before any biased reviews were published.

Cochrane Methods Group pilot study, 2022

THE LESSON

Analyzing the disagreements revealed a systematic bias. High overall accuracy can hide dangerous patterns. Always analyze where and how the AI fails, not just how often.
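A stratified agreement check is one simple way to look at where the AI fails. The sketch below breaks agreement down by a subgroup such as funding source. The data are invented so that overall agreement is 85% while one subgroup is clearly worse; they are not the pilot's data, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def agreement_by_subgroup(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Agreement rate per subgroup; each record is (subgroup, ai_label, human_label)."""
    totals: Dict[str, int] = defaultdict(int)
    matches: Dict[str, int] = defaultdict(int)
    for subgroup, ai_label, human_label in records:
        totals[subgroup] += 1
        matches[subgroup] += int(ai_label == human_label)
    return {group: matches[group] / totals[group] for group in totals}


# Invented example: 20 trials, 17 agreements overall (85%).
records = (
    [("industry", "low", "low")] * 7
    + [("industry", "low", "high")] * 3  # AI says low risk, human says high
    + [("public", "low", "low")] * 5
    + [("public", "high", "high")] * 5
)
overall = sum(ai == human for _, ai, human in records) / len(records)
by_funder = agreement_by_subgroup(records)
```

Here the headline figure looks acceptable, but every disagreement sits in one subgroup and points in the same direction, which is the pattern a single accuracy number hides.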

QA Framework for AI-Assisted Reviews

Minimum Quality Standards

1 Pre-specify AI use in protocol (which tools, which tasks)

2 Document AI settings (model version, prompts, parameters)

3 Validate screening with random sample (calculate recall estimate)

4 Verify all extracted data against source documents

5 Human RoB assessment (AI as preliminary only)

6 Track error rates per AI task

7 Report transparently in methods section
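Standards 2, 6, and 7 are easier to meet if AI use is recorded in a structured form from the start. The sketch below is one possible record layout serialized to JSON. The field names are assumptions, and the example values are placeholders rather than results from any real review.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class AIUsageRecord:
    tool: str
    version: str
    date_used: str  # ISO date
    task: str
    settings: dict = field(default_factory=dict)
    validation: str = ""
    observed_error_rate: Optional[float] = None
    human_oversight: str = ""


def to_json(records: List[AIUsageRecord]) -> str:
    """Serialize the usage log for the review's supplementary material."""
    return json.dumps([asdict(record) for record in records], indent=2, sort_keys=True)


# Placeholder entry; every value here is hypothetical.
record = AIUsageRecord(
    tool="ASReview LAB",
    version="1.2",
    date_used="2024-01-15",
    task="title/abstract screening",
    settings={"stopping_rule": "150 consecutive irrelevant"},
    validation="random sample of 300 unscreened records screened manually",
    observed_error_rate=None,
    human_oversight="all inclusion decisions made by a human reviewer",
)
usage_log = to_json([record])
```

A log like this can be dropped into the methods section or an appendix with little extra work.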

Reporting AI Use (PRISMA-S)

What to Report in Your Paper

• Which AI tools were used (name, version, date)
• Which tasks were AI-assisted
• What validation was performed
• What error rates were observed
• What human oversight was maintained
• Any deviations from protocol due to AI limitations

EMERGING STANDARD

Journals increasingly require AI use statements. PRISMA-S extension for search reporting includes automation.

The Complete AI-MA Workflow

Integrated Human-AI Process

Protocol (Human + LLM draft)

↓

Search (Human/Specialist)

↓

Screening (AI prioritize + Human decide)

↓

Extraction (LLM draft + Human verify 100%)

↓

RoB (AI flag + Human assess)

↓

Analysis (Human)

↓

Interpretation (Human)

"The machine is neither colleague nor replacement.
It is a tool: powerful, fast, and error-prone.
Document what you used. Validate what it produced.
The responsibility is yours."

Have you not considered
whose labor trained the model,
whose data it consumed without consent,
whose jobs it may displace?

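The Hidden Labor

KENYAN DATA LABELERS, TIME MAGAZINE 2023

ChatGPT was made "safe" through a process called RLHF: Reinforcement Learning from Human Feedback.

The humans providing that feedback were workers in Kenya, paid less than $2 per hour to read and label toxic, violent, and disturbing content.

They suffered psychological trauma from the work.

Every AI tool you use depends on human labor.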

Perrigo B. Time Magazine. 2023 Jan 18.

Automating Inequality

UK A-LEVEL ALGORITHM SCANDAL, 2020

When COVID-19 forced the cancellation of A-level exams in the UK, the government used an algorithm to predict student grades based on historical school performance.

The results:

• Students from disadvantaged schools were systematically downgraded
• Students from private schools were upgraded
• The algorithm, not the individual student, determined who succeeded

After massive public outcry, 40% of grades were revised.

The algorithm encoded historical inequality as prediction. Schools that had historically sent fewer students to university were penalized regardless of individual student ability.

UK Office of Qualifications and Examinations Regulation, 2020

THE LESSON

AI can automate bias at scale when historical data reflects systemic inequality. Algorithms trained on that data perpetuate and amplify it.

An Ethical Framework for AI in Research

Questions to Ask

1 Transparency: Can I fully disclose how AI was used?

2 Accountability: Who is responsible for AI errors?

3 Equity: Does AI access create research inequities?

4 Labor: Whose work made this tool possible?

5 Environment: What is the carbon cost of model training?

6 Reproducibility: Can others replicate my AI-assisted work?

Authorship and AI

ICMJE POSITION

AI tools cannot be listed as authors.

Authors must take responsibility for AI-generated content.

AI use must be disclosed in methods or acknowledgments.

YOUR RESPONSIBILITY

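If the AI hallucinates and you publish it, the responsibility is yours: not OpenAI's, not Anthropic's, not the tool's.

"The machine has no conscience.
It does not care whether the data is true.
It does not know who was harmed to train it.
You must be the conscience it lacks."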

The Road Ahead

Where AI in Evidence Synthesis Is Heading

Emerging Capabilities

Multimodal AI: Extract from figures/tables (2024-2025)

Agent Systems: Multi-step workflows (Emerging)

RAG Systems: Retrieval-augmented generation (Active research)

Fine-tuned Models: MA-specific training (In development)

What Will Not Change

Enduring Human Requirements

★ Defining the research question (clinical judgment)

★ Interpreting clinical significance (domain expertise)

★ Assessing applicability (contextual knowledge)

★ Making recommendations (value judgments)

★ Taking responsibility (ethical accountability)

THE CONSTANT

AI accelerates the mechanics.
The science remains human.

Preparing for the Future

Skills to Develop

Future-Ready Researcher

↓

Prompt engineering: Getting good AI outputs

Validation methods: Knowing when AI errs

Core methods: AI cannot replace

↓

The best AI users are the best methodologists: Understanding enables oversight

"The machine grows stronger each year.
But the questions remain the same.
What is true? What helps patients?
AI can assist the search.
Only you can provide the answer."

Test Your Knowledge

What is the main limitation of using LLMs for data extraction?

They are too slow

They can generate plausible but incorrect data (hallucinations)

They cannot read PDFs

They are too expensive

When using AI screening (e.g., ASReview), what must you always do?

Trust the AI completely after training

Screen only the top 10% of ranked records

Validate the stopping rule with a random sample

Use multiple AI tools simultaneously

For which task should AI never be the final decision-maker?

Deduplication

Screening prioritization

Clinical interpretation of results

Reference formatting

References

Key Sources

Van de Schoot R et al. Nat Mach Intell. 2021;3:125-133. [ASReview]
Marshall IJ et al. Nat Mach Intell. 2019;1:115-117. [RobotReviewer]
Guo Y et al. J Clin Epidemiol. 2024;165:111203. [GPT-4 extraction]
Mata v. Avianca, 22-cv-1461 (S.D.N.Y. 2023). [Hallucination case]
Perrigo B. Time Magazine. 2023 Jan 18. [AI labor ethics]
Elliott JH et al. J Clin Epidemiol. 2017;91:23-30. [Living reviews]
Cochrane Handbook 2023. Chapter on automation.
ICMJE. Recommendations on AI authorship. 2023.
Rethlefsen ML et al. J Med Libr Assoc. 2021. [PRISMA-S]
Wang S et al. Syst Rev. 2023;12:178. [AI screening validation]

✔

Course Complete

"You now know the Silicon Scribe,
its powers and its limits.
Use it to accelerate, not to replace.
Validate what it produces.
Document what you did.
And always remember:
the machine predicts the next word.
You must judge whether that word is true."

ASReview: Step-by-Step Tutorial

From installation to stopping decision

Step 1: Installation

# Option A: Python pip (recommended)
pip install asreview

# Option B: Download the desktop app
# https://asreview.nl/download/

# Launch ASReview LAB
asreview lab

REQUIREMENTS

• Python 3.8+ (for pip install)
• OR: Windows/Mac desktop app (no Python needed)
• Your references in RIS, CSV, or EndNote XML format

Step 2: Create Project & Import

Project Setup Workflow

New Project

↓

Name the project: Descriptive, include date

↓

Import references: RIS/CSV/XML file

↓

ASReview deduplicates: Check count matches expected

↓

Ready for prior knowledge

Step 3: Add Prior Knowledge

CRITICAL STEP

The model learns from your initial decisions.
You need both relevant and irrelevant examples.

Prior Knowledge Strategy

1 Add 5-10 known relevant studies (from scoping search)

2 Search for clearly irrelevant topics (random sample)

3 Mark 10-20 irrelevant as negative examples

4 Aim for ~1:2 ratio (relevant:irrelevant) to start

WARNING

Poor prior knowledge = poor model performance.
Garbage in, garbage out.

Step 4: Screen with Active Learning

Screening Loop

ASReview presents record

↓

Your decision

Relevant: Include for full text

Irrelevant: Exclude

↓

Model updates: Re-ranks remaining

↓

Next most likely relevant: Repeat until stopping rule
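The loop above can be imitated with a toy model to show the mechanics. The sketch below is not ASReview's classifier: it scores records with simple word weights learned from the reviewer's decisions, always presents the highest-scoring record next, and stops after a run of irrelevant records. All names and records are invented for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Set


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def screen(
    records: Dict[str, str],
    prior: Dict[str, bool],
    reviewer: Callable[[str], bool],
    stop_after: int,
) -> List[str]:
    """Return record ids in the order they were shown to the reviewer."""
    weights: Counter = Counter()

    def learn(record_id: str, relevant: bool) -> None:
        # Words from relevant records gain weight; words from irrelevant ones lose it.
        for token in tokenize(records[record_id]):
            weights[token] += 1 if relevant else -1

    for record_id, label in prior.items():
        learn(record_id, label)

    remaining = [record_id for record_id in records if record_id not in prior]
    shown: List[str] = []
    consecutive_irrelevant = 0
    while remaining and consecutive_irrelevant < stop_after:
        # Re-rank: pick the record whose words currently look most relevant.
        best = max(remaining, key=lambda r: sum(weights[t] for t in tokenize(records[r])))
        remaining.remove(best)
        label = reviewer(best)  # the human decision
        learn(best, label)      # model update
        shown.append(best)
        consecutive_irrelevant = 0 if label else consecutive_irrelevant + 1
    return shown


records = {
    "r1": "sepsis prediction randomized trial",
    "r2": "randomized trial of sepsis antibiotics",
    "r3": "soil chemistry of vineyards",
    "r4": "sepsis biomarkers randomized cohort",
    "r5": "vineyard irrigation soil survey",
    "r6": "history of vineyards",
}
truly_relevant = {"r1", "r2", "r4"}
order = screen(records, prior={"r1": True, "r3": False},
               reviewer=lambda r: r in truly_relevant, stop_after=1)
```

In this toy data the two remaining relevant records are presented before any irrelevant one, and screening stops with one record never shown. That last record is exactly what a validation sample is meant to cover.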

Step 5: Stopping Decision

Stopping Rules Compared

Consecutive irrelevant (50-200) Common, but no recall guarantee

% of total screened (e.g., 50%) Predictable effort, variable recall

All records screened 100% recall, no time savings

Statistical stopping (Busfelder) Evidence-based, requires plugin

VALIDATION REQUIREMENT

After stopping: manually screen random sample of unscreened records.
Report estimated recall with confidence interval.

"The tool is simple, but the decisions are not.
Feed it good examples. Check when you stop.
Export the project file. It is your audit trail."

Prompt Engineering Library

Validated prompts for meta-analysis tasks

Prompt Principles

For Reliable LLM Outputs

1 Be specific: Define exact fields and formats

2 Provide examples: Show expected output format

3 Request uncertainty: Ask for "NR" or "UNCLEAR" flags

4 Demand quotes: Require source text for verification

5 Limit scope: One task per prompt, not everything at once

Prompt 1: RCT Data Extraction

Extract the following from this RCT. For each field, provide:
- The value
- The exact quote from the paper (in quotation marks)
- "NR" if not reported, "UNCLEAR" if ambiguous

FIELDS:
1. Intervention group sample size (ITT): [n]
2. Control group sample size (ITT): [n]
3. Primary outcome definition: [text]
4. Primary outcome: intervention events/total: [x/n]
5. Primary outcome: control events/total: [x/n]
6. Risk ratio (95% CI): [RR (lower, upper)]
7. Follow-up duration: [weeks/months]

OUTPUT FORMAT: JSON with "value" and "quote" for each field
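Asking for quotes only helps if the quotes are checked. The sketch below parses output in the JSON shape requested by Prompt 1 and tests whether each quote actually occurs in the paper's text. The function name, status labels, and example text are assumptions. A quote that is found still needs a human to confirm the value, so this is a triage aid, not a substitute for verification.

```python
import json
import re
from typing import Dict


def _normalize(text: str) -> str:
    """Collapse whitespace and case so line breaks in the PDF text do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_quotes(llm_json: str, source_text: str) -> Dict[str, str]:
    """Label each field: quote_found, quote_not_found, or needs_human_check (NR/UNCLEAR)."""
    source = _normalize(source_text)
    statuses: Dict[str, str] = {}
    for field_name, entry in json.loads(llm_json).items():
        value = str(entry.get("value", "")).strip()
        quote = _normalize(str(entry.get("quote", "")))
        if value == "NR" or value.startswith("UNCLEAR"):
            statuses[field_name] = "needs_human_check"
        elif quote and quote in source:
            statuses[field_name] = "quote_found"
        else:
            statuses[field_name] = "quote_not_found"
    return statuses


# Invented paper text and LLM output for illustration.
paper_text = (
    "A total of 120 participants were randomized to the intervention arm. "
    "The risk ratio was 0.80 (95% CI 0.65 to 0.98)."
)
llm_output = json.dumps({
    "n_intervention": {"value": "120",
                       "quote": "120 participants were randomized to the intervention arm"},
    "risk_ratio": {"value": "0.70", "quote": "The risk ratio was 0.70"},
    "follow_up": {"value": "NR", "quote": ""},
})
statuses = check_quotes(llm_output, paper_text)
```

Fields marked `quote_not_found` are the first candidates for a hallucinated number.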

Prompt 2: Study Characteristics

Extract study characteristics. Provide exact quotes for verification.

FIELDS:
1. Study design: [RCT / Cluster RCT / Crossover / Other]
2. Country/countries: [list]
3. Setting: [hospital / primary care / community / other]
4. Recruitment period: [start date - end date]
5. Funding source: [text]
6. Trial registration: [ID number or "NR"]
7. Conflicts of interest declared: [Yes/No/NR]

If information is in supplementary materials, note "See Supplement".
If truly not reported anywhere, mark "NR".

Prompt 3: Population Characteristics

Extract baseline population characteristics.
Report separately for the intervention and control groups.

FIELDS (per group):
1. N randomized: [n]
2. N analyzed: [n]
3. Age: [mean (SD) or median (IQR)]
4. Sex (% female): [%]
5. Key inclusion criteria: [text]
6. Key exclusion criteria: [text]
7. Disease severity at baseline: [measure and value]

NOTE: If groups combined only, report combined with note.

Prompt 4: Risk of Bias Screening

NOTE: This is for PRELIMINARY flagging only.
Human assessment required for final judgment.

For each RoB 2 domain, identify relevant text:

D1 Randomization:
- Sequence generation method: [quote or NR]
- Allocation concealment method: [quote or NR]

D2 Deviations:
- Blinding of participants: [quote or NR]
- Blinding of personnel: [quote or NR]

D3 Missing data:
- Attrition rates: [intervention: x%, control: y%]
- Handling of missing data: [quote or NR]

DO NOT make judgments. Only extract quotes.

"A prompt is a contract with the machine.
State precisely what you ask.
Demand evidence for every answer.
Verify every output against the source."

You may never write a systematic review.
But you will read them.

How will you know whether the AI assistance
was done well or poorly?

The IBM Watson Oncology Failure

MD ANDERSON CANCER CENTER, 2017

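IBM Watson for Oncology was trained to recommend cancer treatments.

After spending $62 million, MD Anderson cancelled the project.

Internal documents showed Watson made "unsafe and incorrect" treatment recommendations. It had been trained on synthetic cases, not real patient data.

The AI appeared confident. The recommendations were dangerous.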

Lesson: AI confidence ≠ AI correctness

STAT News investigation, 2017; IEEE Spectrum 2019

Questions to Ask of AI-Assisted Reviews

What to Look for in the Methods

1 Did they name the AI tools used? (version, date)

2 Did they specify which tasks were AI-assisted?

3 Did they validate AI outputs? How?

4 For AI screening: What stopping rule? What estimated recall?

5 For AI extraction: Was it 100% human verified?

6 Was there human oversight of all AI decisions?

Red Flags in AI-Assisted Reviews

Warning Signs

"AI screened all titles" No human involvement?

"GPT extracted the data" No verification mentioned?

"Stopped after 500 consecutive irrelevant" No recall estimate?

"AI-generated protocol" Human decisions unclear?

No AI tools mentioned but clearly AI-written Hidden AI use

For Patients and Clinicians

What You Need to Know

Good AI use: Speeds up the work, human verifies
Bad AI use: Replaces human judgment, no validation

An AI-assisted review can be trustworthy—if done right.

Simple Questions to Ask

? "Was AI used in this review?"

? "Did humans check the AI's results?"

? "Could AI have missed important studies?"

"AI assistance is not a flaw—it is often an advantage.
But only if validated, only if disclosed.
Ask: did they check the machine?
If the answer is unclear, so is the review."

Have you not considered the researcher
with unstable internet, limited compute,
no institutional subscription,
who still needs to synthesize evidence?

Free and Offline-Capable Tools

ASReview: Desktop app, works offline (FREE)

Abstrackr: Web-based, free accounts (FREE)

Rayyan: Free tier, limited AI (FREEMIUM)

RevMan: Cochrane tool, full MA software (FREE)

Offline Workflow

When Internet is Unreliable

Search Phase

↓

Library/café, download all PDFs: Batch download when connected

↓

Screening Phase

↓

ASReview desktop: Works fully offline

↓

Extraction Phase

↓

Spreadsheet + local PDFs: No AI needed

Low-Cost LLM Alternatives

WHEN API COSTS ARE PROHIBITIVE

• Claude/ChatGPT free tiers: Limited but functional
• Ollama + local models: Free, runs on laptop (requires download)
• Hugging Face inference: Free tier available
• Manual extraction: Still gold standard, just slower

HONEST ASSESSMENT

AI is a convenience, not a necessity.
All Cochrane reviews were done without AI.
Quality comes from the method, not the tool.

Resource-Limited Decision Tree

Choosing Your Approach

Your Resources

↓

Internet reliability?

Stable

Web tools OK: Rayyan, Covidence

Unreliable

Desktop tools: ASReview offline

None

Manual + spreadsheets: Still valid

"Evidence belongs to everyone,
not only to those with fast internet and paid subscriptions.
The tools may differ. The methods stay the same.
Quality synthesis is possible anywhere."

Validation Calculations

Sample Size for AI Validation

Estimating Recall After AI Screening

THE PROBLEM

You stopped after screening 1,000 of 5,000 records.
How confident are you that you found all the relevant studies?

Validation Sampling

Unscreened records (n=4000)

↓

Random sample (n=400): 10% or at least 200

↓

Manual screening

0 relevant found: Recall ≈ 95-100%

Relevant found: Screen all remaining
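Finding zero relevant records in the sample does not mean zero remain. The sketch below turns a clean sample into an upper bound on missed records and a lower bound on recall, using the exact binomial bound for zero observed events. It treats the sample as binomial and ignores the finite-population correction. The count of 300 relevant records already found is an assumed value, since the slide does not give one.

```python
def prevalence_upper_bound(sample_size: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on prevalence when 0 relevant records appear in the sample."""
    if sample_size <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("sample_size must be positive and confidence in (0, 1)")
    return 1.0 - (1.0 - confidence) ** (1.0 / sample_size)


def recall_lower_bound(
    n_relevant_found: int, n_unscreened: int, sample_size: int, confidence: float = 0.95
) -> float:
    """Lower bound on recall given a validation sample with no relevant records."""
    max_missed = prevalence_upper_bound(sample_size, confidence) * n_unscreened
    return n_relevant_found / (n_relevant_found + max_missed)


# Slide figures: 4,000 unscreened, sample of 400, 0 relevant found.
# ASSUMPTION: 300 relevant records were found during screening.
p_upper = prevalence_upper_bound(400)
recall_floor = recall_lower_bound(n_relevant_found=300, n_unscreened=4000, sample_size=400)
print(f"Prevalence among unscreened: at most {p_upper:.2%}")
print(f"Recall: at least {recall_floor:.1%}")
```

The bound depends strongly on how many relevant records were already found, so a recall estimate should always be reported together with that count and the sample size.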

Sample Size Formula

For 95% Confidence in Recall

                    n = ln(1 - confidence) / ln(1 - prevalence)

Example:

                    If prevalence of relevant = 1% (0.01)

                    For 95% confidence (0.95):

                    n = ln(1 - 0.95) / ln(1 - 0.01)

                    n = ln(0.05) / ln(0.99)

                    n ≈ 299 records to sample
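The formula follows from requiring that a sample of n records contain at least one relevant record with the chosen confidence whenever the true prevalence is p, that is, 1 - (1 - p)^n ≥ confidence. A small helper makes it reusable; the function name is an assumption, and the result is rounded up to a whole record.

```python
import math


def sample_size_to_detect(prevalence: float, confidence: float = 0.95) -> int:
    """Smallest n with 1 - (1 - prevalence)**n >= confidence."""
    if not (0.0 < prevalence < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("prevalence and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - prevalence))


for prevalence in (0.005, 0.01, 0.02, 0.05):
    print(f"prevalence {prevalence:.1%}: sample {sample_size_to_detect(prevalence)} records")
```

Lower assumed prevalence demands a larger sample, which is why a conservative floor is useful when the true prevalence among unscreened records is unknown.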

Quick Reference Table

Sample Size for Validation

Prevalence 0.5%, 95% conf 598 records

Prevalence 1%, 95% conf 299 records

Prevalence 2%, 95% conf 149 records

Prevalence 5%, 95% conf 59 records

Practical minimum 200 records (conservative)

Reporting Your Validation

Example methods text:

"We used ASReview LAB (v1.2) for title/abstract screening with
active learning. Screening ceased after 150 consecutive
irrelevant records, having screened 1,247 of 4,892 records
(25%). To validate recall, we manually screened a random
sample of 300 unscreened records. No additional relevant
studies were identified, suggesting estimated recall exceeds 95%
(binomial 95% CI: 91-100%)."

"Validation is not optional; it is the price of efficiency.
Calculate your sample. Screen it manually.
Report what you found. Acknowledge what you may have missed."