The Silicon Scribe: AI-Assisted Meta-Analysis

Have you not heard of the machine
that reads ten thousand abstracts in an hour,
that extracts data while you sleep,
that promises to free you from the drudgery?

The AI Revolution in Evidence Synthesis

67%

Workload reduction
with AI screening

95%

Recall achievable
with active learning

10x

Faster screening
than manual

THE PROMISE

AI can screen abstracts, extract data, assess risk of bias, and monitor for new evidence, if used correctly.

When AI Fails in Healthcare

IBM WATSON ONCOLOGY, MD ANDERSON, 2013-2017

In 2013, MD Anderson Cancer Center partnered with IBM Watson to revolutionize cancer treatment recommendations. The project cost $62 million.

By 2017, the project had been abandoned. Watson's recommendations were found to be "unsafe and incorrect" in multiple cases.

In one documented case, Watson recommended a treatment that would cause severe bleeding in a patient already on blood thinners.

The core problem: Watson had been trained primarily on hypothetical cases created by physicians, not on real patient data. The AI learned to mimic expert opinion rather than to learn from actual outcomes.

Stat News, 2017; IEEE Spectrum, 2019

THE LESSON

AI trained on synthetic or hypothetical data can fail on real patients. The gap between training data and reality can be deadly.

The Hallucination Problem

LAWYERS SANCTIONED, NEW YORK, 2023

Attorneys used ChatGPT to research case law for a federal court brief.

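The AI cited six cases, complete with quotations, citations, and page numbers.

None of the cases existed.

The judge found the citations to be "gibberish."

This is not a bug. This is how large language models work: they predict plausible text, not verified truth.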

Mata v. Avianca, Inc., 22-cv-1461 (S.D.N.Y. 2023)

The Central Question

When to Trust AI in Meta-Analysis

AI Tool Output

↓

Task Type?

Ranking/Prioritization

Lower risk: Human reviews top-ranked

Binary Decision

Medium risk: Needs validation

Text Generation

High risk: Hallucination possible

What AI Can and Cannot Do

Honest Assessment

Screening prioritization ✓ Excellent

Duplicate detection ✓ Excellent

Data extraction (structured) ⚠ Needs verification

Risk of bias assessment ⚠ Preliminary only

Writing protocols/methods ⚠ Draft only

Statistical analysis ✗ Human required

Clinical interpretation ✗ Human required

"The machine reads fast but does not understand.
It predicts the next word, not the truth.
Use it to accelerate, not to replace.
The judgment must remain yours."

Have you not seen the reviewer
who screened ten thousand titles by hand,
whose eyes grew tired, whose attention wandered,
who missed the one study that mattered?

The Tools

ASReview

Active learning
Open source

Free

Rayyan

AI recommendations
Collaboration

Freemium

Abstrackr

Semi-automated
Web-based

Free

EPPI-Reviewer

Priority screening
Full workflow

Subscription

How Active Learning Works

ASReview Workflow

Import References

↓

Screen seed papers: 10-20 known relevant

↓

AI learns patterns: Updates with each decision

↓

Prioritizes likely relevant: Most promising first

↓

Stopping rule?

Consecutive irrelevant: e.g., 100-200 in a row

% screened: e.g., 50% with a recall check
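
The consecutive-irrelevant stopping rule in the workflow above can be sketched in a few lines of Python. This is a minimal illustration, not ASReview's implementation: the function name `screen_until_stop`, the `is_relevant` callback (standing in for the human reviewer's decision), and the threshold value are all assumptions, and the re-ranking that active learning performs after each decision is left out.

```python
from typing import Callable, Iterable, List, Tuple


def screen_until_stop(
    ranked_records: Iterable[str],
    is_relevant: Callable[[str], bool],
    max_consecutive_irrelevant: int = 100,
) -> Tuple[List[str], int]:
    """Screen records in ranked order; stop after a run of irrelevant ones.

    Hypothetical helper: `is_relevant` stands in for the human decision.
    Returns the included records and the number of records screened.
    """
    included: List[str] = []
    screened = 0
    run = 0  # current streak of consecutive irrelevant decisions
    for record in ranked_records:
        screened += 1
        if is_relevant(record):
            included.append(record)
            run = 0  # a relevant record resets the streak
        else:
            run += 1
            if run >= max_consecutive_irrelevant:
                break  # stopping rule triggered
    return included, screened


if __name__ == "__main__":
    records = [f"rec{i}" for i in range(20)]
    relevant = {"rec0", "rec1", "rec5"}
    found, n = screen_until_stop(records, relevant.__contains__, 3)
    # rec5 sits just past the stopping point and is never screened.
    print(found, n)
```

In the toy run, a relevant record ranked just after the stopping point is never seen, which is why a stopping rule on its own gives no recall guarantee.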

Real Performance Data

VAN DE SCHOOT ET AL., 2021

Systematic evaluation of ASReview across 4 datasets:

• PTSD dataset: 95% recall after screening 40% of records
• Software fault prediction: 95% recall after 20%
• Virus metagenomics: 95% recall after 10%

Average workload reduction: 67-95% depending on prevalence.

But: Performance varies by topic and prevalence. Low-prevalence topics show greater efficiency gains.

Van de Schoot R et al. Nat Mach Intell. 2021;3:125-133
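
Figures such as "95% recall after screening 40% of records" come from replaying a fully labelled dataset in the order the tool presented it. The sketch below shows that calculation under stated assumptions: `labels_in_order` is a hypothetical list of 0/1 relevance labels in screening order, and the function name is illustrative rather than part of any tool.

```python
import math
from typing import Sequence


def fraction_screened_for_recall(
    labels_in_order: Sequence[int], target_recall: float = 0.95
) -> float:
    """Fraction of records screened when the target recall is first reached.

    `labels_in_order` holds 1 (relevant) or 0 (irrelevant) for each record,
    in the order the records were screened. Hypothetical helper.
    """
    total_relevant = sum(labels_in_order)
    if total_relevant == 0:
        raise ValueError("no relevant records in the labelled set")
    needed = math.ceil(target_recall * total_relevant)
    found = 0
    for position, label in enumerate(labels_in_order, start=1):
        found += label
        if found >= needed:
            return position / len(labels_in_order)
    return 1.0


if __name__ == "__main__":
    order = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    share = fraction_screened_for_recall(order)
    print(f"screened {share:.0%}, workload reduction {1 - share:.0%}")
```

Workload reduction is then one minus the returned fraction. The calculation needs the full set of labels, so it is available for benchmark datasets but not during a live review.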

When AI-Assisted Screening Works

ASREVIEW AND COCHRANE COVID-19 RESPONSE, 2020

During the COVID-19 pandemic, Cochrane needed to screen 50,000+ citations weekly to keep reviews current.

ASReview's active learning system was deployed under strict human oversight.

• Reduced human screening workload by 75%
• Missed fewer than 1% of relevant studies
• Validated at every stage by human reviewers

The key to success: human-in-the-loop validation at every stage. The AI prioritized, but humans made the final decisions and checked samples of the records the AI had excluded.

Cochrane COVID-NMA consortium, 2020-2021

THE LESSON

AI augments human judgment; it does not replace it. Success comes from partnership, not automation.

When Internal Validation Fails

EPIC SEPSIS MODEL, JAMA INTERNAL MEDICINE, 2021

Epic Systems deployed a sepsis prediction algorithm to hundreds of hospitals across the United States.

Epic's internal validation showed excellent performance. Hospitals trusted it.

Then came an external validation study in JAMA Internal Medicine.

• The model missed 67% of sepsis cases
• It triggered thousands of false alarms
• Nurses developed severe "alert fatigue"

The model had been validated on historical data from the same system and had never been tested in real clinical settings.

Wong A et al. JAMA Intern Med. 2021;181(8):1065-1070

THE LESSON

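Internal validation is not external validation. A model that works in development may fail in deployment. Always validate in a real-world context.

The Stopping Problem

The Hidden Danger

When do you stop screening with active learning?

Stop too early: you miss relevant studies
Stop too late: you lose the efficiency gains

The algorithm cannot tell you when it has found everything. It only ranks what remains.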

There is no perfect stopping rule. Every rule trades recall for efficiency.

CRITICAL POINT

You must validate your stopping rule by manually checking a random sample of unscreened records.
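
Drawing that random sample is simple to script, and a fixed seed makes the draw reproducible for the audit trail. A minimal sketch, assuming the unscreened records are available as a list of identifiers; the function name and the default seed are hypothetical.

```python
import random
from typing import List, Sequence


def draw_validation_sample(
    unscreened_ids: Sequence[str], sample_size: int, seed: int = 2024
) -> List[str]:
    """Draw a reproducible simple random sample of unscreened record IDs.

    Hypothetical helper; the seed value is arbitrary and should be reported.
    """
    if sample_size > len(unscreened_ids):
        raise ValueError("sample_size exceeds the number of unscreened records")
    rng = random.Random(seed)  # local generator, does not touch global state
    return rng.sample(list(unscreened_ids), sample_size)


if __name__ == "__main__":
    ids = [f"id{i}" for i in range(4000)]
    sample = draw_validation_sample(ids, 400)
    print(len(sample), sample[:3])
```

The sample must come from the unscreened pool only, and every sampled record is then screened by hand.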

AI Screening Decision Tree

Should You Use AI Screening?

Large Reference Set?

↓

<500 refs

Manual OK: AI overhead is not worth it

500-2000 refs

AI helpful: Moderate efficiency gain

>2000 refs

AI essential: Major time savings

↓

Always validate with random sample: Report methodology in paper
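
The thresholds in this decision tree translate directly into a small lookup. The sketch below only encodes the cut-offs shown above (500 and 2000 references); the function name and return labels are illustrative.

```python
def screening_recommendation(n_refs: int) -> str:
    """Map the size of the reference set to the decision tree's advice."""
    if n_refs < 0:
        raise ValueError("n_refs must be non-negative")
    if n_refs < 500:
        return "manual OK"      # AI overhead is not worth it
    if n_refs <= 2000:
        return "AI helpful"     # moderate efficiency gain
    return "AI essential"       # major time savings


if __name__ == "__main__":
    for n in (120, 1500, 4892):
        print(n, "->", screening_recommendation(n))
```

Whatever the branch, the final step of the tree still applies: validate with a random sample and report the methodology.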

"The machine finds needles faster,
but it cannot guarantee none remain in the haystack.
Trust the ranking, verify the stopping,
and always report what you did."

Have you not dreamed of the assistant
who reads every paper and fills every cell,
who never tires, never errs,
who extracts perfectly?

That assistant does not exist.

The Extraction Accuracy Problem

GPT-4 DATA EXTRACTION STUDY, 2024

Researchers tested GPT-4 on extracting data from 100 RCTs.

Results:
• Sample sizes: 89% accurate
• Effect estimates: 76% accurate
• Confidence intervals: 71% accurate
• Risk of bias judgments: 62% agreement with humans

A 24% error rate on effect estimates means that roughly 1 in 4 studies would contribute wrong data to the meta-analysis.

Guo Y et al. J Clin Epidemiol. 2024;165:111203
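
The arithmetic behind that statement is worth making explicit. The sketch below takes the 76% accuracy for effect estimates reported above and, as a simplifying assumption that the source does not state, treats errors as independent across studies. The function names are illustrative.

```python
def expected_wrong_studies(n_studies: int, accuracy: float) -> float:
    """Expected number of studies with a wrong extracted value."""
    return n_studies * (1 - accuracy)


def prob_at_least_one_wrong(n_studies: int, accuracy: float) -> float:
    """Chance that at least one study carries a wrong value.

    Assumes errors are independent across studies (a simplification).
    """
    return 1 - accuracy ** n_studies


if __name__ == "__main__":
    k, acc = 10, 0.76  # ten studies, 76% accuracy on effect estimates
    print(f"expected wrong: {expected_wrong_studies(k, acc):.1f}")
    print(f"P(at least one wrong): {prob_at_least_one_wrong(k, acc):.2f}")
```

Under these assumptions a ten-study meta-analysis would be expected to contain two or three wrong effect estimates, and the chance of containing none is small.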

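The Fabrication Problem

GPT-4 HALLUCINATIONS IN SYSTEMATIC REVIEWS, 2023

Researchers tested GPT-4 on data extraction from systematic review papers. The model was given PDFs and asked to extract sample sizes, p-values, and effect estimates.

GPT-4 confidently provided all requested numbers with precise formatting.

But 23% of the extractions were "hallucinations": numbers with no basis in the source text.

In one case, the model fabricated a statistically significant result (p=0.003) from a study that had actually found no significant effect (p=0.42).

The model's confidence did not distinguish real data from fabricated data.

Systematic review AI validation study, 2023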

THE LESSON

LLMs require 100% human verification for quantitative data. There are no shortcuts. Every number must be checked against the source.

LLM Data Extraction Workflow

Safe LLM Extraction Protocol

PDF/Full Text

↓

LLM extracts data: Structured prompt

↓

Human verifies 100%: NOT sampling

↓

Discrepancy?

Yes

Human value used: Document error

No

Proceed: Log verification
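
The discrepancy branch of this protocol can be expressed as a small reconciliation step. The sketch below assumes both the LLM draft and the human-verified values are held as dictionaries keyed by field name; the function name, field names, and log format are hypothetical.

```python
from typing import Dict, List, Tuple


def reconcile(
    llm: Dict[str, str], human: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Keep the human-verified value for every field and log discrepancies.

    Hypothetical helper: `human` must cover every field (100% verification).
    """
    final: Dict[str, str] = {}
    discrepancies: List[str] = []
    for field_name, human_value in human.items():
        llm_value = llm.get(field_name)
        final[field_name] = human_value  # the human value always wins
        if llm_value != human_value:
            discrepancies.append(
                f"{field_name}: LLM={llm_value!r} human={human_value!r}"
            )
    return final, discrepancies


if __name__ == "__main__":
    llm_draft = {"n_intervention": "120", "p_value": "0.003"}
    verified = {"n_intervention": "120", "p_value": "0.42"}
    final, log = reconcile(llm_draft, verified)
    print(final)
    print(log)
```

Keeping the discrepancy log is what later makes it possible to report an error rate per field.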

Prompt Engineering for Extraction

# Example extraction prompt

Extract the following from this RCT:

1. Sample size (intervention arm): [number]
2. Sample size (control arm): [number]
3. Primary outcome definition: [text]
4. Effect estimate: [number with unit]
5. 95% CI: [lower, upper]
6. p-value: [number]

If not reported, write "NR"
If unclear, write "UNCLEAR: [reason]"

# Provide exact quotes for verification
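
The "NR" and "UNCLEAR: [reason]" conventions in this prompt make the model's answers easy to triage before human verification. A minimal sketch of that triage, assuming each answer arrives as a plain string; the function name and status labels are hypothetical.

```python
from typing import Dict


def parse_field(raw: str) -> Dict[str, str]:
    """Classify one answer as reported, not reported, or unclear.

    Follows the prompt's conventions: "NR" and "UNCLEAR: [reason]".
    """
    text = raw.strip()
    if text.upper() == "NR":
        return {"status": "not_reported", "value": ""}
    if text.upper().startswith("UNCLEAR"):
        reason = text.partition(":")[2].strip()
        return {"status": "unclear", "value": reason}
    return {"status": "reported", "value": text}


if __name__ == "__main__":
    for answer in ("120", " NR ", "UNCLEAR: two sample sizes given"):
        print(parse_field(answer))
```

A "reported" status only means the model returned a value; the value itself still has to be checked against the paper.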

When LLMs Help vs. Hurt

LLM Extraction Value Assessment

Standardized fields (author, year) ✓ High accuracy

Simple numeric (sample size) ✓ Usually reliable

Complex numeric (adjusted OR) ⚠ Often wrong model

Composite outcomes ⚠ Misses components

Intention-to-treat vs per-protocol ✗ Frequently confused

Subgroup data ✗ High error rate

"The LLM extracts plausible numbers,
not necessarily correct ones.
It is a fast first draft, not the final answer.
Every cell must be verified by human eyes."

Have you not wished for a judge
who reads every methods section,
who assesses bias without bias,
who never disagrees with themselves?

RobotReviewer

MARSHALL ET AL., NATURE MACHINE INTELLIGENCE, 2019

RobotReviewer uses machine learning to assess risk of bias in RCTs.

Validation against Cochrane assessments:
• Random sequence generation: 71% agreement
• Allocation concealment: 65% agreement
• Blinding of participants: 69% agreement
• Blinding of outcome assessment: 62% agreement

Human inter-rater agreement is typically 70-80%.

RobotReviewer approaches but does not exceed human performance.

Marshall IJ et al. Nat Mach Intell. 2019;1:115-117

RoB Automation Decision Tree

When to Use Automated RoB

Risk of Bias Assessment

↓

Review Type?

Rapid review

Automated OK: Acknowledge limitation

Scoping review

Automated OK: If RoB included

Full systematic review

Preliminary only: Human verification required

Cochrane review

Human required: Draft support only

Limitations of Automated RoB

What Machines Cannot Assess

✗ Outcome-specific bias (RoB 2 domain 4)

✗ Selective reporting based on protocol comparison

✗ Contextual judgment (Is this design appropriate?)

✗ Cross-paper inconsistencies (multiple reports)

✗ Influence of funding on interpretation of results

The Fundamental Limitation

AI reads what is written. Bias assessment often requires judging what is not written.

Hybrid Workflow for RoB

Best Practice Protocol

Full Text PDFs

↓

RobotReviewer screening: Flags potential issues

↓

Reviewer 1 assesses: Using AI output as reference

↓

Reviewer 2 independently: Blinded to AI output

↓

Consensus meeting

↓

Final assessment: Human decision documented

"The robot reads the methods section
but cannot read between the lines.
Use it to flag, not to judge.
The verdict must be human."

Have you not wished for a writer
who drafts your protocol in minutes,
who knows every PRISMA item,
who writes in perfect academic prose?

LLMs for Protocol Drafting

✓

Structure
generation

✓

Boilerplate
text

⚠

PICO
formulation

✗

Search
strategy

The Value Proposition

LLMs can draft structure and standard language. You must provide the scientific decisions.

The Search Strategy Danger

TESTED ACROSS MULTIPLE LLMs, 2023-2024

Researchers asked GPT-4 and Claude to generate MEDLINE search strategies.

Common errors:
• Invented MeSH terms that don't exist
• Wrong field codes (e.g., [tiab] vs [tw])
• Missing concepts critical to the research question
• Overly narrow strategies missing relevant studies
• Syntax errors that wouldn't execute

An information specialist must write or validate all search strategies.

Multiple validation studies, 2023-2024

Protocol Writing Decision Tree

LLM Use in Protocol Development

Protocol Section

↓

Background/Rationale

LLM helpful: Draft + fact-check

Methods structure

LLM helpful: Template generation

PICO criteria

Human decides: LLM refines wording

Search strategy

Human/Specialist: AI too unreliable

Safe LLM Protocol Workflow

Quality Assurance Steps

1 Define PICO yourself (human scientific decision)

2 Ask LLM to draft protocol sections

3 Verify all cited guidelines exist (PRISMA, Cochrane)

4 Write search strategy with information specialist

5 Check all methodological decisions are defensible

6 Disclose AI assistance in protocol

7 Register the human-verified version

"The machine can write the words,
but it cannot make the decisions.
You define the question. You choose the methods.
The protocol is yours. The AI is the typist."

Have you not seen the systematic review
that was out of date before it was published,
while new trials accumulated in the literature,
unsynthesized, unknown?

The Living Review Problem

THE COVID-19 EVIDENCE TSUNAMI, 2020

In the first year of the pandemic:

• 100,000+ COVID papers published
• Traditional reviews obsolete within weeks
• Clinicians made decisions on incomplete evidence

The COVID-NMA consortium used AI-assisted surveillance to monitor new trials daily and update meta-analyses weekly.

This required automated search monitoring, AI screening prioritization, rapid data extraction workflows, and continuous statistical updating.

Defined in Cochrane Living Reviews guidance

AI Components for Living Reviews

Automated Surveillance Stack

Living Review System

↓

Auto-search: Daily/weekly runs

AI triage: Priority screening

Rapid extraction: LLM-assisted

Auto-update: Cumulative MA

↓

Human oversight at each stage: Editorial review before publication

Tools for Continuous Monitoring

PubMed Alerts

Free email alerts
Saved searches

Basic

Epistemonikos

Systematic review
database

AI-curated

Covidence

Auto-import
Living mode

Subscription

DistillerSR

AI screening
+ monitoring

Enterprise

Living Review Decision Framework

When to Make a Review "Living"

Should This Review Be Living?

↓

Criteria Check

Priority question: Clinical importance

Evidence evolving: Active trial pipeline

Resources secured: Funding for 2+ years

↓

All three required for living status

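"The machine watches the literature
while you sleep.
But someone must wake to judge
whether the new evidence changes the truth."

If you use the machine without verification,
you do not know what errors you have made.

If you verify everything the machine produces,
what time have you saved?

The answer is strategic verification.

The Verification Paradox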

THE DILEMMA

Full verification = No time savings
No verification = Unknown error rate
Strategic verification = Validated efficiency

Verification Strategy by Risk

High-risk tasks

100% human review: Data extraction, RoB

Medium-risk tasks

Sample validation: Screening decisions

Low-risk tasks

Spot checks: Deduplication

When Oversight Catches Bias

COCHRANE MACHINE LEARNING PILOT, 2022

Cochrane tested ML-assisted risk of bias assessment to accelerate systematic reviews.

The algorithm achieved 85% agreement with human reviewers, seemingly impressive.

But the QA team analyzed the 15% of disagreements and found a pattern:

The AI was systematically biased toward rating industry-funded trials as low risk.

The training data contained more "low risk" labels for pharmaceutical company trials. The algorithm learned this correlation without understanding the underlying methodological concerns.

Human oversight caught the pattern before any biased reviews were published.

Cochrane Methods Group pilot study, 2022

THE LESSON

Disagreement analysis reveals systematic bias. High overall accuracy can hide dangerous patterns. Always analyze where and how the AI fails, not just how often.
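
One way to look at where the AI fails is to break disagreements down by subgroup and by direction. The sketch below is an illustration under stated assumptions, not the pilot's actual method: each record is a hypothetical `(subgroup, ai_rating, human_rating)` tuple, and the function reports how often the AI rated a trial "low" risk when the human did not.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (subgroup, ai_rating, human_rating)


def lenient_disagreement_rate(
    records: Iterable[Record], lenient_label: str = "low"
) -> Dict[str, float]:
    """Share of each subgroup where the AI was more lenient than the human."""
    totals: Dict[str, int] = defaultdict(int)
    lenient: Dict[str, int] = defaultdict(int)
    for group, ai_rating, human_rating in records:
        totals[group] += 1
        if ai_rating == lenient_label and human_rating != lenient_label:
            lenient[group] += 1
    return {group: lenient[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Toy data for illustration only, not results from any study.
    data = [
        ("industry", "low", "high"), ("industry", "low", "low"),
        ("industry", "low", "high"), ("industry", "high", "high"),
        ("public", "low", "low"), ("public", "high", "high"),
        ("public", "high", "low"), ("public", "low", "high"),
    ]
    print(lenient_disagreement_rate(data))
```

A rate that differs sharply between subgroups is the kind of pattern that an overall agreement figure cannot show.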

QA Framework for AI-Assisted Reviews

Minimum Quality Standards

1 Pre-specify AI use in protocol (which tools, which tasks)

2 Document AI settings (model version, prompts, parameters)

3 Validate screening with random sample (calculate recall estimate)

4 Verify all extracted data against source documents

5 Human RoB assessment (AI as preliminary only)

6 Track error rates per AI task

7 Report transparently in methods section
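
Standards 1, 2, and 7 are easier to meet if AI use is logged in a structured form as the review proceeds. A minimal sketch of such a log entry follows; the class name, field names, and example values (including the date) are hypothetical and not a reporting standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class AIUsageRecord:
    """One log entry describing how an AI tool was used in the review."""

    tool: str
    version: str
    task: str
    date_used: str  # ISO 8601 date, e.g. "2024-01-15"
    prompt: str = ""
    parameters: Dict[str, str] = field(default_factory=dict)
    validation: str = ""

    def to_json(self) -> str:
        """Serialize the entry for the audit trail."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    entry = AIUsageRecord(
        tool="ASReview LAB",
        version="1.2",
        task="title/abstract screening",
        date_used="2024-01-15",  # illustrative date
        validation="random sample of 300 unscreened records",
    )
    print(entry.to_json())
```

Entries of this kind can be exported alongside the protocol and summarized in the methods section.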

Reporting AI Use (PRISMA-S)

What to Report in Your Paper

• Which AI tools were used (name, version, date)
• Which tasks were AI-assisted
• What validation was performed
• What error rates were observed
• What human oversight was maintained
• Any deviations from the protocol due to AI limitations

EMERGING STANDARD

Journals increasingly require AI use statements. PRISMA-S extension for search reporting includes automation.

Complete AI-MA Workflow

Integrated Human-AI Process

Protocol (Human + LLM draft)

↓

Search (Human/Specialist)

↓

Screening (AI prioritize + Human decide)

↓

Extraction (LLM draft + Human verify 100%)

↓

RoB (AI flag + Human assess)

↓

Analysis (Human)

↓

Interpretation (Human)

"The machine is neither colleague nor replacement.
It is a tool: powerful, fast, and fallible.
Document what you used. Validate what it produced.
The responsibility remains yours."

Have you not considered
whose labor trained the model,
whose data it consumed without consent,
whose jobs it may displace?

The Hidden Labor

KENYAN DATA LABELERS, TIME MAGAZINE 2023

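ChatGPT was made "safe" through a process called RLHF (Reinforcement Learning from Human Feedback).

The humans who provided that feedback were Kenyan workers, paid less than $2 per hour to read and label toxic, violent, and disturbing content.

They developed psychological trauma from the work.

Every AI tool you use depends on human labor that is often invisible, underpaid, and harmful.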

Perrigo B. Time Magazine. 2023 Jan 18.

Automating Inequality

UK A-LEVEL ALGORITHM SCANDAL, 2020

When COVID-19 cancelled A-level exams in the UK, the government used an algorithm to predict student grades based on historical school performance.

The results:

• Students from disadvantaged schools were systematically downgraded
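• Students from private schools were upgraded
• The algorithm overrode teachers' predictions that students would succeed

After massive public outcry, 40% of grades were revised.

The algorithm had encoded historical inequality as prediction. Schools that had historically sent fewer students to university were penalized, regardless of individual students' ability.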

UK Office of Qualifications and Examinations Regulation, 2020

THE LESSON

AI can automate bias at scale. When historical data reflects systemic inequality, algorithms trained on that data perpetuate and amplify it.

Ethical Framework for AI in Research

Questions to Ask

1 Transparency: Can I fully disclose how AI was used?

2 Accountability: Who is responsible for AI errors?

3 Equity: Does AI access create research inequities?

4 Labor: Whose work made this tool possible?

5 Environment: What is the carbon cost of model training?

6 Reproducibility: Can others replicate my AI-assisted work?

Authorship and AI

ICMJE POSITION

AI tools cannot be listed as authors.

Authors must take responsibility for AI-generated content.

AI use must be disclosed in methods or acknowledgments.

YOUR RESPONSIBILITY

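If the AI hallucinates and you publish it, the responsibility is yours: not OpenAI's, not Anthropic's, not the tool's.

"The machine has no conscience.
It does not care whether the data are true.
It does not know who was harmed to train it.
You are the conscience it lacks."

The Road Ahead

Where AI in Evidence Synthesis Is Heading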

Emerging Capabilities

Multimodal AI

Extract from
figures/tables

2024-2025

Agent Systems

Multi-step
workflows

Emerging

RAG Systems

Retrieval-augmented
generation

Active research

Fine-tuned Models

MA-specific
training

In development

What Will Not Change

Enduring Human Requirements

★ Defining the research question (clinical judgment)

★ Interpreting clinical significance (domain expertise)

★ Assessing applicability (contextual knowledge)

★ Making recommendations (value judgments)

★ Taking responsibility (ethical accountability)

THE CONSTANT

AI accelerates the mechanics.
The science remains human.

Preparing for the Future

Skills to Develop

Future-Ready Researcher

↓

Prompt engineering: Getting good AI outputs

Validation methods: Knowing when AI errs

Core methods: AI cannot replace

↓

The best AI users are the best methodologists: Understanding enables oversight

"The machine grows stronger each year.
But the questions stay the same:
What is true? What helps patients?
AI can help you search.
Only you can provide the answer."

Test Your Knowledge

What is the main limitation of using LLMs for data extraction?

They are too slow

They can generate plausible but incorrect data (hallucinations)

They cannot read PDFs

They are expensive

When using AI screening (e.g., ASReview), what must you always do?

Trust the AI completely after training

Screen only the top 10% of ranked records

Validate the stopping rule with a random sample

Use multiple AI tools simultaneously

For which task should AI never be the final decision-maker?

Deduplication

Screening prioritization

Clinical interpretation of results

Reference formatting

References

Key Sources

Van de Schoot R et al. Nat Mach Intell. 2021;3:125-133. [ASReview]
Marshall IJ et al. Nat Mach Intell. 2019;1:115-117. [RobotReviewer]
Guo Y et al. J Clin Epidemiol. 2024;165:111203. [GPT-4 extraction]
Mata v. Avianca, 22-cv-1461 (S.D.N.Y. 2023). [Hallucination case]
Perrigo B. Time Magazine. 2023 Jan 18. [AI labor ethics]
Elliott JH et al. J Clin Epidemiol. 2017;91:23-30. [Living reviews]
Cochrane Handbook 2023. Chapter on automation.
ICMJE. Recommendations on AI authorship. 2023.
Rethlefsen ML et al. J Med Libr Assoc. 2021. [PRISMA-S]
Wang S et al. Syst Rev. 2023;12:178. [AI screening validation]

✔

Course Complete

"Now you know the silicon scribe:
its powers and its limits.
Use it to accelerate, not to replace.
Validate what it produces.
Document what you did.
And always remember:
the machine predicts the next word.
You must judge whether that word is true."

ASReview: Step-by-Step Tutorial

From Installation to Stopping Decision

Step 1: Installation

# Option A: Python pip (recommended)
pip install asreview

# Option B: Download the desktop app
# https://asreview.nl/download/

# Launch ASReview LAB
asreview lab

REQUIREMENTS

• Python 3.8+ (for pip install)
• OR: Windows/Mac desktop app (no Python needed)
• Your references in RIS, CSV, or EndNote XML format

Step 2: Create Project & Import

Project Setup Workflow

New Project

↓

Name the project: Descriptive, include date

↓

Import references: RIS/CSV/XML file

↓

ASReview deduplicates: Check count matches expected

↓

Prepare prior knowledge

Step 3: Add Prior Knowledge

CRITICAL STEP

The model learns from your first decisions.
You need both relevant and irrelevant examples.

Prior Knowledge Strategy

1 Add 5-10 known relevant studies (from scoping search)

2 Search for clearly irrelevant topics (random sample)

3 Mark 10-20 irrelevant as negative examples

4 Aim for ~1:2 ratio (relevant:irrelevant) to start

WARNING

Poor prior knowledge = poor model performance.
Garbage in, garbage out.

Step 4: Screen with Active Learning

Screening Loop

ASReview presents record

↓

Your decision

Relevant: Include for full text

Irrelevant: Exclude

↓

Model updates: Re-ranks remaining

↓

Next most likely relevant: Repeat until stopping rule

Step 5: Stopping Decision

Stopping Rules Compared

Consecutive irrelevant (50-200): Common, but no recall guarantee

% of total screened (e.g., 50%): Predictable effort, variable recall

All records screened: 100% recall, no time savings

Statistical stopping (Busfelder): Evidence-based, requires plugin

VALIDATION REQUIREMENT

After stopping: manually screen random sample of unscreened records.
Report estimated recall with confidence interval.
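
A recall estimate with an interval can be computed from the validation sample. The sketch below is one possible approach under stated assumptions, not a prescribed method: it uses a Wilson score interval (chosen here; the source does not say which interval it intends) for the proportion of relevant records in the unscreened pool, and scales that up to a plausible number of missed studies. The function names are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def estimate_recall(
    found_relevant: int, n_unscreened: int, sample_size: int, sample_relevant: int
) -> Tuple[float, float]:
    """Return (point estimate, lower bound) of recall at the stopping point."""
    if found_relevant <= 0:
        raise ValueError("found_relevant must be positive")
    _, upper = wilson_interval(sample_relevant, sample_size)
    missed_point = n_unscreened * sample_relevant / sample_size
    missed_upper = n_unscreened * upper  # pessimistic count of missed studies
    point = found_relevant / (found_relevant + missed_point)
    lower = found_relevant / (found_relevant + missed_upper)
    return point, lower


if __name__ == "__main__":
    point, lower = estimate_recall(
        found_relevant=50, n_unscreened=4000, sample_size=400, sample_relevant=0
    )
    print(f"recall estimate {point:.2f}, lower bound {lower:.2f}")
```

The lower bound depends heavily on how many relevant records were already found and on the size of the unscreened pool, so a sample with zero relevant records does not by itself establish a narrow interval.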

"The tool is simple, but the decisions are not.
Feed it good examples. Check when you stop.
Export your project file. It is your audit trail."

Prompt Engineering Library

Validated prompts for meta-analysis tasks

Prompt Principles

Getting Reliable LLM Outputs

1 Be specific: Define exact fields and formats

2 Provide examples: Show expected output format

3 Request uncertainty: Ask for "NR" or "UNCLEAR" flags

4 Demand quotes: Require source text for verification

5 Limit scope: One task per prompt, not everything at once

Prompt 1: RCT Data Extraction

Extract the following from this RCT. For each field, provide:
- The value
- The exact quote from the paper (in quotation marks)
- "NR" if not reported, "UNCLEAR" if ambiguous

FIELDS:
1. Intervention group sample size (ITT): [n]
2. Control group sample size (ITT): [n]
3. Primary outcome definition: [text]
4. Primary outcome: intervention events/total: [x/n]
5. Primary outcome: control events/total: [x/n]
6. Risk ratio (95% CI): [RR (lower, upper)]
7. Follow-up duration: [weeks/months]

OUTPUT FORMAT: JSON with "value" and "quote" for each field
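
Requesting a quote for every field allows a quick automated pre-check before human verification: any quote that does not appear in the source text is an immediate red flag. The sketch below assumes the model returns JSON shaped as `{"field": {"value": ..., "quote": ...}}` and that the paper's text is available as a string; the function names are hypothetical.

```python
import json
import re
from typing import List


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_fields(llm_output: str, source_text: str) -> List[str]:
    """Return the fields whose supporting quote is missing from the source."""
    data = json.loads(llm_output)
    source = _normalize(source_text)
    flagged: List[str] = []
    for name, entry in data.items():
        value = str(entry.get("value", ""))
        quote = _normalize(str(entry.get("quote", "")))
        if value == "NR" or value.startswith("UNCLEAR"):
            continue  # no quote expected; a human still confirms the gap
        if not quote or quote not in source:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    paper = (
        "A total of 120 participants were randomised to the intervention arm.\n"
        "The risk ratio was 0.80 (95% CI 0.65 to 0.98)."
    )
    output = json.dumps({
        "n_intervention": {"value": "120", "quote": "120 participants were  randomised"},
        "risk_ratio": {"value": "0.80", "quote": "risk ratio was 0.70"},
        "follow_up": {"value": "NR", "quote": ""},
    })
    print(unverified_fields(output, paper))
```

A matching quote shows only that the text exists in the paper, not that the extracted value is the right one, so this check narrows the human's attention and does not replace it.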

Prompt 2: Study Characteristics

Extract the study characteristics. Provide exact quotes for verification.

FIELDS:
1. Study design: [RCT / Cluster RCT / Crossover / Other]
2. Country/countries: [list]
3. Setting: [hospital / primary care / community / other]
4. Recruitment period: [start date - end date]
5. Funding source: [text]
6. Trial registration: [ID number or "NR"]
7. Conflicts of interest declared: [Yes/No/NR]

If information is in supplementary materials, note "See Supplement".
If truly not reported anywhere, mark "NR".

Prompt 3: Population Characteristics

Extract baseline population characteristics.
Report intervention and control groups separately.

FIELDS (per group):
1. N randomized: [n]
2. N analyzed: [n]
3. Age: [mean (SD) or median (IQR)]
4. Sex (% female): [%]
5. Key inclusion criteria: [text]
6. Key exclusion criteria: [text]
7. Disease severity at baseline: [measure and value]

NOTE: If groups combined only, report combined with note.

Prompt 4: Risk of Bias Screening

NOTE: This is for PRELIMINARY flagging only.
Human assessment required for final judgment.

For each RoB 2 domain, identify relevant text:

D1 Randomization:
- Sequence generation method: [quote or NR]
- Allocation concealment method: [quote or NR]

D2 Deviations:
- Blinding of participants: [quote or NR]
- Blinding of personnel: [quote or NR]

D3 Missing data:
- Attrition rates: [intervention: x%, control: y%]
- Handling of missing data: [quote or NR]

DO NOT make judgments. Only extract quotes.

"The prompt is your contract with the machine.
Be precise about what you ask.
Demand evidence for every answer.
Verify every output against the source."

You may never write a systematic review.
But you will read them.

How will you know whether the AI assistance
was done well or poorly?

The IBM Watson Oncology Failure

MD ANDERSON CANCER CENTER, 2017

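IBM Watson for Oncology was trained to recommend cancer treatments.

After spending $62 million, MD Anderson cancelled the project.

Internal documents showed Watson made "unsafe and incorrect" treatment recommendations. It had been trained on synthetic cases rather than real patient data.

The AI appeared confident. Its recommendations were dangerous.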

Lesson: AI confidence ≠ AI correctness

STAT News investigation, 2017; IEEE Spectrum 2019

Questions to Ask of AI-Assisted Reviews

What to Look for in the Methods

1 Did they name the AI tools used? (version, date)

2 Did they specify which tasks were AI-assisted?

3 Did they validate AI outputs? How?

4 For AI screening: What stopping rule? What estimated recall?

5 For AI extraction: 100% human verified?

6 Was there human oversight of all AI decisions?

Red Flags in AI-Assisted Reviews

Warning Signs

"AI screened all titles": No human involvement?

"GPT extracted data": No verification mentioned?

"Stopped after 500 consecutive irrelevant": No recall estimate?

"AI-generated protocol": Human decisions unclear?

No AI tools mentioned but clearly AI-written: Hidden AI use

For Patients and Clinicians

What You Need to Know

Good AI use: Speeds up the work, human verifies
Bad AI use: Replaces human judgment, no validation

An AI-assisted review can be trustworthy—if done right.

Simple Questions to Ask

? "Was AI used in this review?"

? "Were the AI results checked by humans?"

? "Could AI have missed important studies?"

"AI assistance is not a flaw—it is often an advantage.
But only if validated, only if disclosed.
Ask: was the machine checked?
If the answer is unclear, "

Have you not thought of the researcher
with unstable internet, limited compute,
no institutional subscription,
who still needs to synthesize evidence?

Free & Offline-Capable Tools

ASReview

Desktop app
Works offline

FREE

Abstrackr

Web-based
Free accounts

FREE

Rayyan

Free tier
Limited AI

FREEMIUM

RevMan

Cochrane tool
Full MA software

FREE

Offline Workflow

When Internet is Unreliable

Search Phase

↓

Library/café, download all PDFs: Bulk download while connected

↓

Screening Phase

↓

ASReview desktop: Works fully offline

↓

Extraction Phase

↓

Spreadsheet + local PDFs: No AI needed

Low-Cost LLM Alternatives

WHEN API COSTS ARE PROHIBITIVE

• Claude/ChatGPT free tiers: Limited but functional
• Ollama + local models: Free, runs on laptop (requires download)
• Hugging Face inference: Free tier available
• Manual extraction: Still gold standard, just slower

HONEST ASSESSMENT

AI is helpful, not essential.
All Cochrane reviews were done without AI.
Quality comes from methods, not tools.

Resource-Limited Decision Tree

Choose Your Approach

Your Resources

↓

Internet reliability?

Stable

Web tools OK: Rayyan, Covidence

Unreliable

Desktop tools: ASReview offline

None

Manual + spreadsheets: Still valid

"Evidence belongs to everyone,
not only to those with fast internet and paid subscriptions.
The tools may differ. The methods remain.
Quality synthesis is possible anywhere."

Validation Calculations

Sample Size for AI Validation

Estimating Recall After AI Screening

THE PROBLEM

You stopped screening at 1000 of 5000 records.
How confident are you that you found all the relevant studies?

Validation Sampling

Unscreened records (n=4000)

↓

Random sample (n=400): 10% or at least 200

↓

Manual screening

0 relevant found: Recall ≈ 95-100%

Relevant found: Screen all remaining

Sample Size Formula

For 95% Confidence in Recall

                    n = ln(1 - confidence) / ln(1 - prevalence)

Example:

                    If prevalence of relevant = 1% (0.01)

                    For 95% confidence (0.95):

                    n = ln(1 - 0.95) / ln(1 - 0.01)

                    n = ln(0.05) / ln(0.99)

                    n ≈ 299 records to sample
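
The formula gives the smallest sample for which, if relevant records made up the stated prevalence of the unscreened pool, the chance of seeing none of them in the sample would be at most 1 - confidence, that is, (1 - prevalence)^n ≤ 1 - confidence. A short sketch of the calculation, rounding up to a whole record; the function name is illustrative.

```python
import math


def validation_sample_size(prevalence: float, confidence: float = 0.95) -> int:
    """Records to sample so that zero hits is unlikely at the given prevalence.

    Implements n = ln(1 - confidence) / ln(1 - prevalence), rounded up.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - prevalence))


if __name__ == "__main__":
    for prevalence in (0.005, 0.01, 0.02, 0.05):
        print(f"{prevalence:.1%}: {validation_sample_size(prevalence)} records")
```

Lower assumed prevalence demands a larger sample, which is why rare-topic reviews need the most validation effort.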

Quick Reference Table

Sample Sizes for Validation

Prevalence 0.5%, 95% conf: 598 records

Prevalence 1%, 95% conf: 299 records

Prevalence 2%, 95% conf: 149 records

Prevalence 5%, 95% conf: 59 records

Practical minimum: 200 records (conservative)

Reporting Validation

Example methods text:

"We used ASReview LAB (v1.2) for title/abstract screening with
active learning. Screening ceased after 150 consecutive
irrelevant records, having screened 1,247 of 4,892 records
(25%). To validate recall, we manually screened a random
sample of 300 unscreened records. No additional relevant
studies were identified, suggesting an estimated recall ≥95%
(binomial 95% CI: 91-100%)."

"Validation is not optional.
Calculate your sample. Screen it manually.
Report what you found. Acknowledge what you may have missed."
