The Silicon Scribe: AI in Meta-Analysis

Have you not heard of the machine that reads
ten thousand abstracts in an hour,
that extracts data while you sleep,
that promises to free you from drudgery?

The AI Revolution in Evidence Synthesis

67%

Workload reduction
with AI screening

95%

Recall achievable
with active learning

10x

Faster screening
than manual

THE PROMISE

AI can screen abstracts, extract data, assess risk of bias, and monitor new evidence, if used correctly.

When AI Fails in Healthcare

IBM WATSON ONCOLOGY, MD ANDERSON, 2013-2017

In 2013, MD Anderson Cancer Center partnered with IBM Watson to revolutionize cancer treatment recommendations. The project cost $62 million.

By 2017, the project had been abandoned. Watson's recommendations were found to be "unsafe and incorrect" in multiple cases.

In one documented case, Watson recommended a treatment that would cause severe bleeding in a patient already on blood thinners.

The core problem: Watson had been trained primarily on hypothetical cases created by physicians, not real patient data. The AI learned to mimic expert opinions rather than learn from actual outcomes.

Stat News, 2017; IEEE Spectrum, 2019

THE LESSON

AI trained on synthetic or hypothetical data fails on real patients. The gap between training data and reality can be lethal.

The Hallucination Problem

LAWYERS SANCTIONED, NEW YORK, 2023

Attorneys used ChatGPT to research case law for a federal court brief.

The AI cited six cases, complete with citations, quotations, and page numbers.

None of the cases existed.

The judge found the citations to be "gibberish" and sanctioned the attorneys.

This is not a bug. It is how large language models work: they predict plausible text, not verified truth.

Mata v. Avianca, Inc., 22-cv-1461 (S.D.N.Y. 2023)

The Central Question

When to Trust AI in Meta-Analysis

AI Tool Output

↓

Task Type?

Ranking/Prioritization

Lower risk: Human reviews top-ranked

Binary Decision

Medium risk: Needs validation

Text Generation

High risk: Hallucination possible

What AI Can and Cannot Do

Honest Assessment

Screening prioritization ✓ Excellent

Duplicate detection ✓ Excellent

Data extraction (structured) ⚠ Needs verification

Risk of bias assessment ⚠ Preliminary only

Protocol/methods writing ⚠ Draft only

Statistical analysis ✗ Human required

Clinical interpretation ✗ Human required

"The machine reads fast but does not comprehend.
It predicts the next word, not the truth.
Use it to accelerate, not to replace.
The judgment must remain yours."

Have you not seen the reviewer
who screened ten thousand titles by hand,
whose eyes grew tired, whose attention wandered,
who missed the one study that mattered?

The Screening Tools

ASReview

Active learning
Open source

Free

Rayyan

AI recommendations
Collaboration

Freemium

Abstrackr

Semi-automated
Web-based

Free

EPPI-Reviewer

Priority screening
Full workflow

Subscription

How Active Learning Works

ASReview Workflow

Import References

↓

Screen seed papers: 10-20 known relevant

↓

AI learns patterns: Updates with each decision

↓

Prioritizes likely relevant: Most promising first

↓

Stopping rule?

Consecutive irrelevant: e.g., 100-200 in a row

% screened: e.g., 50% with recall check
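The consecutive-irrelevant stopping rule in the workflow above can be sketched in a few lines of Python. This is a minimal illustration, not ASReview's own implementation: the function name `should_stop` and the default threshold of 100 are assumptions made for the example.

```python
from typing import Sequence


def should_stop(decisions: Sequence[bool], threshold: int = 100) -> bool:
    """Return True once the last `threshold` screening decisions were all irrelevant.

    `decisions` holds the reviewer's labels in screening order
    (True = relevant, False = irrelevant).
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if len(decisions) < threshold:
        return False  # not enough records screened to apply the rule yet
    return not any(decisions[-threshold:])


# Hypothetical screening history: a few early hits, then 100 irrelevant records.
history = [True, False, True, True] + [False] * 100
print(should_stop(history, threshold=100))
```

The rule only reports that the ranking has gone quiet. It carries no recall guarantee, which is why a separate validation sample is still needed.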

Real Performance Data

VAN DE SCHOOT ET AL., 2021

Systematic evaluation of ASReview across 4 datasets:

• PTSD dataset: 95% recall after screening 40% of records
• Software fault prediction: 95% recall after 20%
• Virus metagenomics: 95% recall after 10%

Average workload reduction: 67-95% depending on prevalence.

But: Performance varies by topic and prevalence. Low-prevalence topics show greater efficiency gains.

Van de Schoot R et al. Nat Mach Intell. 2021;3:125-133

When AI-Assisted Screening Works

ASREVIEW AND COCHRANE COVID-19 RESPONSE, 2020

During the COVID-19 pandemic, Cochrane needed to screen 50,000+ citations weekly to keep reviews current.

ASReview's active learning system was deployed with rigorous human oversight:

• Reduced human screening workload by 75%
• Missed fewer than 1% of relevant studies
• Validated at every stage by human reviewers

The key to success: human-in-the-loop validation at every stage. The AI prioritized, but humans made the final decisions and verified samples of AI-excluded records.

Cochrane COVID-NMA consortium, 2020-2021

THE LESSON

AI augments human judgment; it does not replace it. Success comes from partnership, not automation.

When Internal Validation Fails

EPIC SEPSIS MODEL, JAMA INTERNAL MEDICINE, 2021

Epic Systems deployed a sepsis prediction algorithm to hundreds of hospitals across the United States.

Epic's internal validation showed excellent performance. Hospitals trusted it.

Then came the external validation study in JAMA Internal Medicine:

• The model missed 67% of sepsis cases
• It triggered thousands of false alarms
• Nurses developed severe "alert fatigue"

The model had been validated on historical data from the same system; it had never been tested in the real clinical setting where it would be deployed.

Wong A et al. JAMA Intern Med. 2021;181(8):1065-1070

THE LESSON

Internal validation is not external validation. A model that works in development can fail in deployment. Always validate in the real-world context.

The Stopping Problem

THE HIDDEN DANGER

When do you stop screening with active learning?

If you stop too early: You miss relevant studies
If you stop too late: You lose efficiency gains

The algorithm cannot tell you when you have found everything. It only ranks what remains.

There is no perfect stopping rule. Every rule trades recall for efficiency.

CRITICAL POINT

You must validate your stopping rule by manually checking a random sample of unscreened records.
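One way to turn that random sample into a recall estimate is sketched below. It is an illustration under stated assumptions, not a prescribed method: it uses a Wilson score interval for the proportion of relevant records in the sample and extrapolates that proportion to all unscreened records. The function names and the example counts are hypothetical.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def recall_estimate(found: int, unscreened: int, sample_n: int, sample_hits: int):
    """Return (point, lower, upper) recall from a random sample of unscreened records."""
    if found <= 0:
        raise ValueError("found must be positive")
    lo_p, hi_p = wilson_interval(sample_hits, sample_n)

    def recall(p: float) -> float:
        # Recall if a fraction p of the unscreened records is actually relevant.
        return found / (found + p * unscreened)

    return recall(sample_hits / sample_n), recall(hi_p), recall(lo_p)


# Hypothetical review: 60 relevant found, 4000 unscreened, 0 hits in a sample of 400.
point, lower, upper = recall_estimate(found=60, unscreened=4000, sample_n=400, sample_hits=0)
print(f"recall point estimate {point:.2f}, interval {lower:.2f} to {upper:.2f}")
```

Because missed studies are extrapolated to every unscreened record, the lower bound depends heavily on how many relevant records screening had already found. A clean sample is reassuring, but it is not proof of complete recall.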

AI Screening Decision Tree

Should You Use AI Screening?

Large Reference Set?

↓

<500 refs

Manual OK: AI overhead not worth it

500-2000 refs

AI helpful: Moderate efficiency gain

>2000 refs

AI essential: Major time savings

↓

Always validate with random sample: Report methodology in paper

"The machine finds the needles faster,
but it cannot guarantee none remain in the haystack.
Trust the ranking, verify the stopping,
and always report what you did."

Have you not dreamed of the assistant
who reads every paper and fills every cell,
who never tires, never errs,
who extracts perfectly?

That assistant does not exist.

The Extraction Accuracy Problem

GPT-4 DATA EXTRACTION STUDY, 2024

Researchers tested GPT-4 on extracting data from 100 RCT papers.

Results:
• Sample sizes: 89% accurate
• Effect estimates: 76% accurate
• Confidence intervals: 71% accurate
• Risk of bias judgments: 62% agreement with humans

A 24% error rate in effect estimates means that roughly 1 in 4 studies would carry incorrect data into your meta-analysis.

Guo Y et al. J Clin Epidemiol. 2024;165:111203
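A back-of-envelope sketch of what a 76% per-study accuracy implies for a pooled analysis. It assumes, purely for illustration, that extraction errors are independent across studies; the study count of 20 is hypothetical.

```python
def expected_wrong(n_studies: int, accuracy: float) -> float:
    """Expected number of studies with an incorrect extracted value."""
    return n_studies * (1.0 - accuracy)


def prob_at_least_one_wrong(n_studies: int, accuracy: float) -> float:
    """Probability that at least one study carries an error (independence assumed)."""
    return 1.0 - accuracy ** n_studies


n = 20      # hypothetical number of included trials
acc = 0.76  # effect-estimate accuracy reported above
print(f"Expected incorrect effect estimates: {expected_wrong(n, acc):.1f} of {n}")
print(f"P(at least one incorrect): {prob_at_least_one_wrong(n, acc):.3f}")
```

Even under these simplified assumptions, an unverified extraction of a typical review is close to certain to contain at least one wrong effect estimate.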

The Fabrication Problem

GPT-4 HALLUCINATIONS IN SYSTEMATIC REVIEWS, 2023

Researchers tested GPT-4 on data extraction from systematic review papers. The model was given PDF files and asked to extract sample sizes, p-values, and effect estimates.

GPT-4 confidently provided all requested numbers with precise formatting.

But 23% of the extractions were "hallucinations": numbers with no basis in the source text.

In one case, the model fabricated a statistically significant result (p=0.003) from a study that actually found no significant effect (p=0.42).

The model's confidence was indistinguishable between real and fabricated data.

Systematic review of AI validation studies, 2023

THE LESSON

LLMs require 100% human verification for quantitative data. There are no shortcuts. Every number must be checked against the source.

LLM Data Extraction Workflow

Safe LLM Extraction Protocol

PDF/Full Text

↓

LLM extracts data: Structured prompt

↓

Human verifies 100%: NOT sampling

↓

Discrepancy?

Yes

Human value used: Document error

No

Proceed: Log verification
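The protocol above can be sketched as a small reconciliation step. This is an illustrative outline only: the `Discrepancy` record, the `reconcile` function, and the field names are hypothetical, and the example values are made up for demonstration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Discrepancy:
    study_id: str
    field: str
    llm_value: str
    human_value: str


def reconcile(
    study_id: str, llm: Dict[str, str], human: Dict[str, str]
) -> Tuple[Dict[str, str], List[Discrepancy]]:
    """Compare every LLM-extracted field with the human-verified value.

    Returns the final record (always the human value) and a discrepancy log.
    """
    final: Dict[str, str] = {}
    log: List[Discrepancy] = []
    for field, human_value in human.items():
        llm_value = llm.get(field, "NR")
        if llm_value.strip() != human_value.strip():
            log.append(Discrepancy(study_id, field, llm_value, human_value))
        final[field] = human_value  # the human value is used in every case
    return final, log


# Hypothetical example values
llm_out = {"n_intervention": "120", "n_control": "118", "p_value": "0.003"}
human_out = {"n_intervention": "120", "n_control": "118", "p_value": "0.42"}
final, log = reconcile("trial_001", llm_out, human_out)
for d in log:
    print(f"{d.study_id} {d.field}: LLM={d.llm_value} human={d.human_value}")
```

Keeping the discrepancy log, rather than silently overwriting, is what makes a per-task error rate reportable later.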

Prompt Engineering for Extraction

# Example extraction prompt

Extract the following from this RCT:

1. Sample size (intervention arm): [number]
2. Sample size (control arm): [number]
3. Primary outcome definition: [text]
4. Effect estimate: [number with unit]
5. 95% CI: [lower, upper]
6. p-value: [number]

If not reported, write "NR"
If unclear, write "UNCLEAR: [reason]"

# Provide exact quotes for verification

When LLMs Help vs. Hurt

LLM Extraction Value Assessment

Standardized fields (author, year) ✓ High accuracy

Simple numeric (sample size) ✓ Usually reliable

Complex numeric (adjusted OR) ⚠ Often wrong model

Composite outcomes ⚠ Misses components

Intention-to-treat vs per-protocol ✗ Frequently confused

Subgroup data ✗ High error rate

"The LLM extracts plausible numbers,
not necessarily correct numbers.
It is a fast first draft, not a final answer.
Every cell must be verified by human eyes."

Have you not wished for a judge
who reads every methods section,
who assesses bias without bias,
who never disagrees with themselves?

RobotReviewer

MARSHALL ET AL., NATURE MACHINE INTELLIGENCE, 2019

RobotReviewer uses machine learning to assess risk of bias in RCTs.

Validation against Cochrane assessments:
• Random sequence generation: 71% agreement
• Allocation concealment: 65% agreement
• Blinding of participants: 69% agreement
• Blinding of outcome assessment: 62% agreement

Human inter-rater agreement is typically 70-80%.

RobotReviewer approaches but does not exceed human performance.

Marshall IJ et al. Nat Mach Intell. 2019;1:115-117
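Raw percent agreement, as reported above, does not correct for agreement expected by chance. The sketch below computes both percent agreement and Cohen's kappa for two sets of ratings. Kappa is added here as a common companion statistic, not something the validation figures above report, and the ratings are hypothetical.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of items on which the two raters gave the same rating."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both raters assigned categories independently.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical ratings for ten trials on one risk-of-bias domain.
ai = ["low", "low", "high", "unclear", "low", "high", "low", "low", "unclear", "low"]
human = ["low", "high", "high", "unclear", "low", "low", "low", "low", "low", "low"]
print(f"agreement {percent_agreement(ai, human):.2f}, kappa {cohens_kappa(ai, human):.2f}")
```

When most trials fall in one category, percent agreement can look respectable while kappa stays modest, so reporting both gives a fairer picture.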

RoB Automation Decision Tree

When to Use Automated RoB

Risk of Bias Assessment

↓

Review Type?

Rapid review

Automated OK: Acknowledge limitation

Scoping review

Automated OK: If RoB included

Full systematic review

Preliminary only: Human verification required

Cochrane review

Human required: Draft support only

Limitations of Automated RoB

What Machines Cannot Assess

✗ Outcome-specific bias (RoB 2 domain 4)

✗ Selective reporting based on protocol comparison

✗ Contextual judgment (Is this design appropriate?)

✗ Cross-paper inconsistencies (multiple reports)

✗ Funding influence on interpretation of results

THE FUNDAMENTAL LIMIT

AI reads what is written. Bias assessment often requires judging what is not written.

Hybrid Workflow for RoB

Best Practice Protocol

Full Text PDFs

↓

RobotReviewer screening: Flags potential issues

↓

Reviewer 1 assesses: Using AI output as reference

↓

Reviewer 2 independently: Blinded to AI output

↓

Consensus meeting

↓

Final assessment: Human decision documented

"The robot reads the methods section
but cannot read between the lines.
Use it to flag, not to judge.
The verdict must be human."

Have you not wished for the writer
who drafts your protocol in minutes,
who knows every PRISMA item,
who writes in perfect academic prose?

LLMs for Protocol Writing

✓

Structure
generation

✓

Boilerplate
text

⚠

PICO
formulation

✗

Search
strategy

THE VALUE PROPOSITION

LLMs can draft the structure and standard language. You must provide the scientific decisions.

The Search Strategy Danger

TESTED ACROSS MULTIPLE LLMs, 2023-2024

Researchers asked GPT-4 and Claude to generate MEDLINE search strategies.

Common errors:
• Invented MeSH terms that don't exist
• Wrong field codes (e.g., [tiab] vs [tw])
• Missing key concepts from the research question
• Overly narrow strategies missing relevant studies
• Syntax errors that wouldn't execute

An information specialist must write or validate all search strategies.

Multiple validation studies, 2023-2024

Protocol Writing Decision Tree

LLM Use in Protocol Development

Protocol Section

↓

Background/Rationale

LLM helpful: Draft + fact-check

Methods structure

LLM helpful: Template generation

PICO criteria

Human decides: LLM refines wording

Search strategy

Human/Specialist: AI too unreliable

Safe LLM Protocol Workflow

Quality Assurance Steps

1 Define PICO yourself (human scientific decision)

2 Ask LLM to draft protocol sections

3 Verify all cited guidelines exist (PRISMA, Cochrane)

4 Write search strategy with information specialist

5 Check all methodological decisions are defensible

6 Disclose AI assistance in protocol

7 Register the human-verified version

"The machine can write the words,
but it cannot make the decisions.
You define the question. You choose the methods.
The protocol is yours: AI is the typist."

Have you not seen the systematic review
that was outdated before its publication,
while new trials accumulated in the literature,
unsynthesized, unknown?

The Living Review Problem

COVID-19 EVIDENCE TSUNAMI, 2020

In the first year of the pandemic:

• 100,000+ COVID papers published
• Traditional reviews obsolete within weeks
• Clinicians made decisions on incomplete evidence

The COVID-NMA consortium used AI-assisted surveillance to monitor new trials daily and update meta-analyses weekly.

This required: automated search monitoring, AI screening prioritization, rapid data extraction workflows, and continuous statistical updates.

Defined in Cochrane Living Reviews guidance

AI Components for Living Reviews

Automated Surveillance Stack

Living Review System

↓

Auto-search: Daily/weekly runs

AI triage: Priority screening

Rapid extraction: LLM-assisted

Auto-update: Cumulative MA

↓

Human oversight at each stage: Editorial review before publication

Tools for Continuous Monitoring

PubMed Alerts

Free email alerts
Saved searches

Basic

Epistemonikos

Systematic review
database

AI-curated

Covidence

Auto-import
Living mode

Subscription

DistillerSR

AI screening
+ monitoring

Enterprise

Living Review Decision Framework

When to Make a Review "Living"

Should This Be Living?

↓

Criteria Check

Priority question: Clinical importance

Evidence evolving: Active trial pipeline

Resources secured: Funding for 2+ years

↓

All three required for living status

"The machine watches the literature
while you sleep.
But someone must wake to judge
whether the new evidence changes the truth."

If you use the machine without verification,
you do not know what errors you have made.

If you verify everything the machine produces,
what time have you saved?

The answer lies in strategic verification.

The Verification Paradox

THE DILEMMA

Full verification = No time savings
No verification = Unknown error rate
Strategic verification = Validated efficiency

Verification Strategy by Risk

High-risk tasks

100% human review: Data extraction, RoB

Medium-risk tasks

Sample validation: Screening decisions

Low-risk tasks

Spot checks: Deduplication
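A risk-tiered verification plan like the one above can be written down as a small, reproducible sampling routine. This is a sketch under assumptions: the 100% tiers follow the table, while the 10% and 2% fractions, the task keys, and the function name are illustrative choices, not recommended values.

```python
import random

# Fraction of AI outputs a human re-checks, by task. The 1.00 entries follow the
# table above; the 0.10 and 0.02 fractions are illustrative assumptions only.
VERIFY_FRACTION = {
    "data_extraction": 1.00,
    "risk_of_bias": 1.00,
    "screening": 0.10,
    "deduplication": 0.02,
}


def pick_for_verification(task: str, record_ids: list, seed: int = 0) -> list:
    """Draw a reproducible random sample of records for human verification."""
    fraction = VERIFY_FRACTION[task]
    k = min(len(record_ids), max(1, round(fraction * len(record_ids))))
    rng = random.Random(seed)  # fixed seed so the sample can be reported and re-drawn
    return sorted(rng.sample(record_ids, k))


ids = list(range(1000))  # hypothetical record identifiers
for task in VERIFY_FRACTION:
    print(task, len(pick_for_verification(task, ids)))
```

Fixing the seed and recording it in the methods lets another team re-draw the same verification sample.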

When Oversight Catches Bias

COCHRANE MACHINE LEARNING PILOT, 2022

Cochrane tested ML-assisted risk of bias assessment to accelerate systematic reviews.

The algorithm achieved 85% agreement with human reviewers, seemingly impressive.

But the quality assurance team analyzed the 15% of disagreements and found a pattern:

The AI was systematically biased toward rating industry-funded trials as low risk.

The training data contained more "low risk" labels for pharmaceutical company trials; the algorithm learned this correlation without understanding the underlying methodological concerns.

Human oversight caught the pattern before any biased reviews were published.

Cochrane Methods Group pilot study, 2022

THE LESSON

Disagreement analysis reveals systematic bias. High overall accuracy can hide dangerous patterns. Always analyze where and how the AI fails, not just how often.
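A disagreement analysis of this kind can be sketched as a per-subgroup tabulation. The record keys (`group`, `ai`, `human`), the rating labels, and the example data below are hypothetical; the point is only to show disagreements broken down by subgroup and direction instead of a single overall rate.

```python
from collections import defaultdict


def disagreement_by_group(records):
    """Tabulate AI-vs-human disagreements per subgroup and their direction.

    Each record is a dict with hypothetical keys:
    'group' (e.g. funding source), 'ai' and 'human' (risk-of-bias ratings).
    """
    stats = defaultdict(lambda: {"n": 0, "disagree": 0, "ai_more_lenient": 0})
    for r in records:
        s = stats[r["group"]]
        s["n"] += 1
        if r["ai"] != r["human"]:
            s["disagree"] += 1
            if r["ai"] == "low" and r["human"] != "low":
                s["ai_more_lenient"] += 1  # AI rated low risk where the human did not
    return dict(stats)


# Made-up ratings for illustration only.
records = [
    {"group": "industry", "ai": "low", "human": "high"},
    {"group": "industry", "ai": "low", "human": "some concerns"},
    {"group": "industry", "ai": "low", "human": "low"},
    {"group": "industry", "ai": "high", "human": "high"},
    {"group": "public", "ai": "high", "human": "low"},
    {"group": "public", "ai": "low", "human": "low"},
    {"group": "public", "ai": "high", "human": "high"},
    {"group": "public", "ai": "low", "human": "low"},
]
summary = disagreement_by_group(records)
for group, s in summary.items():
    print(group, s)
```

In this toy data every industry disagreement runs in the lenient direction, a pattern an overall agreement figure would not reveal.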

Quality Assurance Framework for AI-Assisted Reviews

Minimum Quality Standards

1 Pre-specify AI use in protocol (which tools, which tasks)

2 Document AI settings (model version, prompts, parameters)

3 Validate screening with random sample (calculate recall estimate)

4 Verify all extracted data against source documents

5 Human RoB assessment (AI as preliminary only)

6 Track error rates per AI task

7 Report transparently in methods section
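Standards 2 and 6 are easier to meet when AI use is logged in a structured form from the start. The sketch below shows one possible shape for such a log; the `AIUseRecord` class and its field names are assumptions for illustration, and the example entry is hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class AIUseRecord:
    task: str          # e.g. "title/abstract screening"
    tool: str          # tool or model name
    version: str       # tool or model version
    date_used: str     # ISO date
    settings: dict     # prompts, parameters, stopping rule
    validation: str    # human check that was performed
    observed_error_rate: Optional[float] = None  # filled in once measured


def to_json(records: List[AIUseRecord]) -> str:
    """Serialize the log so it can be archived with the review's supplementary files."""
    return json.dumps([asdict(r) for r in records], indent=2)


# Hypothetical log entry
log = [
    AIUseRecord(
        task="title/abstract screening",
        tool="ASReview LAB",
        version="1.2",
        date_used="2024-03-01",
        settings={"stopping_rule": "150 consecutive irrelevant"},
        validation="random sample of 300 unscreened records screened manually",
    )
]
print(to_json(log))
```

A log like this maps directly onto the reporting items a methods section needs: tool, version, date, task, validation, and error rate.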

Reporting AI Use (PRISMA-S)

WHAT TO REPORT IN YOUR PAPER

• Which AI tools were used (name, version, date)
• Which tasks were AI-assisted
• What validation was performed
• What error rates were observed
• What human oversight was maintained
• Any deviations from the protocol due to AI limitations

EMERGING STANDARD

Journals increasingly require AI use statements. PRISMA-S extension for search reporting includes automation.

The Complete AI-MA Workflow

Integrated Human-AI Process

Protocol (Human + LLM draft)

↓

Search (Human/Specialist)

↓

Screening (AI prioritize + Human decide)

↓

Extraction (LLM draft + Human verify 100%)

↓

RoB (AI flag + Human assess)

↓

Analysis (Human)

↓

Interpretation (Human)

"The machine is neither colleague nor replacement.
It is a powerful tool, fast and fallible.
Document what you used. Validate what it produced.
The responsibility remains yours."

Have you not considered
whose labor trained the model,
whose data it consumed without consent,
whose jobs it may displace?

The Hidden Labor

KENYAN DATA LABELERS, TIME MAGAZINE 2023

ChatGPT was made "safe" through a process called RLHF: Reinforcement Learning from Human Feedback.

The humans who provided that feedback were workers in Kenya, paid less than $2 per hour to read and label toxic, violent, and disturbing content.

They developed psychological trauma from the work.

Every AI tool you use depends on human labor: often invisible, often underpaid, often harmed.

Perrigo B. Time Magazine. 2023 Jan 18.

Automating Inequality

UK A-LEVEL ALGORITHM SCANDAL, 2020

When COVID-19 cancelled A-Level exams in the UK, the government used an algorithm to predict student grades based on historical school performance.

The results:

• Students from disadvantaged schools were systematically downgraded
• Students from private schools were upgraded
• The algorithm overrode teachers' predictions that students would succeed

After massive public outcry, 40% of grades were revised.

The algorithm had encoded historical inequality as prediction. Schools that historically sent fewer students to university were penalized, regardless of individual student ability.

UK Office of Qualifications and Examinations Regulation, 2020

THE LESSON

AI can automate bias at scale. When historical data reflect systemic inequality, algorithms trained on those data perpetuate and amplify it.

Ethical Framework for AI in Research

Questions to Ask

1 Transparency: Can I fully disclose how AI was used?

2 Accountability: Who is responsible for AI errors?

3 Equity: Does AI access create research inequities?

4 Labor: Whose work made this tool possible?

5 Environment: What is the carbon cost of model training?

6 Reproducibility: Can others replicate my AI-assisted work?

Authorship and AI

ICMJE POSITION

AI tools cannot be listed as authors.

Authors must take responsibility for AI-generated content.

AI use must be disclosed in methods or acknowledgments.

YOUR RESPONSIBILITY

If the AI hallucinates and you publish it, you are responsible: not OpenAI, not Anthropic, not the tool.

"The machine has no conscience.
It does not care whether the data are true.
It does not know who was hurt to train it.
You must be the conscience it lacks."

The Road Ahead

Where AI in Evidence Synthesis Is Going

Emerging Capabilities

Multimodal AI

Extract from
figures/tables

2024-2025

Agent Systems

Multi-step
workflows

Emerging

RAG Systems

Retrieval-augmented
generation

Active research

Fine-tuned Models

MA-specific
training

In development

What Will NOT Change

Enduring Human Requirements

★ Defining the research question (clinical judgment)

★ Interpreting clinical significance (domain expertise)

★ Assessing applicability (contextual knowledge)

★ Making recommendations (value judgments)

★ Taking responsibility (ethical accountability)

THE CONSTANT

AI will accelerate the mechanics.
The science remains human.

Preparing for the Future

Skills to Develop

Future-Ready Researcher

↓

Prompt engineering: Getting good AI outputs

Validation methods: Knowing when AI errs

Core methods: AI cannot replace

↓

The best AI users are the best methodologists: Understanding enables oversight

"The machine grows stronger each year.
But the question remains the same:
What is true? What helps patients?
AI can help in the search.
Only you can provide the answer."

Test Your Knowledge

What is the main limitation of using LLMs for data extraction?

They are too slow

They can generate plausible but incorrect data (hallucinations)

They cannot read PDFs

They are too expensive

When using AI screening (e.g., ASReview), what must you always do?

Trust the AI completely after training

Screen only the top 10% of ranked records

Validate the stopping rule with a random sample

Use multiple AI tools simultaneously

For which task should AI NEVER make the final decision?

Deduplication

Screening prioritization

Clinical interpretation of results

Reference formatting

References

Key Sources

Van de Schoot R et al. Nat Mach Intell. 2021;3:125-133. [ASReview]
Marshall IJ et al. Nat Mach Intell. 2019;1:115-117. [RobotReviewer]
Guo Y et al. J Clin Epidemiol. 2024;165:111203. [GPT-4 extraction]
Mata v. Avianca, 22-cv-1461 (S.D.N.Y. 2023). [Hallucination case]
Perrigo B. Time Magazine. 2023 Jan 18. [AI labor ethics]
Elliott JH et al. J Clin Epidemiol. 2017;91:23-30. [Living reviews]
Cochrane Handbook 2023. Chapter on automation.
ICMJE. Recommendations on AI authorship. 2023.
Rethlefsen ML et al. J Med Libr Assoc. 2021. [PRISMA-S]
Wang S et al. Syst Rev. 2023;12:178. [AI screening validation]

✔

Course Complete

"Now you know the Silicon Scribe:
its powers and its limits.
Use it to accelerate, not to replace.
Validate what it produces.
Document what you did.
And always remember:
The machine predicts the next word.
You must judge whether that word is true."

ASReview: Step-by-Step Tutorial

From Installation to Stopping Decision

Step 1: Installation

# Option A: Python pip (recommended)
pip install asreview

# Option B: Download the desktop application
# https://asreview.nl/download/

# Launch ASReview LAB
asreview lab

REQUIREMENTS

• Python 3.8+ (for pip installation)
• OR: Windows/Mac desktop app (no Python needed)
• Your references in RIS, CSV, or EndNote XML format

Step 2: Create Project & Import

Project Setup Workflow

New Project

↓

Name your project: Descriptive, include date

↓

Import references: RIS/CSV/XML file

↓

ASReview deduplicates: Check count matches expected

↓

Ready for prior knowledge

Step 3: Add Prior Knowledge

CRITICAL STEP

The model learns from your initial decisions.
You need both relevant and irrelevant examples.

Prior Knowledge Strategy

1 Add 5-10 known relevant studies (from the scoping search)

2 Search for clearly irrelevant topics (random sample)

3 Mark 10-20 irrelevant as negative examples

4 Aim for ~1:2 ratio (relevant:irrelevant) to start

WARNING

Poor prior knowledge = poor model performance.
Garbage in, garbage out.

Step 4: Screen with Active Learning

Screening Loop

ASReview presents record

↓

Your decision

Relevant: Include for full text

Irrelevant: Exclude

↓

Model updates: Re-ranks remaining

↓

Next most likely relevant: Repeat until stopping rule

Step 5: Stopping Decision

Stopping Rules Compared

Consecutive irrelevant (50-200): Common, but no recall guarantee

% of total screened (e.g., 50%): Predictable effort, variable recall

All records screened: 100% recall, no time savings

Statistical stopping (Busfelder): Evidence-based, requires plugin

VALIDATION REQUIREMENT

After stopping: manually screen a random sample of unscreened records.
Report estimated recall with confidence interval.

"The tool is simple. The decisions are not.
Feed it good examples. Check when you stop.
Export your project file: it is your audit trail."

Prompt Engineering Library

Validated prompts for meta-analysis tasks

Prompt Principles

For Reliable LLM Outputs

1 Be specific: Define exact fields and formats

2 Provide examples: Show expected output format

3 Request uncertainty: Ask for "NR" or "UNCLEAR" flags

4 Demand quotes: Require source text for verification

5 Limit scope: One task per prompt, not everything at once

Prompt 1: RCT Data Extraction

Extract the following from this RCT. For each field, provide:
- The value
- The exact quote from the paper (in quotation marks)
- "NR" if not reported, "UNCLEAR" if ambiguous

FIELDS:
1. Intervention group sample size (ITT): [n]
2. Control group sample size (ITT): [n]
3. Primary outcome definition: [text]
4. Primary outcome: intervention events/total: [x/n]
5. Primary outcome: control events/total: [x/n]
6. Risk ratio (95% CI): [RR (lower, upper)]
7. Follow-up duration: [weeks/months]

OUTPUT FORMAT: JSON with "value" and "quote" for each field
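Requiring quotes only helps if someone checks them. The sketch below flags fields whose supporting quote cannot be found in the source text. It assumes the LLM returned JSON in which each field is an object with keys named `value` and `quote`; the function names and the example text are hypothetical.

```python
import json
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so PDF line breaks do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_fields(llm_json: str, source_text: str) -> list:
    """Return field names whose supporting quote does not appear in the source text."""
    record = json.loads(llm_json)
    source = normalize(source_text)
    flagged = []
    for name, entry in record.items():
        value = str(entry.get("value", ""))
        if value == "NR" or value.startswith("UNCLEAR"):
            continue  # nothing to ground; a human still confirms it is truly unreported
        quote = normalize(str(entry.get("quote", "")))
        if not quote or quote not in source:
            flagged.append(name)
    return flagged


# Hypothetical source passage and LLM response
source = "A total of 240 patients were randomised:\n120 to the intervention and 120 to control."
response = json.dumps({
    "n_intervention": {"value": "120", "quote": "120 to the intervention"},
    "risk_ratio": {"value": "0.80 (0.65, 0.98)", "quote": "RR 0.80, 95% CI 0.65 to 0.98"},
    "follow_up": {"value": "NR", "quote": ""},
})
print(unsupported_fields(response, source))
```

A quote that matches the source shows only that the text exists, not that the extracted value is right, so this check narrows the human's work without replacing it.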

Prompt 2: Study Characteristics

Extract study characteristics. Provide exact quotes for verification.

FIELDS:
1. Study design: [RCT / Cluster RCT / Crossover / Other]
2. Country/countries: [list]
3. Setting: [hospital / primary care / community / other]
4. Recruitment period: [start date - end date]
5. Funding source: [text]
6. Trial registration: [ID number or "NR"]
7. Conflicts of interest declared: [Yes/No/NR]

If information is in supplementary materials, note "See Supplement".
If truly not reported anywhere, mark "NR".

Prompt 3: Population Characteristics

Extract baseline population characteristics.
Report for INTERVENTION and CONTROL groups separately.

FIELDS (per group):
1. N randomized: [n]
2. N analyzed: [n]
3. Age: [mean (SD) or median (IQR)]
4. Sex (% female): [%]
5. Key inclusion criteria: [text]
6. Key exclusion criteria: [text]
7. Disease severity at baseline: [measure and value]

NOTE: If groups combined only, report combined with note.

Prompt 4: Risk of Bias Screening

NOTE: This is for PRELIMINARY flagging only.
Human assessment required for final judgment.

For each RoB 2 domain, identify relevant text:

D1 Randomization:
- Sequence generation method: [quote or NR]
- Allocation concealment method: [quote or NR]

D2 Deviations:
- Blinding of participants: [quote or NR]
- Blinding of personnel: [quote or NR]

D3 Missing data:
- Attrition rates: [intervention: x%, control: y%]
- Handling of missing data: [quote or NR]

DO NOT make judgments. Only extract quotes.

"The prompt is your contract with the machine.
Be precise in what you ask.
Demand evidence for every answer.
Verify every output against the source."

You may never write a systematic review.
But you will read them.

How do you know whether the AI assistance
was done well or poorly?

The IBM Watson Oncology Failure

MD ANDERSON CANCER CENTER, 2017

IBM Watson for Oncology was trained to recommend cancer treatments.

After spending $62 million, MD Anderson cancelled the project.

Internal documents showed Watson made "unsafe and incorrect" treatment recommendations. It was trained on synthetic cases, not real patient data.

The AI seemed confident. The recommendations were dangerous.

Lesson: AI confidence ≠ AI correctness

STAT News investigation, 2017; IEEE Spectrum 2019

Questions for AI-Assisted Reviews

What to Look for in the Methods

1 Did they name the AI tools used? (version, date)

2 Did they specify which tasks were AI-assisted?

3 Did they validate AI outputs? How?

4 For AI screening: What stopping rule? What estimated recall?

5 For AI extraction: Was it 100% human verified?

6 Was there human oversight of all AI decisions?

Red Flags in AI-Assisted Reviews

Warning Signs

"AI screened all titles": No human involvement?

"Data extracted by GPT": No verification mentioned?

"Stopped after 500 consecutive irrelevant": No recall estimate?

"AI-generated protocol": Human decisions unclear?

No AI tools mentioned but clearly AI-written: Hidden AI use

For Patients and Clinicians

WHAT YOU NEED TO KNOW

Good AI use: Speeds up the work, human verifies
Bad AI use: Replaces human judgment, no validation

An AI-assisted review can be trustworthy—if done right.

Simple Questions to Ask

? "Was AI used in this review?"

? "Were the AI outputs verified by humans?"

? "Could AI have missed important studies?"

"AI assistance is not a flaw; it is often an advantage.
But only if validated, only if disclosed.
Ask: Was the machine checked?
If the answer is unclear, so is the review."

Have you not considered the researcher
with unstable internet, limited compute,
no institutional subscription,
who still needs to synthesize evidence?

Free & Offline-Capable Tools

ASReview

Desktop app
Works offline

FREE

Abstrackr

Web-based
Free accounts

FREE

Rayyan

Free tier
Limited AI

FREEMIUM

RevMan

Cochrane tool
Full MA software

FREE

Offline Workflow

When Internet is Unreliable

Search Phase

↓

Library/café, download all PDFs: Batch download when connected

↓

Screening Phase

↓

ASReview desktop: Works fully offline

↓

Extraction Phase

↓

Spreadsheet + local PDFs: No AI needed

Low-Cost LLM Alternatives

WHEN API COSTS ARE PROHIBITIVE

• Claude/ChatGPT free tiers: Limited but functional
• Ollama + local models: Free, runs on laptop (requires download)
• Hugging Face inference: Free tier available
• Manual extraction: Still gold standard, just slower

HONEST ASSESSMENT

AI is a convenience, not a necessity.
All Cochrane reviews were done without AI.
Quality comes from methods, not tools.

Resource-Limited Decision Tree

Choosing Your Approach

Your Resources

↓

Internet reliability?

Stable

Web tools OK: Rayyan, Covidence

Unreliable

Desktop tools: ASReview offline

None

Manual + spreadsheets: Still valid

"The evidence belongs to everyone,
not only to those with fast internet and paid subscriptions.
The tools may differ. The methods remain.
Quality synthesis is possible anywhere."

Validation Calculations

Sample Sizes for AI Verification

Estimating Recall After AI Screening

THE PROBLEM

You stopped screening at 1,000 of 5,000 records.
How confident are you that you found all the relevant studies?

Validation Sampling

Unscreened records (n=4000)

↓

Random sample (n=400): 10% or at least 200

↓

Manual screening

0 relevant found: Recall ≈ 95-100%

Relevant found: Screen all remaining

Sample Size Formula

FOR 95% CONFIDENCE IN RECALL

                    n = ln(1 - confidence) / ln(1 - prevalence)

Example:

                    If prevalence of relevant = 1% (0.01)

                    For 95% confidence (0.95):

                    n = ln(1 - 0.95) / ln(1 - 0.01)

                    n = ln(0.05) / ln(0.99)

                    n ≈ 299 records to sample

Quick Reference Table

Sample Sizes for Validation

Prevalence 0.5%, 95% conf: 598 records

Prevalence 1%, 95% conf: 299 records

Prevalence 2%, 95% conf: 149 records

Prevalence 5%, 95% conf: 59 records

Practical minimum: 200 records (conservative)
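The table values follow directly from the sample size formula and can be reproduced with a few lines of Python. The function name `validation_sample_size` is an illustrative choice. The result is the number of records to sample so that at least one relevant record would be seen with the stated probability, if relevant records occur at the assumed prevalence.

```python
import math


def validation_sample_size(prevalence: float, confidence: float = 0.95) -> int:
    """n = ln(1 - confidence) / ln(1 - prevalence), rounded up to a whole record."""
    if not 0 < prevalence < 1 or not 0 < confidence < 1:
        raise ValueError("prevalence and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - prevalence))


for prev in (0.005, 0.01, 0.02, 0.05):
    print(f"prevalence {prev:.1%}: sample {validation_sample_size(prev)} records")
```

Rounding up rather than to the nearest integer keeps the achieved confidence at or above the target.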

Reporting Your Validation

Example methods text:

"We used ASReview LAB (v1.2) for title/abstract screening with
active learning. Screening ceased after 150 consecutive
irrelevant records, having screened 1,247 of 4,892 records
(25%). To validate recall, we manually screened a random
sample of 300 unscreened records. No additional relevant
studies were identified, suggesting an estimated recall ≥95%
(binomial 95% CI: 91-100%)."

"Validation is not optional; it is the price of efficiency.
Calculate your sample. Screen it manually.
Report what you found. Admit what you might have missed."