Methodology
This page documents the data provenance, processing pipeline, and known limitations of the VICP Registry. It is kept separate from the About page because the question "how reliable is this data?" deserves its own canonical answer.
Short version: public U.S. Court of Federal Claims records → automated download from GovInfo → text extraction → AI-assisted structured-field extraction (Google Gemini) → manual review on a rolling basis → SQLite database that powers this site. Every case page links back to its source PDF on the federal site.
Data sources
- GovInfo.gov — the U.S. Government Publishing Office. Primary source: official PDFs of every published Special Master decision, Court of Federal Claims opinion, and stipulation in the Vaccine Program. Coverage: 1988 to present. Authentication: each document carries a GovInfo-issued cryptographic signature.
- CourtListener / Free Law Project — cross-reference source for opinions that GovInfo has not yet ingested or that exist only in the historical Harvard archive. Used selectively to fill pre-2010 gaps and to recover cases where GovInfo returns 404.
- U.S. Court of Federal Claims docket pages — public PACER records. Cross-referenced for filing dates, party names, and case histories that aren't in the published opinion text.
- HRSA published statistics — used to validate aggregate counts (totals filed, totals adjudicated). The CDC and HRSA do not publish per-case data; this registry is the closest public-facing equivalent.
Every case in the registry traces back to one or more of the source documents above. The PDF link on each case page goes directly to the GovInfo-served original.
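As a rough illustration of how a case record can be tied back to its GovInfo original, the sketch below parses a package identifier and builds a PDF link from it. The identifier shape (for example "USCOURTS-cofc-1_15-vv-00123", with "vv" as the Vaccine Program case-type code), the URL layout, and the function names are all assumptions made for this example. The registry's actual matching rules are not documented on this page.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: Vaccine Program package IDs look like "USCOURTS-cofc-1_15-vv-00123".
# The field widths and the "vv" case-type code are illustrative, not confirmed.
PACKAGE_RE = re.compile(r"^USCOURTS-cofc-(\d+)_(\d{2})-vv-(\d+)$")


class VaccineDocket(NamedTuple):
    office: str    # leading court-office digit(s)
    year: str      # two-digit filing year
    sequence: str  # docket sequence number, zero-padded as published


def parse_package_id(package_id: str) -> Optional[VaccineDocket]:
    """Return the docket parts, or None if this is not a vaccine-docket package."""
    match = PACKAGE_RE.match(package_id)
    if match is None:
        return None
    return VaccineDocket(*match.groups())


def pdf_url(package_id: str, granule_index: int) -> str:
    """Build a link to one granule's PDF (ASSUMED GovInfo URL layout)."""
    granule_id = f"{package_id}-{granule_index}"
    return f"https://www.govinfo.gov/content/pkg/{package_id}/pdf/{granule_id}.pdf"
```

A caller would typically run parse_package_id first and only build a link when it returns a docket, so that non-vaccine packages from the same court are skipped.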
Processing pipeline
The pipeline runs as five sequential stages. Each stage writes to its own SQLite table and is independently re-runnable.
- Discovery: query GovInfo's collection API for new packages in the USCOURTS-cofc collection matching the Vaccine Program docket pattern. Output: raw_packages table.
- Download: fetch every granule (decision, opinion, stipulation) listed for each package. Triage: skip pure clerical entries (notices, scheduling orders) and download substantive rulings. Output: PDF files on disk + raw_granules table.
- Text extraction: pdfplumber-based PDF→text. Output: staging_text table with full plain-text bodies. Page offsets retained for citation.
- Structured-field extraction: Google Gemini 2.5 Flash-Lite reads the staged text and emits a JSON record per case: petitioner identifier, vaccine type, condition raw + canonical category, decision date, filing date, outcome, compensation amount, age at vaccination, time to onset, dose number, concurrent vaccines, theory of causation, is_death flag, AI summary, and several derived flags. Output: curated_cases table at extraction_version='gemini-v2'.
- Curation & review: high-volume / contested cases, all death cases, and any case flagged for known issues receive a human-review pass. Corrections are written back as extraction_version='hybrid-v3'. Audit lineage preserved in audit_field_sources.
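The stage-per-table layout described above can be sketched with Python's built-in sqlite3 module. The table names and the extraction_version values come from this page; every column name, the "human-review" source label, and both helper functions are hypothetical and exist only to show how a re-runnable stage and an audited correction might look.

```python
import sqlite3

# Table names are from the pipeline description; all columns are ASSUMED.
SCHEMA = """
CREATE TABLE IF NOT EXISTS raw_packages (package_id TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS raw_granules (
    granule_id TEXT PRIMARY KEY,
    package_id TEXT NOT NULL REFERENCES raw_packages(package_id),
    pdf_path   TEXT
);
CREATE TABLE IF NOT EXISTS staging_text (
    granule_id TEXT PRIMARY KEY REFERENCES raw_granules(granule_id),
    body       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS curated_cases (
    package_id         TEXT PRIMARY KEY REFERENCES raw_packages(package_id),
    outcome            TEXT,
    award_amount_usd   REAL,
    extraction_version TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS audit_field_sources (
    package_id TEXT NOT NULL,
    field      TEXT NOT NULL,
    old_value  TEXT,
    new_value  TEXT,
    source     TEXT NOT NULL
);
"""

# Hypothetical whitelist of fields a reviewer may correct in this sketch.
EDITABLE_FIELDS = {"outcome", "award_amount_usd"}


def record_discovery(conn: sqlite3.Connection, package_ids: list[str]) -> None:
    """Re-runnable discovery write: packages already present are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO raw_packages (package_id) VALUES (?)",
        [(p,) for p in package_ids],
    )


def apply_review_correction(conn, package_id: str, field: str, new_value) -> None:
    """Overwrite one field, mark the row hybrid-v3, and log the change."""
    if field not in EDITABLE_FIELDS:
        raise ValueError(f"field not editable in this sketch: {field}")
    row = conn.execute(
        f"SELECT {field} FROM curated_cases WHERE package_id = ?", (package_id,)
    ).fetchone()
    if row is None:
        raise KeyError(package_id)
    old_value = None if row[0] is None else str(row[0])
    conn.execute(
        f"UPDATE curated_cases SET {field} = ?, extraction_version = 'hybrid-v3' "
        "WHERE package_id = ?",
        (new_value, package_id),
    )
    conn.execute(
        "INSERT INTO audit_field_sources "
        "(package_id, field, old_value, new_value, source) VALUES (?, ?, ?, ?, ?)",
        (package_id, field, old_value, str(new_value), "human-review"),
    )
```

Using INSERT OR IGNORE keyed on the package identifier is one simple way to make a stage safe to re-run; the real pipeline may handle this differently.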
Known limitations
The registry is best-effort and has documented limitations. The single most important rule: always verify any specific fact against the source PDF linked on the case page.
- AI extraction is imperfect. Petitioner names, dates, and compensation amounts may contain errors, particularly for older cases (1990s) where PDF text quality is poor and for complex cases involving multiple stipulations or amended petitions. The human-review sampling rate is high for receipt-cohort cases (death cases, $1M+ compensations) but only moderate for the long tail of routine adult SIRVA stipulations.
- Award-amount unit inconsistency (now corrected). The award_amount_usd field had a known regex-extraction bug where some high-value cases were stored as cents (e.g., $2,093,116.37 stored as integer 209311637) while others were stored as dollars. A May 2026 audit corrected ~75 affected entries by cross-referencing each value against the case_summary text. Corrections are logged in the audit trail.
- Petitioner names are sometimes incomplete. Pre-2026 extraction stored only the surname for many cases ("Roedl" rather than "Mary Roedl, on behalf of Brian J. Roedl"). A backfill is in progress; until it completes, some titles read as last-name-only. The original source PDF always contains the full case caption.
- Anonymized initials are preserved as-is. When the court chose to anonymize a minor's name as initials (e.g., "M.M.", "K.S.J., Jr."), the registry preserves that convention. We do not attempt to de-anonymize. Initials are stored all-caps with dots (the court's convention).
- Special Master extraction is regex-based. ~95% of cases have the Special Master's name correctly identified; ~5% have parse errors or use legacy court-of-claims judges (pre-1988 program era). Cases where the SM is missing show "—" in the Special Master row of the case-facts panel.
- Case summaries are AI-generated. Each case has an AI-written summary that paraphrases the source decision. Summaries are clearly labeled and may omit or mischaracterize subtle legal reasoning. The disclaimer "AI summaries can sometimes make mistakes" appears on every summary panel.
- The corpus is filed cases, not injured population. Compensation rates and outcome distributions in this database describe the people who filed petitions in the program. They are not population-level injury rates. Many vaccine injuries never become VICP filings; many filings face procedural barriers unrelated to underlying clinical fact.
- Some cases are missing from the corpus entirely. Cases that were never published (settled pre-decision with no docketed opinion) are not in GovInfo and therefore not here. Some pre-2010 historical cases are CourtListener-only and may have been added selectively.
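To make the award-amount limitation concrete, here is a minimal sketch of how a stored value could be checked against dollar figures quoted in the summary text to decide whether it was saved as dollars or as cents. This is an assumption about how such a cross-reference might work, not the audit's actual code; the function names and the regular expression are hypothetical.

```python
import re
from decimal import Decimal
from typing import Optional, Set, Union

# Matches "$2,093,116.37", "$75,000", "$75000.00", and similar figures.
DOLLAR_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+(?:\.\d{2})?|\d+(?:\.\d{2})?)")


def amounts_in_text(text: str) -> Set[Decimal]:
    """Collect every dollar figure mentioned in the text as an exact Decimal."""
    return {Decimal(m.replace(",", "")) for m in DOLLAR_RE.findall(text)}


def reconcile_award(stored: Union[int, float], summary: str) -> Optional[Decimal]:
    """Return the dollar amount the summary supports, or None if it is unclear.

    The stored value is tried first as dollars, then as cents (divided by 100).
    """
    mentioned = amounts_in_text(summary)
    as_dollars = Decimal(str(stored))
    as_cents = as_dollars / 100
    if as_dollars in mentioned:
        return as_dollars
    if as_cents in mentioned:
        return as_cents
    return None  # summary does not settle it; needs a look at the source PDF


# Example using the figure quoted in the limitation above.
fixed = reconcile_award(209311637, "Petitioner was awarded $2,093,116.37.")
```

Decimal is used instead of float so that values such as 2093116.37 compare exactly. Cases where the function returns None are the ones that would still need manual checking against the source PDF.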
Update cadence
New cases are pulled from GovInfo on a rolling basis (typically within 1–4 weeks of publication). Re-extraction passes run periodically, usually quarterly, when prompt or schema improvements are made. Major audits (like the 2026-05 award-amount correction sweep) are documented in the data changelog and announced in this page's revision history.
Last large pipeline change: extraction schema v2 (Gemini-based curation), 2026-04. Last data audit: 2026-05.
Reporting errors
If you spot an inaccuracy, please report it. Corrections are reviewed against the source PDF and applied within ~7 days. Reach out via Ko-fi message or via X (@not_stefan0).
Things that help when reporting: the case URL or package_id, the field that's wrong, what the correct value should be, and a quote from the source PDF supporting the correction.
Not affiliated with the U.S. Department of Health and Human Services (HHS), the Vaccine Injury Compensation Program (VICP), the Health Resources and Services Administration (HRSA), or any government agency. This is an independent research project. Official program information is at hrsa.gov/vaccine-compensation.