
Methodology

This page documents the data provenance, processing pipeline, and known limitations of the VICP Registry. It is kept separate from the About page because the question "how reliable is this data?" deserves its own canonical answer.

Short version: public U.S. Court of Federal Claims records → automated download from GovInfo → text extraction → AI-assisted structured-field extraction (Google Gemini) → manual review on a rolling basis → SQLite database that powers this site. Every case page links back to its source PDF on the federal site.

Data sources

Every case in the registry traces back to one or more of the source documents above. The PDF link on each case page goes directly to the GovInfo-served original.
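
As a rough illustration of how a case page could point back to its source PDF, the sketch below builds a GovInfo content URL from a package ID and a granule ID. The URL template, the function name source_pdf_url, and the example identifiers are assumptions made for illustration; they are not taken from the registry's code, so treat the link on the actual case page as authoritative.

```python
from urllib.parse import quote

# Assumed GovInfo content-URL layout (not confirmed by this page);
# verify against the PDF link on a real case page before relying on it.
GOVINFO_PDF_TEMPLATE = (
    "https://www.govinfo.gov/content/pkg/{package_id}/pdf/{granule_id}.pdf"
)


def source_pdf_url(package_id: str, granule_id: str) -> str:
    """Build a link to the source PDF for one granule of a package.

    Both identifiers are percent-encoded so that unexpected characters
    cannot alter the path structure of the URL.
    """
    if not package_id or not granule_id:
        raise ValueError("package_id and granule_id are both required")
    return GOVINFO_PDF_TEMPLATE.format(
        package_id=quote(package_id, safe=""),
        granule_id=quote(granule_id, safe=""),
    )


# Hypothetical identifiers, shown only to illustrate the shape of the IDs.
example_url = source_pdf_url(
    "USCOURTS-cofc-1_20-vv-01234",
    "USCOURTS-cofc-1_20-vv-01234-0",
)
```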

Processing pipeline

The pipeline runs as five sequential stages. Each stage writes to its own SQLite table and is independently re-runnable.

  1. Discovery: query GovInfo's collection API for new packages in the USCOURTS-cofc collection that match the Vaccine Program docket pattern. Output: raw_packages table.
  2. Download: list every granule (decision, opinion, stipulation) for each package, then triage them. Purely clerical entries (notices, scheduling orders) are skipped; substantive rulings are downloaded. Output: PDF files on disk + raw_granules table.
  3. Text extraction: pdfplumber-based PDF→text conversion. Output: staging_text table with full plain-text bodies. Page offsets are retained for citation.
  4. Structured-field extraction: Google Gemini 2.5 Flash-Lite reads the staged text and emits one JSON record per case with these fields: petitioner identifier, vaccine type, condition (raw text and canonical category), decision date, filing date, outcome, compensation amount, age at vaccination, time to onset, dose number, concurrent vaccines, theory of causation, is_death flag, AI summary, and several derived flags. Output: curated_cases table at extraction_version='gemini-v2'.
  5. Curation & review: high-volume or contested cases, all death cases, and any case flagged for known issues receive a human-review pass. Corrections are written back as extraction_version='hybrid-v3'. Audit lineage is preserved in audit_field_sources.
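
The hand-off between stages 4 and 5 can be sketched with the standard-library sqlite3 module. This is a minimal sketch under stated assumptions, not the registry's actual code: the table names and extraction_version values come from the list above, while the column layout, the function names, and the rule that a re-extraction never overwrites a human-reviewed row are hypothetical choices made for illustration.

```python
import sqlite3

# Columns below are assumed; only the table names appear in the methodology.
SCHEMA = """
CREATE TABLE IF NOT EXISTS curated_cases (
    package_id          TEXT PRIMARY KEY,
    outcome             TEXT,
    compensation_amount REAL,
    extraction_version  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS audit_field_sources (
    package_id TEXT NOT NULL,
    field      TEXT NOT NULL,
    old_value  TEXT,
    new_value  TEXT,
    source     TEXT NOT NULL
);
"""

# Whitelist of correctable columns, so field names are never interpolated
# into SQL from untrusted input.
CORRECTABLE_FIELDS = {"outcome", "compensation_amount"}


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)


def upsert_extracted(conn, package_id, outcome, amount):
    """Stage 4 write. Re-running it is idempotent, and (by assumption)
    it leaves rows that already carry human corrections untouched."""
    with conn:
        conn.execute(
            """
            INSERT INTO curated_cases
                (package_id, outcome, compensation_amount, extraction_version)
            VALUES (?, ?, ?, 'gemini-v2')
            ON CONFLICT(package_id) DO UPDATE SET
                outcome = excluded.outcome,
                compensation_amount = excluded.compensation_amount,
                extraction_version = excluded.extraction_version
            WHERE curated_cases.extraction_version != 'hybrid-v3'
            """,
            (package_id, outcome, amount),
        )


def apply_correction(conn, package_id, field, new_value, source="human-review"):
    """Stage 5 write: record the old and new values for lineage, then
    update the case and mark it as hybrid-v3."""
    if field not in CORRECTABLE_FIELDS:
        raise ValueError(f"field not correctable: {field}")
    with conn:  # one transaction: audit row and update succeed or fail together
        row = conn.execute(
            f"SELECT {field} FROM curated_cases WHERE package_id = ?",
            (package_id,),
        ).fetchone()
        if row is None:
            raise KeyError(package_id)
        old_value = None if row[0] is None else str(row[0])
        conn.execute(
            "INSERT INTO audit_field_sources VALUES (?, ?, ?, ?, ?)",
            (package_id, field, old_value, str(new_value), source),
        )
        conn.execute(
            f"UPDATE curated_cases SET {field} = ?, "
            "extraction_version = 'hybrid-v3' WHERE package_id = ?",
            (new_value, package_id),
        )
```

Keying each table on the package identifier and writing with an upsert is one way to make a stage safe to re-run; logging the previous value before each correction keeps a field-level trail that can be checked against the source PDF.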

Known limitations

The registry is best-effort and has documented limitations. The single most important rule: always verify any specific fact against the source PDF linked on the case page.

Update cadence

New cases are pulled from GovInfo on a rolling basis (typically within 1–4 weeks of publication). Re-extraction passes happen periodically when prompt or schema improvements are made — usually quarterly. Major audits (like the 2026-05 award-amount correction sweep) are documented in the changelog area of the data and announced in this page's revision history.

Last large pipeline change: extraction schema v2 (Gemini-based curation), 2026-04. Last data audit: 2026-05.

Reporting errors

If you spot an inaccuracy, please report it. Corrections are reviewed against the source PDF and applied within ~7 days. Reach out via Ko-fi message or via X (@not_stefan0).

Things that help when reporting: the case URL or package_id, the field that's wrong, what the correct value should be, and a quote from the source PDF supporting the correction.

Not affiliated with the U.S. Department of Health and Human Services (HHS), the Vaccine Injury Compensation Program (VICP), the Health Resources and Services Administration (HRSA), or any government agency. This is an independent research project. Official program information is at hrsa.gov/vaccine-compensation.