MicroQA

MicroQA assists with quality assurance for the Internet Archive's microfiche scanning team by screening for indicators of common scanning and post-processing mishaps, including missed focus, incorrect page orientation, and over-cropping. As the Democracy's Library project expands to archive millions of pages of official documents, automated screening with OCR and conventional image processing multiplies the human QA team's capacity to conduct targeted spot checks, re-reviews, and manual re-scans.

Two pluggable OCR back-ends are available:

Tesseract -- CPU-based, reliable, fast multilingual support
PaddleOCR -- one-shot orientation detection, offers GPU or NPU acceleration

Quickstart

Mise-en-place manages top-level dev dependencies, except for OCR back-ends. If you already have a recent version of uv installed, you may skip mise install and use the existing binary instead.

mise install

uv sync

# Install Tesseract with the package manager for your platform:
sudo apt install tesseract-ocr # Debian, Ubuntu
sudo dnf install tesseract # Fedora
brew install tesseract # macOS

# Run page diagnostics on a single Archive item:
uv run diagnostics.py micro_IA04244212_1665 --verbose

Production Usage

MicroQA integrates with Phonograph to coordinate between concurrent workers, store data, and make analysis results navigable at scale. Phonograph exposes a PostgreSQL interface, so MicroQA can run independently if supplied a PostgreSQL database with a schema equivalent to:

create schema phono;

create table phono.items (
  _id            uuid primary key not null default gen_random_uuid(),
  ia_id          text,
  oai_updatedate timestamptz,
  started_date   timestamptz,
  completed_date timestamptz,
  review_date    text,
  url            text,
  docs_link      text
);

create table phono.docs (
  _id        uuid primary key not null default gen_random_uuid(),
  name       text,
  item       uuid,
  pages_link text
);

create table phono.pages (
  _id            uuid primary key not null default gen_random_uuid(),
  doc            uuid,
  page           numeric,
  page_angle     numeric,
  sharpness      numeric,
  is_blank       text,
  text_margin_px numeric,
  url            text
);

You can then start a worker with:

uv run main.py --database <DATABASE URL>
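
Once workers have populated phono.pages, the rows can be triaged for manual spot checks. The following is a minimal Python sketch of that step, not part of MicroQA itself. The thresholds, the `Page` class, and the `review_reasons` helper are hypothetical. It also assumes that page_angle is in degrees, that a higher sharpness value means better focus, that a small text_margin_px suggests over-cropping, and that is_blank holds a truthy string such as "true" for blank pages.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds for illustration only; tune them against reviewed items.
MAX_ABS_ANGLE_DEG = 2.0   # assumes page_angle is measured in degrees
MIN_SHARPNESS = 0.5       # assumes higher sharpness means better focus
MIN_TEXT_MARGIN_PX = 10   # assumes a small margin suggests over-cropping


@dataclass
class Page:
    """A subset of the columns of one phono.pages row."""
    page: int
    page_angle: Optional[float]
    sharpness: Optional[float]
    is_blank: Optional[str]
    text_margin_px: Optional[float]


def review_reasons(p: Page) -> list[str]:
    """Return the reasons a page might deserve a manual spot check."""
    # Assumption: is_blank stores a truthy string for blank pages.
    if (p.is_blank or "").strip().lower() in {"true", "t", "yes", "1"}:
        return []  # blank pages have no text to measure
    reasons = []
    # Missing (NULL) measurements are skipped rather than flagged.
    if p.page_angle is not None and abs(p.page_angle) > MAX_ABS_ANGLE_DEG:
        reasons.append("orientation")
    if p.sharpness is not None and p.sharpness < MIN_SHARPNESS:
        reasons.append("focus")
    if p.text_margin_px is not None and p.text_margin_px < MIN_TEXT_MARGIN_PX:
        reasons.append("crop")
    return reasons
```

Rows fetched with any PostgreSQL client can be mapped onto `Page` and filtered with `review_reasons`. Pages that return a non-empty list are candidates for re-review or re-scanning.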

Test Cases

Blurry pages: micro_IA40244209_0984
Contrast, page orientation: micro_IA40244211_2290
Crop, low quality fiche: micro_IA40386420_0689
"Bite sized" SCOTUS doc with multiple viewable files and some blurry pages: micro_IA40386007_0012
