# MicroQA

MicroQA assists with quality assurance for the Internet Archive's microfiche
scanning team by screening for indicators of common scanning and post-processing
mishaps including missed focus, disorientation, and over-cropping. As the
Democracy's Library project expands to archive millions of pages of official
documents, automated screening with OCR and conventional image processing
multiplies the human QA team's capacity to conduct targeted spot checks,
re-reviews, and manual re-scans.

Two pluggable OCR back-ends are available:

- [Tesseract](https://github.com/tesseract-ocr/tesseract) -- CPU-based and
  reliable, with fast multilingual support
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) -- one-shot orientation
  detection, with optional GPU or NPU acceleration

## Quickstart

[Mise-en-place](https://mise.jdx.dev/) manages top-level dev dependencies,
except for OCR back-ends. If you already have a recent version of `uv`
installed, you may skip `mise install` and use the existing binary instead.

```sh
mise install

uv sync

# Install Tesseract with your platform's package manager (run one of these):
sudo apt install tesseract-ocr # Debian, Ubuntu
sudo dnf install tesseract # Fedora
brew install tesseract # macOS

# Run page diagnostics on a single Archive item:
uv run diagnostics.py micro_IA04244212_1665 --verbose
```

## Production Usage

MicroQA integrates with [Phonograph](https://www.phono.dev/) to coordinate
between concurrent workers, store data, and make analysis results navigable at
scale. Phonograph exposes a PostgreSQL interface, so MicroQA can run
independently if supplied a PostgreSQL database with a schema equivalent to:

```sql
create schema phono;

create table phono.items (
  _id            uuid primary key not null default gen_random_uuid(),
  ia_id          text,
  oai_updatedate timestamptz,
  started_date   timestamptz,
  completed_date timestamptz,
  review_date    text,
  url            text,
  docs_link      text
);

create table phono.docs (
  _id        uuid primary key not null default gen_random_uuid(),
  name       text,
  item       uuid,
  pages_link text
);

create table phono.pages (
  _id            uuid primary key not null default gen_random_uuid(),
  doc            uuid,
  page           numeric,
  page_angle     numeric,
  sharpness      numeric,
  is_blank       text,
  text_margin_px numeric,
  url            text
);
```
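
As a rough illustration of how the `phono.pages` columns could drive spot checks, the sketch below mirrors the analysis columns in a dataclass and lists reasons a page might deserve human review. It is not MicroQA's actual logic: the `PageRow` and `review_reasons` names, the thresholds, the assumption that `page_angle` is in degrees, and the assumption that `is_blank` holds `"true"` or `"false"` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRow:
    """Hypothetical mirror of the analysis columns in phono.pages."""
    page: int
    page_angle: Optional[float]      # assumed to be degrees
    sharpness: Optional[float]       # assumed: higher means sharper
    is_blank: Optional[str]          # assumed to hold "true" or "false"
    text_margin_px: Optional[float]


# Illustrative thresholds only; real values would need tuning on scans.
MIN_SHARPNESS = 0.1
MAX_ANGLE_DEG = 5.0
MIN_MARGIN_PX = 10.0


def review_reasons(row: PageRow) -> list[str]:
    """Return the reasons a page might warrant a manual spot check."""
    if row.is_blank == "true":
        # Blank pages have no text to judge focus, angle, or margins by.
        return []
    reasons = []
    if row.sharpness is not None and row.sharpness < MIN_SHARPNESS:
        reasons.append("low sharpness")
    if row.page_angle is not None and abs(row.page_angle) > MAX_ANGLE_DEG:
        reasons.append("orientation")
    if row.text_margin_px is not None and row.text_margin_px < MIN_MARGIN_PX:
        reasons.append("tight crop")
    return reasons
```

Columns that are `NULL` in the database map to `None` here and are skipped, so a page with partial results is flagged only on the measurements it has.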

You can then start a worker with:

```sh
uv run main.py --database <DATABASE URL>
```

## Test Cases

- Blurry pages: `micro_IA40244209_0984`
- Contrast, page orientation: `micro_IA40244211_2290`
- Crop, low-quality fiche: `micro_IA40386420_0689`
- "Bite-sized" SCOTUS doc with multiple viewable files and some blurry pages:
  `micro_IA40386007_0012`