# MicroQA

MicroQA assists with quality assurance for the Internet Archive's microfiche
scanning team by screening for indicators of common scanning and post-processing
mishaps, including missed focus, incorrect page orientation, and over-cropping.
As the Democracy's Library project expands to archive millions of pages of
official documents, automated screening with OCR and conventional image
processing multiplies the human QA team's capacity to conduct targeted spot
checks, re-reviews, and manual re-scans.

Two pluggable OCR back-ends are available:

- [Tesseract](https://github.com/tesseract-ocr/tesseract) -- CPU-based,
  reliable, fast multilingual support
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) -- one-shot orientation
  detection, offers GPU or NPU acceleration

## Quickstart

[Mise-en-place](https://mise.jdx.dev/) manages top-level dev dependencies,
except for OCR back-ends. If you already have a recent version of `uv`
installed, you may skip `mise install` and use the existing binary instead.

```sh
# Install mise-managed tools (skip if a recent `uv` is already installed):
mise install

# Install Python dependencies:
uv sync

# Install Tesseract with whichever package manager matches your system:
sudo apt install tesseract-ocr # Debian, Ubuntu
sudo dnf install tesseract # Fedora
brew install tesseract # macOS

# Run page diagnostics on a single Archive item:
uv run diagnostics.py micro_IA04244212_1665 --verbose
```
## Production Usage

MicroQA integrates with [Phonograph](https://www.phono.dev/) to coordinate
concurrent workers, store data, and make analysis results navigable at scale.
Phonograph exposes a PostgreSQL interface, so MicroQA can run independently if
supplied with a PostgreSQL database whose schema is equivalent to:

```sql
create schema phono;

create table phono.items (
    _id uuid primary key not null default gen_random_uuid(),
    ia_id text,
    oai_updatedate timestamptz,
    started_date timestamptz,
    completed_date timestamptz,
    review_date text,
    url text,
    docs_link text
);

create table phono.docs (
    _id uuid primary key not null default gen_random_uuid(),
    name text,
    item uuid,
    pages_link text
);

create table phono.pages (
    _id uuid primary key not null default gen_random_uuid(),
    doc uuid,
    page numeric,
    page_angle numeric,
    sharpness numeric,
    is_blank text,
    text_margin_px numeric,
    url text
);
```

You can then start a worker with:

```sh
uv run main.py --database <DATABASE URL>
```
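
Once workers have populated these tables, pages of interest can be pulled out with an ordinary join. The sketch below is illustrative only and not part of MicroQA. It assumes that `pages.doc` refers to `docs._id` and `docs.item` refers to `items._id` (the schema above declares no foreign keys), and that lower `sharpness` values mean blurrier pages. To stay runnable without a PostgreSQL server it uses an in-memory SQLite database as a simplified stand-in. The function names, sample rows, and the `0.5` cutoff are hypothetical.

```python
import sqlite3

# Simplified stand-in for the phono tables. SQLite has no schemas, uuid type,
# or gen_random_uuid(), so ids are plain text and only a few columns are kept.
SCHEMA = """
create table items (_id text primary key, ia_id text);
create table docs  (_id text primary key, name text, item text);
create table pages (_id text primary key, doc text, page numeric,
                    sharpness numeric);
"""

# Assumed relationships: pages.doc -> docs._id, docs.item -> items._id.
QUERY = """
select i.ia_id, d.name, p.page, p.sharpness
from pages p
join docs d on d._id = p.doc
join items i on i._id = d.item
where p.sharpness < ?
order by p.sharpness
"""


def find_soft_pages(conn, threshold):
    """Return (ia_id, doc name, page, sharpness) rows below the threshold."""
    return conn.execute(QUERY, (threshold,)).fetchall()


def demo_connection():
    """Build an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("insert into items values ('i1', 'example_item')")
    conn.execute("insert into docs values ('d1', 'example.pdf', 'i1')")
    conn.executemany(
        "insert into pages values (?, 'd1', ?, ?)",
        [("p1", 1, 0.9), ("p2", 2, 0.2), ("p3", 3, 0.4)],
    )
    return conn


if __name__ == "__main__":
    for row in find_soft_pages(demo_connection(), threshold=0.5):
        print(row)
```

Against a real PostgreSQL database the same join would target `phono.items`, `phono.docs`, and `phono.pages`, and the placeholder syntax depends on the driver in use.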
## Test Cases

- Blurry pages: `micro_IA40244209_0984`
- Contrast, page orientation: `micro_IA40244211_2290`
- Crop, low-quality fiche: `micro_IA40386420_0689`
- "Bite-sized" SCOTUS doc with multiple viewable files and some blurry pages:
  `micro_IA40386007_0012`
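
To exercise every test case in one pass, the Quickstart command can be generated for each item. This is a minimal sketch, assuming `diagnostics.py` takes a single item identifier per invocation as in the Quickstart example. The helper `diagnostics_command` and the `TEST_ITEMS` list are hypothetical names, not part of MicroQA.

```python
import shlex

# Item identifiers copied from the list above.
TEST_ITEMS = [
    "micro_IA40244209_0984",
    "micro_IA40244211_2290",
    "micro_IA40386420_0689",
    "micro_IA40386007_0012",
]


def diagnostics_command(item_id, verbose=True):
    """Build the argv for one diagnostics run, mirroring the Quickstart."""
    argv = ["uv", "run", "diagnostics.py", item_id]
    if verbose:
        argv.append("--verbose")
    return argv


if __name__ == "__main__":
    # Print the commands rather than executing them; pass each argv to
    # subprocess.run(argv, check=True) to actually run the checks.
    for item in TEST_ITEMS:
        print(shlex.join(diagnostics_command(item)))
```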