# MicroQA

MicroQA assists with quality assurance for the Internet Archive's microfiche
scanning team by screening for indicators of common scanning and post-processing
mishaps including missed focus, disorientation, and over-cropping. As the
Democracy's Library project expands to archive millions of pages of official
documents, automated screening with OCR and conventional image processing
multiplies the human QA team's capacity to conduct targeted spot checks,
re-reviews, and manual re-scans.

Two pluggable OCR back-ends are available:

- [Tesseract](https://github.com/tesseract-ocr/tesseract) -- CPU-based,
  reliable, fast multilingual support
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) -- one-shot orientation
  detection, offers GPU or NPU acceleration

## Quickstart

[Mise-en-place](https://mise.jdx.dev/) manages top-level dev dependencies,
except for OCR back-ends. If you already have a recent version of `uv`
installed, you may skip `mise install` and use the existing binary instead.

```sh
mise install
uv sync

# Install Tesseract with the package manager for your platform:
sudo apt install tesseract-ocr  # Debian, Ubuntu
sudo dnf install tesseract      # Fedora
brew install tesseract          # macOS

# Run page diagnostics on a single Archive item:
uv run diagnostics.py micro_IA04244212_1665 --verbose
```

## Production Usage

MicroQA integrates with [Phonograph](https://www.phono.dev/) to coordinate
between concurrent workers, store data, and make analysis results navigable at
scale. Phonograph exposes a PostgreSQL interface, so MicroQA can run
independently if supplied with a PostgreSQL database whose schema is equivalent to:

```sql
create schema phono;

create table phono.items (
    _id uuid primary key not null default gen_random_uuid(),
    ia_id text,
    oai_updatedate timestamptz,
    started_date timestamptz,
    completed_date timestamptz,
    review_date text,
    url text,
    docs_link text
);

create table phono.docs (
    _id uuid primary key not null default gen_random_uuid(),
    name text,
    item uuid,
    pages_link text
);

create table phono.pages (
    _id uuid primary key not null default gen_random_uuid(),
    doc uuid,
    page numeric,
    page_angle numeric,
    sharpness numeric,
    is_blank text,
    text_margin_px numeric,
    url text
);
```

You can then start a worker with:

```sh
uv run main.py --database <DATABASE URL>
```
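
Once workers have populated `phono.pages`, its columns can drive targeted spot checks. The following is a minimal sketch of that triage step, not MicroQA's own logic: the function name, the thresholds, the units of `page_angle` (degrees), the direction of `sharpness` (higher is sharper), and the string encoding of `is_blank` are all assumptions made for illustration.

```python
from typing import Any, Mapping

# Hypothetical thresholds; calibrate them against known-bad items before use.
MIN_SHARPNESS = 0.5       # assumed: higher values mean a sharper page
MAX_ABS_ANGLE = 5.0       # assumed: page_angle is in degrees, 0 means upright
MIN_TEXT_MARGIN_PX = 10   # assumed: small margins suggest over-cropping


def review_reasons(page: Mapping[str, Any]) -> list[str]:
    """Return the reasons one phono.pages row may deserve a manual look."""
    # Assumed encoding: is_blank is a text column holding "true" or "false".
    if str(page.get("is_blank", "")).lower() == "true":
        return []  # a blank page has no text to judge

    reasons = []
    sharpness = page.get("sharpness")
    if sharpness is not None and float(sharpness) < MIN_SHARPNESS:
        reasons.append("blurry")

    angle = page.get("page_angle")
    if angle is not None:
        # Map the angle into [-180, 180) so that 359 counts as -1.
        normalized = (float(angle) + 180.0) % 360.0 - 180.0
        if abs(normalized) > MAX_ABS_ANGLE:
            reasons.append("rotated")

    margin = page.get("text_margin_px")
    if margin is not None and float(margin) < MIN_TEXT_MARGIN_PX:
        reasons.append("tight margins")

    return reasons
```

The function accepts any mapping keyed by column name, so rows fetched with a PostgreSQL client of your choice can be passed in directly. `float()` also accepts the `Decimal` values that clients commonly return for `numeric` columns.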
## Test Cases

- Blurry pages: `micro_IA40244209_0984`
- Contrast, page orientation: `micro_IA40244211_2290`
- Crop, low quality fiche: `micro_IA40386420_0689`
- "Bite-sized" SCOTUS doc with multiple viewable files and some blurry pages:
  `micro_IA40386007_0012`
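
To exercise all of the items above in one pass, a small driver can reuse the diagnostics command from the Quickstart. This is only a sketch: the helper names `diagnostics_command` and `run_all` are hypothetical, and it assumes `uv` is on the `PATH` and that the script is run from the repository root.

```python
import subprocess

# Item identifiers copied from the list above.
TEST_ITEMS = [
    "micro_IA40244209_0984",  # blurry pages
    "micro_IA40244211_2290",  # contrast, page orientation
    "micro_IA40386420_0689",  # crop, low quality fiche
    "micro_IA40386007_0012",  # multiple viewable files, some blurry pages
]


def diagnostics_command(item_id: str) -> list[str]:
    """Build the Quickstart diagnostics command for one Archive item."""
    return ["uv", "run", "diagnostics.py", item_id, "--verbose"]


def run_all(items: list[str] = TEST_ITEMS) -> dict[str, int]:
    """Run diagnostics for each item in turn and collect the exit codes."""
    return {
        item: subprocess.run(diagnostics_command(item)).returncode
        for item in items
    }
```

Calling `run_all()` returns a mapping from item identifier to exit code, which makes it easy to see which items failed to process.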