# MicroQA

MicroQA assists with quality assurance for the Internet Archive's microfiche scanning team by screening for indicators of common scanning and post-processing mishaps including missed focus, disorientation, and over-cropping. As the Democracy's Library project expands to archive millions of pages of official documents, automated screening with OCR and conventional image processing multiplies the human QA team's capacity to conduct targeted spot checks, re-reviews, and manual re-scans.

Two pluggable OCR back-ends are available:

- [Tesseract](https://github.com/tesseract-ocr/tesseract) -- CPU based, reliable, fast multi-lingual support
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) -- 1-shot orientation detection, offers GPU or NPU acceleration

## Quickstart

[Mise-en-place](https://mise.jdx.dev/) manages top-level dev dependencies, except for OCR back-ends. If you already have a recent version of `uv` installed, you may skip `mise install` and use the existing binary instead.

```sh
mise install
uv sync

sudo apt install tesseract-ocr  # Debian, Ubuntu
sudo dnf install tesseract      # Fedora
brew install tesseract          # macOS

# Run page diagnostics on a single Archive item:
uv run diagnostics.py micro_IA04244212_1665 --verbose
```

## Production Usage

MicroQA integrates with [Phonograph](https://www.phono.dev/) to coordinate between concurrent workers, store data, and make analysis results navigable at scale.
Phonograph exposes a PostgreSQL interface, so MicroQA can run independently if supplied a PostgreSQL database with a schema equivalent to:

```sql
create schema phono;

create table phono.items (
  _id uuid primary key not null default gen_random_uuid(),
  ia_id text,
  oai_updatedate timestamptz,
  started_date timestamptz,
  completed_date timestamptz,
  review_date text,
  url text,
  docs_link text
);

create table phono.docs (
  _id uuid primary key not null default gen_random_uuid(),
  name text,
  item uuid,
  pages_link text
);

create table phono.pages (
  _id uuid primary key not null default gen_random_uuid(),
  doc uuid,
  page numeric,
  page_angle numeric,
  sharpness numeric,
  is_blank text,
  text_margin_px numeric,
  url text
);
```

You can then start a worker with:

```sh
uv run main.py --database
```

## Test Cases

- Blurry pages: `micro_IA40244209_0984`
- Contrast, page orientation: `micro_IA40244211_2290`
- Crop, low quality fiche: `micro_IA40386420_0689`
- "Bite sized" SCOTUS doc with multiple viewable files and some blurry pages: `micro_IA40386007_0012`
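
To exercise all of the test cases above in one pass, a small helper script along these lines may be convenient. It is a minimal sketch and not part of MicroQA: the script itself, the `TEST_ITEMS` mapping, `build_command`, and the `--run` flag are all hypothetical names introduced here for illustration. The only command it relies on is the `uv run diagnostics.py <item> --verbose` invocation from the Quickstart.

```python
import shlex
import subprocess
import sys

# Item identifiers copied from the Test Cases list above; the notes are
# shortened descriptions of what each item is expected to exercise.
TEST_ITEMS = {
    "micro_IA40244209_0984": "blurry pages",
    "micro_IA40244211_2290": "contrast, page orientation",
    "micro_IA40386420_0689": "crop, low quality fiche",
    "micro_IA40386007_0012": "multi-file SCOTUS doc with some blurry pages",
}


def build_command(item_id: str) -> list[str]:
    """Return the Quickstart diagnostics command for one Archive item."""
    return ["uv", "run", "diagnostics.py", item_id, "--verbose"]


def main(argv: list[str]) -> int:
    # Hypothetical "--run" flag: without it, the script only prints the
    # commands (a dry run) and never invokes uv.
    execute = "--run" in argv
    for item_id, note in TEST_ITEMS.items():
        cmd = build_command(item_id)
        print(f"# {note}")
        print(shlex.join(cmd))
        if execute:
            # check=True raises CalledProcessError on the first failing item.
            subprocess.run(cmd, check=True)
    return 0


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run without arguments, the script prints the four commands so they can be reviewed or pasted into a shell. Passing `--run` from the repository root executes them in sequence and stops at the first failure.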