# MicroQA

MicroQA assists with quality assurance for the Internet Archive's microfiche
scanning team by screening for indicators of common scanning and post-processing
mishaps, including missed focus, incorrect page orientation, and over-cropping. As the
Democracy's Library project expands to archive millions of pages of official
documents, automated screening with OCR and conventional image processing
multiplies the human QA team's capacity to conduct targeted spot checks,
re-reviews, and manual re-scans.

Two pluggable OCR back-ends are available:

- [Tesseract](https://github.com/tesseract-ocr/tesseract) -- CPU-based,
  reliable, fast, with multi-lingual support
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) -- one-shot orientation
  detection, with optional GPU or NPU acceleration
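
As an illustration of the kind of conventional image processing that can screen
for missed focus, the sketch below computes the variance of a Laplacian filter
response, a common sharpness heuristic. It is not taken from MicroQA's source:
the function name `laplacian_variance` and the toy pixel grids are assumptions
made for this example, and the project's actual metric may differ.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over a 2-D grayscale grid.

    `gray` is a list of equal-length rows of pixel intensities and must be
    at least 3x3, since border pixels are skipped.
    Hypothetical helper for illustration; not part of MicroQA.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


# A hard vertical edge versus the same edge smeared into a gradient.
sharp = [[0, 0, 255, 255]] * 4
blurry = [[0, 85, 170, 255]] * 4

# The in-focus edge produces a much larger score than the smeared one.
print(laplacian_variance(sharp) > laplacian_variance(blurry))
```

A low score relative to other pages in the same item suggests soft focus, though
any cutoff would need tuning against known-blurry scans.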

## Quickstart

[Mise-en-place](https://mise.jdx.dev/) manages top-level dev dependencies,
except for OCR back-ends. If you already have a recent version of `uv`
installed, you may skip `mise install` and use the existing binary instead.

```sh
mise install

uv sync

# Install Tesseract using whichever command matches your OS:
sudo apt install tesseract-ocr # Debian, Ubuntu
sudo dnf install tesseract # Fedora
brew install tesseract # macOS

# Run page diagnostics on a single Archive item:
uv run diagnostics.py micro_IA04244212_1665 --verbose
```

## Production Usage

MicroQA integrates with [Phonograph](https://www.phono.dev/) to coordinate
concurrent workers, store data, and make analysis results navigable at
scale. Phonograph exposes a PostgreSQL interface, so MicroQA can also run
without it when supplied with a PostgreSQL database whose schema is equivalent to:

```sql
create schema phono;

create table phono.items (
  _id            uuid primary key not null default gen_random_uuid(),
  ia_id          text,
  oai_updatedate timestamptz,
  started_date   timestamptz,
  completed_date timestamptz,
  review_date    text,
  url            text,
  docs_link      text
);

create table phono.docs (
  _id        uuid primary key not null default gen_random_uuid(),
  name       text,
  item       uuid,
  pages_link text
);

create table phono.pages (
  _id            uuid primary key not null default gen_random_uuid(),
  doc            uuid,
  page           numeric,
  page_angle     numeric,
  sharpness      numeric,
  is_blank       text,
  text_margin_px numeric,
  url            text
);
```

You can then start a worker with:

```sh
uv run main.py --database <DATABASE URL>
```
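
Once workers have populated `phono.pages`, the per-page columns can drive
targeted spot checks. The sketch below shows one way rows might be triaged after
being fetched from the database. It is only an illustration: the thresholds, the
`PageRow` and `review_reasons` names, the assumption that `page_angle` is in
signed degrees, and the assumption that `is_blank` holds `"true"` or `"false"`
are all hypothetical and not confirmed by MicroQA itself.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical thresholds; tune them against known-bad items.
MAX_ABS_ANGLE_DEG = 2.0    # assumes page_angle is in signed degrees
MIN_SHARPNESS = 100.0      # assumes higher sharpness means better focus
MIN_TEXT_MARGIN_PX = 10.0  # assumes small margins indicate over-cropping


@dataclass
class PageRow:
    """Subset of the phono.pages columns used for triage."""
    page: int
    page_angle: Optional[float]
    sharpness: Optional[float]
    is_blank: Optional[str]
    text_margin_px: Optional[float]


def review_reasons(row: PageRow) -> List[str]:
    """Return the reasons a page may deserve a manual look (empty if none)."""
    # Blank pages have no text to measure, so skip them entirely.
    if (row.is_blank or "").strip().lower() == "true":
        return []
    reasons = []
    if row.sharpness is not None and row.sharpness < MIN_SHARPNESS:
        reasons.append("blurry")
    if row.page_angle is not None and abs(row.page_angle) > MAX_ABS_ANGLE_DEG:
        reasons.append("rotated")
    if row.text_margin_px is not None and row.text_margin_px < MIN_TEXT_MARGIN_PX:
        reasons.append("tight crop")
    return reasons


rows = [
    PageRow(1, 0.5, 250.0, "false", 40.0),
    PageRow(2, 90.0, 20.0, "false", 3.0),
    PageRow(3, 90.0, 0.0, "true", 0.0),
]
for row in rows:
    print(row.page, review_reasons(row))
```

Columns left `NULL` (represented here as `None`) are skipped rather than
flagged, so partially analyzed pages do not create false alarms.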

## Test Cases

- Blurry pages: `micro_IA40244209_0984`
- Contrast, page orientation: `micro_IA40244211_2290`
- Crop, low quality fiche: `micro_IA40386420_0689`
- "Bite-sized" SCOTUS doc with multiple viewable files and some blurry pages:
  `micro_IA40386007_0012`