QA screening tool for the Democracy's Library project
2026-01-15 21:33:00 +00:00
microqa replace sharpness edge detection with gradient 2026-01-15 21:33:00 +00:00
.gitignore rewrite data fetching into archive_item.py 2025-10-04 18:03:03 -07:00
.python-version init 2025-08-10 12:27:39 -07:00
diagnostics.py improve contrast norm and sharpness measurement 2026-01-15 21:31:55 +00:00
main.py switch from sqlite to phonograph 2026-01-15 21:31:55 +00:00
mise.toml add interchangeable ocr engines 2025-11-07 05:41:18 +00:00
one_off.py rewrite to engine.py 2025-10-04 15:10:10 -07:00
pyproject.toml improve contrast norm and sharpness measurement 2026-01-15 21:31:55 +00:00
README.md rewrite to engine.py 2025-10-04 15:10:10 -07:00
uv.lock improve contrast norm and sharpness measurement 2026-01-15 21:31:55 +00:00

MicroQA

QA assistant for the Internet Archive's microfiche scanning team.

Usage

Analyze page statistics for a single item:

echo 'micro_IA04244212_1665' | uv run main.py | jq

Paste item IDs from the clipboard and summarize all of them. The tr command collapses the input into a single comma-separated line so that the items are summarized in parallel:

pbpaste | tr '\n' ',' | uv run main.py --summarize -workers 4 -v | jq
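On systems without pbpaste, the collapsing step can be sketched in Python. This is a minimal illustration, not part of MicroQA: collapse_ids is a hypothetical helper. Unlike tr, it also drops blank lines and does not leave a trailing comma.

```python
def collapse_ids(text: str) -> str:
    """Join newline-separated item IDs into one comma-separated line.

    Hypothetical helper for illustration; it is not part of MicroQA.
    Surrounding whitespace is trimmed and blank lines are skipped.
    """
    ids = [line.strip() for line in text.splitlines() if line.strip()]
    return ",".join(ids)


if __name__ == "__main__":
    # Sample IDs taken from the Test Cases section of this README.
    pasted = "micro_IA40244209_0984\nmicro_IA40244211_2290\n"
    print(collapse_ids(pasted))
```

The printed line can then be piped into uv run main.py in place of the pbpaste | tr stage.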

Query a pre-populated database for suspect pages:

select   'https://archive.org/details/' || items.id,
         pages.page,
         pages.orientation_match,
         pages.sharpness,
         pages.text_margin_px
from     items
         join pages on pages.item = items.id
where    pages.orientation_match = 0
         or pages.sharpness < 0.07
         or (pages.text_margin_px > -1 and pages.text_margin_px < 50)
order by items.id;
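The query above can be tried from Python against a throwaway database. The sketch below assumes a SQLite database and a simplified schema inferred only from the column names in the query. The real storage layer and schema may differ, and build_demo_db, find_suspect_pages, and the sample rows are hypothetical.

```python
import sqlite3

# Same filter and thresholds as the query above; only the "as url" alias is added.
SUSPECT_PAGES_SQL = """
select   'https://archive.org/details/' || items.id as url,
         pages.page,
         pages.orientation_match,
         pages.sharpness,
         pages.text_margin_px
from     items
         join pages on pages.item = items.id
where    pages.orientation_match = 0
         or pages.sharpness < 0.07
         or (pages.text_margin_px > -1 and pages.text_margin_px < 50)
order by items.id;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table items (id text primary key);
        create table pages (
            item text references items(id),
            page integer,
            orientation_match integer,
            sharpness real,
            text_margin_px integer
        );
        """
    )
    conn.execute("insert into items values ('demo_item')")
    conn.executemany(
        "insert into pages values (?, ?, ?, ?, ?)",
        [
            ("demo_item", 1, 1, 0.20, 120),  # passes every check
            ("demo_item", 2, 0, 0.20, 120),  # orientation mismatch
            ("demo_item", 3, 1, 0.03, 120),  # sharpness below 0.07
            ("demo_item", 4, 1, 0.20, 10),   # text margin under 50 px
            ("demo_item", 5, 1, 0.20, -1),   # margin of -1 is excluded by "> -1"
        ],
    )
    return conn


def find_suspect_pages(conn: sqlite3.Connection) -> list:
    """Return (url, page, orientation_match, sharpness, text_margin_px) rows."""
    return conn.execute(SUSPECT_PAGES_SQL).fetchall()


if __name__ == "__main__":
    for row in find_suspect_pages(build_demo_db()):
        print(row)
```

With the demo rows, pages 2, 3, and 4 are flagged, each for a different condition in the where clause.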

Test Cases

  • Blurry pages: micro_IA40244209_0984
  • Contrast, page orientation: micro_IA40244211_2290
  • Crop, low quality fiche: micro_IA40386420_0689
  • "Bite sized" SCOTUS doc with multiple viewable files and some blurry pages: micro_IA40386007_0012