Overview
Dumping a whole PDF into a context window is wasteful, and the PDF often doesn't fit. This tool preprocesses a large PDF into a Claude-friendly package: a compressed copy, all the extracted text with page markers, contact-sheet thumbnail grids (~30 pages each), and individual page images. Claude skims the text and contact sheets to find what matters, then pulls specific page images for the deep read, using a fraction of the context a page-by-page pass would burn.
It's both a CLI (pdf-prep) and an MCP server, so it works as a batch preprocessor or as a live tool inside a Claude session.
How It Works
The pipeline produces four things from one PDF:
- Compressed PDF: a smaller copy of the original (pymupdf, or Ghostscript for heavier compression).
- Text file: all extracted text with page markers, so it's searchable and greppable.
- Contact sheets: thumbnail overview grids showing ~30 labeled pages each.
- Page images: individual JPGs for selective deep-dives.
Scanned/image-only PDFs get OCR (Tesseract, or the ocrmypdf pipeline when available).
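The contact-sheet grouping and the OCR decision can be sketched in a few lines of plain Python. This is a minimal illustration, not the tool's actual code: the function names and the `min_chars` threshold are hypothetical, and the "too little extracted text means scanned" heuristic is an assumption about how image-only pages might be detected.

```python
def contact_sheet_ranges(page_count, pages_per_sheet=30):
    """Return 1-based inclusive (first, last) page ranges, one per contact sheet."""
    if page_count < 0 or pages_per_sheet < 1:
        raise ValueError("page_count must be >= 0 and pages_per_sheet >= 1")
    return [
        (start, min(start + pages_per_sheet - 1, page_count))
        for start in range(1, page_count + 1, pages_per_sheet)
    ]


def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is nearly empty.

    min_chars is a hypothetical threshold, not a value taken from the tool.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]
```

With 30 pages per sheet, a 65-page document yields three sheets, the last covering only pages 61 to 65.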
On the MCP side, the same workflow surfaces as five tools (prepare_pdf, list_prepared_documents, get_contact_sheets, get_page_image, and get_document_text, optionally filtered by page range), which map directly onto "skim, then dive."
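To make the page-range filter concrete, here is a rough sketch of how page-marked text could be built and then sliced by page. The marker format (`=== Page N ===`) and both function names are assumptions for illustration; the real text file and the real get_document_text tool may use a different marker and a different signature.

```python
import re

# Hypothetical marker format; the tool's actual page markers may differ.
MARKER = "=== Page {n} ==="
MARKER_RE = re.compile(r"^=== Page (\d+) ===$", re.MULTILINE)


def join_pages(page_texts):
    """Concatenate per-page text, putting a marker line before each page."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(MARKER.format(n=number))
        parts.append(text.strip())
    return "\n".join(parts) + "\n"


def filter_pages(marked_text, first, last):
    """Return the marked text for pages first..last (1-based, inclusive)."""
    matches = list(MARKER_RE.finditer(marked_text))
    kept = []
    for i, match in enumerate(matches):
        page = int(match.group(1))
        if first <= page <= last:
            # A page runs from its marker to the next marker (or end of text).
            end = matches[i + 1].start() if i + 1 < len(matches) else len(marked_text)
            kept.append(marked_text[match.start():end])
    return "".join(kept)
```

Because the markers stay in the filtered output, a slice still shows which page each passage came from. One limitation of this sketch: a page whose own text contains a line that looks exactly like a marker would be split incorrectly.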
    # CLI: run the full pipeline on one PDF
    pdf-prep process document.pdf -o ./output --dpi 200

    # MCP: prepare, skim contact sheets, then dive on specific pages
    prepare_pdf -> get_contact_sheets -> get_page_image / get_document_text
Current Status
Active: this is the tool I actually reach for when a PDF needs to go into Claude, and the MCP server is wired up and running. CLI and MCP layers both work; optional dependencies (Tesseract, Ghostscript, ocrmypdf) cover the OCR and heavier-compression cases.
- Four-output pipeline: compressed PDF, page-marked text, contact sheets, page images.
- Five MCP tools mapping onto the index-and-dive workflow.
- Batch mode for whole folders of PDFs; per-document infocheck.
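For scripting around the CLI, a folder of PDFs can also be handled by repeating the single-document command shown earlier. The sketch below only builds the argument lists; it does not use the tool's own batch mode, whose flags are not documented here. The helper name and the one-output-folder-per-PDF layout are assumptions.

```python
from pathlib import Path


def build_commands(pdf_paths, out_root, dpi=200):
    """Build one `pdf-prep process` argument list per PDF.

    Each document gets its own output folder named after the file stem
    (an assumed layout, chosen here to keep outputs from colliding).
    """
    out_root = Path(out_root)
    commands = []
    for pdf in sorted(Path(p) for p in pdf_paths):
        commands.append([
            "pdf-prep", "process", str(pdf),
            "-o", str(out_root / pdf.stem),
            "--dpi", str(dpi),
        ])
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=True)`, with the paths gathered by something like `Path("papers").glob("*.pdf")`.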