pdf-preparer: Project Index

Overview

Dumping a whole PDF into a context window is wasteful, and large PDFs often don't fit at all. pdf-preparer preprocesses a large PDF into a Claude-friendly package: a compressed copy, all the extracted text with page markers, contact-sheet thumbnail grids (~30 pages each), and individual page images. Claude skims the text and contact sheets to find what matters, then pulls specific page images for the deep read, using a fraction of the context a page-by-page pass would burn.

It's both a CLI (pdf-prep) and an MCP server, so it works as a batch preprocessor or as a live tool inside a Claude session.

How It Works

The pipeline produces four things from one PDF:

Compressed PDF: a smaller copy of the original (pymupdf, or Ghostscript for heavier compression).
Text file: all extracted text with page markers, so it's searchable and greppable.
Contact sheets: thumbnail overview grids showing ~30 labeled pages each.
Page images: individual JPGs for selective deep-dives.
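
To make the contact-sheet step concrete, here is a minimal sketch of how pages could be grouped into sheets of roughly 30. The function names and the fixed `per_sheet` default are assumptions for illustration, not the tool's actual code; the real grouping and grid layout may differ.

```python
from typing import List, Tuple


def contact_sheet_ranges(total_pages: int, per_sheet: int = 30) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into consecutive (first, last) ranges, one per sheet.

    Hypothetical helper: per_sheet=30 mirrors the "~30 pages each" figure above.
    """
    if total_pages < 0 or per_sheet < 1:
        raise ValueError("total_pages must be >= 0 and per_sheet must be >= 1")
    return [
        (first, min(first + per_sheet - 1, total_pages))
        for first in range(1, total_pages + 1, per_sheet)
    ]


def sheet_for_page(page: int, per_sheet: int = 30) -> int:
    """Return the 1-based index of the contact sheet that shows a 1-based page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) // per_sheet + 1


if __name__ == "__main__":
    # A 95-page PDF needs four sheets; the last one is partially filled.
    print(contact_sheet_ranges(95))  # [(1, 30), (31, 60), (61, 90), (91, 95)]
    print(sheet_for_page(31))        # 2
```

Knowing which sheet a page lands on is what lets a reader go from "something interesting on sheet 2" to a specific page number to pull as a full image.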

Scanned/image-only PDFs get OCR (Tesseract, or the ocrmypdf pipeline when available). On the MCP side the same workflow surfaces as five tools: prepare_pdf, list_prepared_documents, get_contact_sheets, get_page_image, and get_document_text (optionally filtered by page range). Together they map directly onto "skim, then dive."
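
The page markers are what make a page-range filter on the text cheap. The sketch below shows one way that could work. The marker format (`--- Page N ---`) and both function names are assumptions chosen for illustration; the tool's real marker format and the internals of get_document_text are not documented here.

```python
import re
from typing import List, Optional

# Assumed marker format; the real tool may write its page markers differently.
MARKER = "--- Page {n} ---"
MARKER_RE = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)


def join_pages(page_texts: List[str]) -> str:
    """Concatenate per-page text, putting a marker line before each page."""
    parts = []
    for n, text in enumerate(page_texts, start=1):
        parts.append(MARKER.format(n=n))
        parts.append(text.strip())
    return "\n".join(parts) + "\n"


def filter_pages(document: str, first: Optional[int] = None,
                 last: Optional[int] = None) -> str:
    """Keep only pages whose number lies in [first, last]; None means unbounded."""
    # Splitting on a pattern with one capture group yields:
    # [preamble, "1", body1, "2", body2, ...]
    pieces = MARKER_RE.split(document)
    kept = []
    for i in range(1, len(pieces) - 1, 2):
        n = int(pieces[i])
        if (first is None or n >= first) and (last is None or n <= last):
            kept.append(MARKER.format(n=n) + pieces[i + 1].rstrip("\n"))
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    doc = join_pages(["alpha", "beta", "gamma"])
    print(filter_pages(doc, first=2, last=3))
```

Because each marker sits on its own line, the same file stays greppable: a plain-text search finds a hit, and the nearest marker above it gives the page to request as an image.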

# CLI: run the full pipeline on one PDF
pdf-prep process document.pdf -o ./output --dpi 200

# MCP: prepare, skim contact sheets, then dive on specific pages
prepare_pdf  ->  get_contact_sheets  ->  get_page_image / get_document_text

Current Status

Active: this is the tool I actually reach for when a PDF needs to go into Claude, and the MCP server is wired up and running. CLI and MCP layers both work; optional dependencies (Tesseract, Ghostscript, ocrmypdf) cover the OCR and heavier-compression cases.

Four-output pipeline: compressed PDF, page-marked text, contact sheets, page images.
Five MCP tools mapping onto the index-and-dive workflow.
Batch mode for whole folders of PDFs; per-document info check.
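
The batch command's own syntax isn't shown in this index, so as a rough equivalent, the sketch below loops the single-file invocation from above (`pdf-prep process <file> -o <dir> --dpi <n>`) over a set of PDFs. The helper name `build_commands` and the one-output-subfolder-per-PDF layout are assumptions, not the tool's documented behavior.

```python
from pathlib import Path
from typing import Iterable, List


def build_commands(pdfs: Iterable[Path], out_root: Path, dpi: int = 200) -> List[List[str]]:
    """Build one `pdf-prep process` argv list per PDF, in sorted order.

    Hypothetical layout: each PDF gets its own subfolder under out_root,
    named after the file stem, so outputs from different PDFs don't collide.
    """
    commands = []
    for pdf in sorted(pdfs):
        out_dir = out_root / pdf.stem
        commands.append(
            ["pdf-prep", "process", str(pdf), "-o", str(out_dir), "--dpi", str(dpi)]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; pass each list to
    # subprocess.run(cmd, check=True) to actually execute the pipeline.
    for cmd in build_commands(Path("./pdfs").glob("*.pdf"), Path("./output")):
        print(" ".join(cmd))
```

Building argv lists rather than shell strings avoids quoting problems with file names that contain spaces.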