Large Document Processing Strategies - Project Index

Overview

A context window is a finite, expensive resource, and most of any document is irrelevant to a given question. This was the experimentation ground for fixing that: how do you chop a big document up, find the parts that matter, and hand the model only those, without losing the structure that makes the document make sense?

It was a playground more than a product. The point was to compare strategies (different chunkers, different retrieval methods, local versus cloud embeddings) and figure out what actually worked.

Background

The recurring frustration: feeding a model a whole PDF wastes most of the window on text that doesn't help, and naive chunking shreds the document's logic, splitting a sentence in half or burying the one relevant paragraph in noise. The interesting question was whether you could chunk along semantic boundaries instead of arbitrary character counts, and whether local models were good enough to do the embedding and relevance scoring without paying per token.

How It Works

The preprocessor ingests mixed formats (PDF, Word, EPUB, plain text, and web URLs), then offers three chunking strategies. Fixed chunking is the simple baseline: fixed-size token windows with overlap. Recursive chunking respects document structure, splitting on paragraphs, then sentences, then words. Semantic chunking is the interesting one: it embeds each sentence and places a boundary wherever the similarity between adjacent sentences drops below a threshold.
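The two baseline chunkers are simple enough to sketch in a few lines. This is an illustration rather than the project's code: the function names, the default sizes, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions.

```python
def fixed_chunks(text, size=8, overlap=2):
    """Fixed-size windows of `size` tokens; each shares `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    tokens = text.split()  # assumption: whitespace words stand in for model tokens
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Coarsest separator first: paragraphs, then sentences, then words.
SEPARATORS = ["\n\n", ". ", " "]


def recursive_chunks(text, max_tokens=8, separators=SEPARATORS):
    """Split on the coarsest separator, recursing only into pieces that are still too long."""
    if len(text.split()) <= max_tokens or not separators:
        return [text.strip()] if text.strip() else []
    head, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(head) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + head + piece
        if len(candidate.split()) <= max_tokens:
            current = candidate  # greedily merge small neighbours
            continue
        if current:
            chunks.append(current.strip())
            current = ""
        if len(piece.split()) <= max_tokens:
            current = piece
        else:
            # Piece is too long on its own: fall back to the next finer separator.
            # Note: a sentence split out this way loses its trailing ". ".
            chunks.extend(recursive_chunks(piece, max_tokens, rest))
    if current:
        chunks.append(current.strip())
    return chunks
```

Calling fixed_chunks("a b c d e f g h i j", size=4, overlap=1) yields three windows that each repeat the last token of the previous one. The recursive version keeps a short paragraph whole and only drops to sentence or word level when a piece will not fit the budget.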

Semantic chunk break: boundary if cos_sim(sᵢ, sᵢ₊₁) < mean − 1·std
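The boundary rule above can be sketched as follows. It is a minimal illustration, not the project's implementation: it assumes mean and std are computed over the adjacent-sentence similarities of the document being chunked, it uses the population standard deviation, and it takes toy 2-D vectors in place of real sentence embeddings. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def cos_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_breaks(vectors, k=1.0):
    """Indices i where a boundary falls between sentence i and sentence i + 1."""
    sims = [cos_sim(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    if len(sims) < 2:
        return []  # too few pairs to estimate a spread
    # Assumption: mean and std are taken over this document's adjacent similarities.
    threshold = mean(sims) - k * pstdev(sims)
    return [i for i, s in enumerate(sims) if s < threshold]


def semantic_chunks(sentences, vectors, k=1.0):
    """Group sentences into chunks, closing a chunk after each break index."""
    breaks = set(semantic_breaks(vectors, k))
    chunks, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        if i in breaks:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy data: three sentences on one topic, then three on another.
# With real embeddings, the vectors would come from something like
# SentenceTransformer("all-MiniLM-L6-v2").encode(sentences).
sentences = ["A1.", "A2.", "A3.", "B1.", "B2.", "B3."]
vectors = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
chunks = semantic_chunks(sentences, vectors)
```

On the toy data the only similarity that falls more than one standard deviation below the mean is the one between the third and fourth sentences, so the text splits into two chunks at the topic shift. Because the threshold is relative to the document's own statistics, a uniformly on-topic document produces few or no breaks.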

Embeddings run locally on GPU via sentence-transformers (all-MiniLM-L6-v2, all-mpnet-base-v2) with an OpenAI fallback. On top sit semantic and hybrid (semantic + keyword) search, plus entity, theme, and sentiment extraction. A reference implementation pushed it further onto PostgreSQL + pgvector with an MCP server interface.
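One common way to build hybrid search is to blend a vector-similarity score with a keyword score. The source does not say how the two signals were combined here, so the weighted sum, the alpha parameter, and the term-overlap keyword score below are all assumptions, and the vectors are toy values rather than real embeddings.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_score(query, text):
    """Fraction of query terms present in the text (a crude stand-in for a real ranker)."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


def hybrid_search(query, query_vec, chunks, alpha=0.5, top_k=3):
    """Rank (text, vector) chunks by alpha * semantic + (1 - alpha) * keyword score."""
    scored = [
        (alpha * cos_sim(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in chunks
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical corpus of (chunk text, toy embedding) pairs.
corpus = [
    ("pgvector stores embeddings in postgres", [0.9, 0.1]),
    ("semantic chunking splits on topic shifts", [0.2, 0.8]),
    ("invoices are due in thirty days", [0.1, 0.1]),
]
results = hybrid_search("pgvector embeddings", [1.0, 0.0], corpus, top_k=2)
```

Setting alpha to 1.0 reduces this to pure semantic search and 0.0 to pure keyword matching, which makes it easy to compare the retrieval methods side by side on the same chunks.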

Current Status

Archived as a Summer 2025 exploration, and a productive one. It answered its questions about chunking and local embeddings, and the lessons walked straight into two follow-on tools (see Lineage). This page is the record of where those ideas were worked out.

Three chunkers (fixed, recursive, semantic) with local-GPU embeddings.
Semantic + hybrid search, plus entity/theme/sentiment extraction.
pgvector + MCP reference implementation; superseded by its descendants.