Recursive Summarizer
September 14, 2023
What it does
Recursive Summarizer turns any PDF into a tight recap.
It extracts every page, slices the text into 3-page chunks, asks GPT-3.5-turbo to summarize each slice, stitches the mini-summaries together, and keeps compressing until the final text fits your target length.
Quick facts
- Python script, one small dependency stack (`pdfplumber`, `openai`)
- Works on scans or born-digital PDFs (OCR quality affects output)
- Defaults: ~3,200-character chunks, 10-page final cap
- Saves both an intermediate dump and a final `summary.txt`
How it works
- Extract: `pdf_to_text` walks every page with `pdfplumber` and returns raw text.
- Chunk: `split_text` splits the corpus into ~3-page slices so the model stays within context.
- Summarize: `get_summary` calls `ChatCompletion` once per slice, writes an intermediate file, and returns the snippet.
- Re-curse: if the stitched output is still too long, the script passes it back to GPT for a tighter pass until it fits.
- Persist: the final summary lands in `summary.txt`. (All five steps are sketched in code below.)
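Pieced together, the flow looks roughly like the sketch below. It assumes the pre-1.0 `openai` SDK (matching the `ChatCompletion` reference above); the function names mirror the steps, but the bodies and constants are illustrative guesses, not the repo's actual code.

```python
# Minimal sketch of the pipeline; constants and bodies are assumptions.
import pdfplumber
import openai  # expects OPENAI_API_KEY in the environment

CHUNK_LENGTH = 3200     # ~3 pages' worth of characters
TARGET_LENGTH = 10_000  # rough stand-in for the 10-page cap

def pdf_to_text(path: str) -> str:
    """Extract: walk every page with pdfplumber, concatenate raw text."""
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def split_text(text: str, size: int = CHUNK_LENGTH) -> list[str]:
    """Chunk: slice the corpus so each piece fits the model's context."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def get_summary(chunk: str) -> str:
    """Summarize: one ChatCompletion call per slice.
    (The real script also appends each snippet to an intermediate dump.)"""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Summarize:\n\n{chunk}"}],
    )
    return response["choices"][0]["message"]["content"]

def summarize(path: str) -> str:
    text = pdf_to_text(path)
    # Re-curse: keep stitching and compressing until the target is met.
    while len(text) > TARGET_LENGTH:
        text = "\n".join(get_summary(chunk) for chunk in split_text(text))
    with open("summary.txt", "w", encoding="utf-8") as f:  # Persist
        f.write(text)
    return text
```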
```bash
# one-liner demo
python recursive_summarizer.py --pdf ~/Docs/whitepaper.pdf
```
Tune it
| Variable | What it controls |
| --- | --- |
| `max_page_length` | Rough characters per PDF page |
| `chunk_length` | Characters sent to GPT in one go (`max_page_length * 3` by default) |
| `target_length` | Max size of the final summary |
Adjust these three numbers to balance cost, speed, and detail.
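For concreteness, here's how those knobs might sit at the top of the script. The names come from the table; the specific values are assumptions back-solved from the ~3,200-character default mentioned above:

```python
# Illustrative defaults only; only the relationships are stated in the post.
max_page_length = 1067                # rough characters per PDF page (assumed)
chunk_length = max_page_length * 3    # ~3,200 characters per GPT call
target_length = max_page_length * 10  # final summary capped at ~10 pages
```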
Caveats
- Output quality hinges on the PDF's text layer; bad scans in → garbled text out.
- Hard limits can trim nuance if you aim for ultra-short summaries.
- Each GPT call costs tokens; monster PDFs may run up your bill.
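To gauge the bill before running, here's a rough call-count model. The per-pass compression ratio is an assumption, not something measured from the tool:

```python
import math

def estimate_calls(total_chars: int, chunk_length: int = 3200,
                   target_length: int = 10_000,
                   compression: float = 0.2) -> int:
    """Rough call count: each pass summarizes every chunk, then the
    stitched text shrinks by an assumed factor before the next pass."""
    calls = 0
    while total_chars > target_length:
        calls += math.ceil(total_chars / chunk_length)
        total_chars = int(total_chars * compression)
    return calls

# A ~250-page "monster PDF" (~500k characters) needs roughly 200 calls.
print(estimate_calls(500_000))
```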
Get it
Source and MIT license live here: https://github.com/patrickmaub/recursive-summarizer
Clone, add your OPENAI_API_KEY, point at a PDF, and watch it shrink.
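In shell form, mirroring the demo above (the install line assumes the two PyPI packages from Quick facts, and the repo layout is assumed):

```bash
git clone https://github.com/patrickmaub/recursive-summarizer
cd recursive-summarizer
pip install pdfplumber openai
export OPENAI_API_KEY="sk-..."  # your key here
python recursive_summarizer.py --pdf ~/Docs/whitepaper.pdf
```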