Key Takeaways: Markdown has become the de facto input format for LLM-based agents because it preserves structure without introducing noise. PDF is the hardest common format to convert cleanly: its 30-year history of layout-first design actively works against you. Converting a PDF to Markdown is not a one-liner: the right tool depends on whether the PDF has a text layer, and post-processing to strip headers and footers is a required step, not an optional cleanup. For most PDFs, pymupdf4llm outperforms markitdown; Claude Vision is the last resort.
Why This Series Exists
Agents don’t read files. They read text. More precisely, they read token sequences, and the structure of that sequence matters as much as its content. Feed an agent a raw dump of a Word document and it will struggle to distinguish a heading from a paragraph, a table cell from a footnote, a numbered list item from a prose sentence. That confusion propagates through every reasoning step that follows.
Markdown solves this cleanly. It is plain text with a lightweight, unambiguous structure signal. A ## heading is a heading. A | table is a table. A ` block is code. There is no presentation layer to strip, no embedded binary to decode, no XML namespace to parse. An agent reading a well-formed Markdown document spends its attention on content, not format.
This series covers the conversion of common file formats to Markdown: what the tools are, where they fail, and how to build a reliable pipeline for each. We start with PDF because it is the most widely used document format in enterprise workflows and the hardest to convert correctly.
PDF: 30 Years of Layout-First Design
PDF was created by Adobe in 1993, originally as a way to share documents that would look identical regardless of the operating system, fonts, or printer attached. The key design decision was to describe pages as visual layouts — precise positions of glyphs on a coordinate plane — rather than as structured content. A PDF does not inherently know that a line of text in a larger font is a heading. It knows that a string of characters was placed at coordinates (x, y) with a certain font size and weight.
This decision made PDF excellent for print fidelity and terrible for machine readability.
Over the years, the format expanded significantly:
- PDF 1.0–1.3 (1993–1999): basic layout model, no structure semantics
- PDF 1.4–1.6 (2001–2005): transparency, layers, embedded files, JavaScript; PDF 1.4 also introduced Tagged PDF, which adds a logical structure tree on top of the visual layer, but tagging is optional and rarely present in practice
- PDF 1.7 / ISO 32000-1 (2008): standardised by ISO, with tagging still optional
- PDF/A (ISO 19005, 2005–present): archival subset; mandates embedded fonts and prohibits encryption, but requires structural tagging only at conformance level A (for example PDF/A-1a), not at the more common level B
- PDF 2.0 / ISO 32000-2 (2017): improved accessibility support, mandatory tagging guidelines — but adoption is still limited
The consequence is that most PDFs in the wild today — scanned invoices, exported reports, government forms — carry no structural metadata. They are images of text, or streams of glyphs without paragraph or heading markers. A converter has to reconstruct that structure from visual heuristics: font size, position, whitespace, line breaks.
That reconstruction is where tools diverge.
The Three-Tier Tool Landscape
markitdown — Fast, Text-Layer Only
markitdown (Microsoft, available as an MCP server) uses pdfminer.six as its PDF backend. pdfminer.six extracts the character stream from the PDF’s content layer and reassembles it into text. It is fast, pure Python, and requires no external dependencies.
The limitation is fundamental: it produces flat text with no heading markers whatsoever. If the PDF has no text layer — which is the case for any scanned document, any image-based export — it returns nothing or garbage. It cannot OCR.
Use markitdown when you have clean, text-layer PDFs and you only need the raw content without structure. For anything more complex, it falls short.
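Whether a PDF has a usable text layer can also be checked in code rather than by hand. The following is a minimal sketch, not part of either tool: the function names and both thresholds are assumptions chosen for illustration. The decision logic is kept separate from the PyMuPDF call so it can be reasoned about on its own.

```python
from typing import Iterable


def has_text_layer(chars_per_page: Iterable[int], min_chars: int = 20,
                   min_fraction: float = 0.5) -> bool:
    """Return True if enough pages carry extractable text.

    min_chars and min_fraction are illustrative thresholds, not tuned values.
    """
    counts = list(chars_per_page)
    if not counts:
        return False
    pages_with_text = sum(1 for n in counts if n >= min_chars)
    return pages_with_text / len(counts) >= min_fraction


def count_chars_per_page(pdf_path: str) -> list[int]:
    """Count non-whitespace characters on each page using PyMuPDF."""
    # Imported lazily so has_text_layer() works without PyMuPDF installed.
    # Older PyMuPDF releases expose the same module as `fitz`.
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        return [len("".join(page.get_text().split())) for page in doc]
```

A call such as has_text_layer(count_chars_per_page('document.pdf')) returning False suggests a scanned or image-based PDF, for which markitdown is unlikely to produce anything useful.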
pymupdf4llm — Structure Detection with OCR Fallback
pymupdf4llm uses PyMuPDF as its base and adds a heading-detection layer that reads font metrics from the PDF’s internal glyph data. Lines rendered in a larger font become ## headings. It also falls back to Tesseract OCR when no text layer is present, making it usable on scanned documents.
The OCR fallback has a significant limitation: once you are in OCR territory, font size information is gone — all headings land at ## because the OCR engine cannot infer visual weight from pixels reliably. You get a flat heading hierarchy rather than a true # / ## / ### structure.
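To make the font-metric idea concrete, here is a simplified sketch of size-based heading detection. It is not pymupdf4llm's actual algorithm; the function name, the input shape (text and font size pairs) and the rule "most common size is body text" are all assumptions for illustration.

```python
from collections import Counter


def lines_to_markdown(lines: list[tuple[str, float]]) -> str:
    """Turn (text, font_size) pairs into Markdown.

    The most common font size is treated as body text; each distinct larger
    size becomes a heading level, largest first (#, ##, ###, capped at ###).
    """
    if not lines:
        return ""
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    heading_sizes = sorted({s for _, s in lines if s > body_size}, reverse=True)
    level_for = {s: min(i + 1, 3) for i, s in enumerate(heading_sizes)}

    out = []
    for text, size in lines:
        if size in level_for:
            out.append("#" * level_for[size] + " " + text.strip())
        else:
            out.append(text.strip())
    return "\n".join(out)
```

The sketch also shows why OCR output loses hierarchy: without trustworthy per-line font sizes there is nothing to rank, so a heuristic like this has no basis for telling one heading level from another.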
For PDFs with a text layer, pymupdf4llm produces noticeably better output than markitdown: you get actual heading markers, and the structural reconstruction is solid. It runs in an isolated virtualenv (system Python is protected on modern systems):
uv venv && uv pip install pymupdf4llm
uv run python -c "
import pymupdf4llm
md = pymupdf4llm.to_markdown('document.pdf')
with open('output.md', 'w', encoding='utf-8') as f:
    f.write(md)
"
Claude Vision — Last Resort for Full Hierarchy
When OCR is involved and heading hierarchy matters, Claude Vision can read the rendered page as an image and infer heading levels from visual weight — something no text-extraction tool can do reliably. It produces the most structurally accurate output.
The cost is real: each page is sent as an image, and token consumption scales linearly with page count. A 50-page document through Claude Vision costs roughly 10–20× what pymupdf4llm costs. Reserve it for documents where structural fidelity is a hard requirement and page count is manageable.
Tool Comparison
| Tool | Backend | Requires text layer? | Heading detection | Relative cost |
|---|---|---|---|---|
| markitdown | pdfminer.six | Yes | None (flat text) | Low |
| pymupdf4llm | PyMuPDF + Tesseract | No (OCR fallback) | ## only (flat hierarchy on OCR) | Low–Medium |
| Claude Vision | Vision API | No | Full #/##/### hierarchy | High |
The Part Everyone Skips: Post-Conversion Cleaning
PDF footers — page numbers, company addresses, horizontal rules, legal disclaimers — are extracted as regular text. They appear once per page throughout your Markdown output. An agent reading that document will encounter the footer noise repeatedly, mixed into the content.
Cleaning is not optional. It is a required step in any production pipeline.
Standalone page numbers are the easy case:
sed -i '/^\s*[0-9]\+\s*$/d' output.md
# verify:
grep -c '^\s*[0-9]\+\s*$' output.md # must return 0
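The same cleanup and check can be done in Python, which avoids differences between sed implementations. This is a sketch that mirrors the commands above; the function names are hypothetical.

```python
import re

# A line that contains only digits, optionally surrounded by whitespace.
PAGE_NUMBER_LINE = re.compile(r"^\s*[0-9]+\s*$")


def strip_page_numbers(markdown: str) -> str:
    """Drop lines that consist solely of a number."""
    kept = [ln for ln in markdown.split("\n") if not PAGE_NUMBER_LINE.match(ln)]
    return "\n".join(kept)


def count_page_numbers(markdown: str) -> int:
    """Mirror `grep -c`: how many standalone-number lines remain."""
    return sum(1 for ln in markdown.split("\n") if PAGE_NUMBER_LINE.match(ln))
```

As with the sed version, any legitimate line that holds nothing but a number (a lone figure from a table, for instance) is removed too, so it is worth checking count_page_numbers on the raw text before stripping.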
Complex footers — a separator line followed by a company name and URL — need a regex. The pattern is always the same: identify what the footer looks like in the raw output (grep for a distinctive string), then remove it:
import re
with open('output.md', encoding='utf-8') as f:
    text = f.read()
# Example: footer pattern from a real project
# (a run of 10+ underscores, then the company name, then the URL)
text = re.sub(r'_{10,}.*?Company Name.*?company\.com\n?', '', text, flags=re.DOTALL)
with open('output.md', 'w', encoding='utf-8') as f:
    f.write(text)
The general pipeline is:
- Convert PDF → raw Markdown
- Grep for the repeating footer pattern
- Strip it with sed or Python re.sub
- Verify with grep -c <pattern> returning 0
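The "grep for the repeating footer pattern" step can be partly automated by counting lines that recur across the document. The following is one possible sketch, assuming footers repeat as identical single lines; the function names and the min_repeats default are illustrative, not taken from any tool.

```python
from collections import Counter


def repeated_lines(markdown: str, min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (line, count) for non-blank lines seen at least min_repeats times.

    min_repeats is an illustrative default; a footer normally repeats about
    once per page, so a value near the page count is a stricter filter.
    """
    counts = Counter(
        line.strip() for line in markdown.splitlines() if line.strip()
    )
    return [(line, n) for line, n in counts.most_common() if n >= min_repeats]


def remove_lines(markdown: str, unwanted: set[str]) -> str:
    """Remove every line whose stripped text is in `unwanted`."""
    kept = [ln for ln in markdown.split("\n") if ln.strip() not in unwanted]
    return "\n".join(kept)
```

The candidates should be reviewed by hand before removal: legitimate content such as table separator rows or recurring subheadings repeats as well. Running repeated_lines again on the cleaned text serves the same purpose as the grep -c check.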
Choosing the Right Tool
The decision tree is short:
- Does the PDF have a selectable text layer? (Open it in a viewer and try to select text.) If yes, use pymupdf4llm. If structural hierarchy doesn’t matter, markitdown is faster.
- Is the PDF image-based (scanned)? Use pymupdf4llm for its OCR fallback. Accept flat heading hierarchy.
- Is heading hierarchy critical and page count under ~30? Use Claude Vision.
- Is heading hierarchy critical and page count high? Consider post-processing heuristics first: promote lines matching ^\d+\.\s to ##, ^\d+\.\d+\.\s to ###. It covers the most common cases at zero extra cost.
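The numbering heuristic from the last item can be sketched in a few lines. The two regular expressions are the ones quoted above; the function name is hypothetical.

```python
import re

# "1.1. Scope" style lines become ###, "1. Introduction" style lines become ##.
# The patterns cannot both match the same line, so the order of checks is
# only a matter of readability.
SUBSECTION = re.compile(r"^\d+\.\d+\.\s")
SECTION = re.compile(r"^\d+\.\s")


def promote_numbered_headings(markdown: str) -> str:
    """Prefix numbered section lines with Markdown heading markers."""
    out = []
    for line in markdown.split("\n"):
        if SUBSECTION.match(line):
            out.append("### " + line)
        elif SECTION.match(line):
            out.append("## " + line)
        else:
            out.append(line)
    return "\n".join(out)
```

One caveat: an ordinary ordered-list item such as "1. First step" also matches the section pattern, so this fits documents whose numbered lines are section titles and should be spot-checked on anything else.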
With either markitdown or pymupdf4llm, conversions of separate PDFs can run in parallel: each file is processed independently, so there is no shared state to worry about.
We Packaged This Into a Skill
Everything above — tool selection, OCR fallback, footer cleaning, verification — is now packaged into a public agent skill: /utils:pdf.
Install it from the Trobz public skills repository and invoke it directly from your agent session. Point it at a PDF and it picks the right tool, runs the conversion, strips repeating footer patterns, and hands you clean Markdown ready to feed into your agent pipeline.