Convert Everything to Markdown for AI Agents — Part 1: PDF Files

Agents don't read PDFs. They read text. Here's what happens when you feed them one, and how to convert PDFs to Markdown without silently losing structure.

Key Takeaways: Markdown has become the de facto input format for LLM-based agents because it preserves structure without introducing noise. PDF is the hardest common format to convert cleanly — its 30-year history of layout-first design actively works against you. Converting a PDF to Markdown is not a one-liner: the right tool depends on whether the PDF has a text layer, and post-processing to strip headers and footers is a required step, not an optional cleanup. For most PDFs, pymupdf4llm outperforms markitdown; Claude Vision is the last resort.

Why This Series Exists

Agents don’t read files. They read text. More precisely, they read token sequences — and the structure of that sequence matters as much as its content. Feed an agent a raw dump of a Word document and it will struggle to distinguish a heading from a paragraph, a table cell from a footnote, a numbered list item from a prose sentence. That confusion propagates through every reasoning step that follows.

Markdown solves this cleanly. It is plain text with a lightweight, unambiguous structure signal. A ## heading is a heading. A | table is a table. A ``` fence is code. There is no presentation layer to strip, no embedded binary to decode, no XML namespace to parse. An agent reading a well-formed Markdown document spends its attention on content, not format.

This series covers the conversion of common file formats to Markdown: what the tools are, where they fail, and how to build a reliable pipeline for each. We start with PDF because it is the most widely used document format in enterprise workflows and the hardest to convert correctly.

PDF: 30 Years of Layout-First Design

PDF was created by Adobe in 1993, originally as a way to share documents that would look identical regardless of the operating system, fonts, or printer attached. The key design decision was to describe pages as visual layouts — precise positions of glyphs on a coordinate plane — rather than as structured content. A PDF does not inherently know that a line of text in a larger font is a heading. It knows that a string of characters was placed at coordinates (x, y) with a certain font size and weight.

This decision made PDF excellent for print fidelity and terrible for machine readability.

Over the years, the format expanded significantly:

PDF 1.0–1.3 (1993–1999): basic layout model, no structure semantics
PDF 1.4–1.6 (2001–2005): transparency, layers, embedded files, JavaScript; PDF 1.4 also introduced Tagged PDF, which adds a logical structure tree on top of the visual layer, but tagging is optional and rarely present in practice
PDF 1.7 / ISO 32000-1 (2008): the format standardised by ISO; tagging remains optional
PDF/A (ISO 19005, 2005–present): archival subset; mandates embedded fonts and prohibits encryption, but only its "a" conformance levels require structural tagging
PDF 2.0 / ISO 32000-2 (2017): improved tagging and accessibility support, but tagging is still not mandatory and adoption is limited

The consequence is that most PDFs in the wild today — scanned invoices, exported reports, government forms — carry no structural metadata. They are images of text, or streams of glyphs without paragraph or heading markers. A converter has to reconstruct that structure from visual heuristics: font size, position, whitespace, line breaks.
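To make "visual heuristics" concrete, here is a minimal sketch of the font-size part of that reconstruction. It assumes the glyph stream has already been grouped into lines of `(text, font_size)` pairs; the function name `infer_headings` and the three-level cap are our own illustrative choices, not how any particular converter works.

```python
from collections import Counter


def infer_headings(lines, max_levels=3):
    """Turn (text, font_size) pairs into Markdown using font size alone.

    Illustrative heuristic: the size that carries the most characters is
    treated as body text; larger sizes become headings, largest first.
    """
    if not lines:
        return ""

    # Weight each font size by how many characters are set in it.
    weights = Counter()
    for text, size in lines:
        weights[round(size, 1)] += len(text)
    body_size = weights.most_common(1)[0][0]

    # Distinct sizes above body size, largest first, capped at max_levels.
    heading_sizes = sorted(
        {round(size, 1) for _, size in lines if round(size, 1) > body_size},
        reverse=True,
    )[:max_levels]
    level_of = {size: index + 1 for index, size in enumerate(heading_sizes)}

    out = []
    for text, size in lines:
        level = level_of.get(round(size, 1))
        out.append(f"{'#' * level} {text}" if level else text)
    return "\n".join(out)


sample = [
    ("Annual Report", 24.0),
    ("Overview", 16.0),
    ("Revenue grew in every region this year.", 10.0),
    ("Details", 16.0),
    ("Costs stayed flat across the period.", 10.0),
]
result = infer_headings(sample)
```

Even this toy version shows the fragility: a pull quote or a table caption set in a larger font would be promoted to a heading, which is why real converters combine several signals.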

That reconstruction is where tools diverge.

The Three-Tier Tool Landscape

markitdown — Fast, Text-Layer Only

markitdown (Microsoft, available as an MCP server) uses pdfminer.six as its PDF backend. pdfminer.six extracts the character stream from the PDF’s content layer and reassembles it into text. It is fast, pure Python, and requires no external dependencies.

The limitation is fundamental: it produces flat text with no heading markers whatsoever. If the PDF has no text layer — which is the case for any scanned document, any image-based export — it returns nothing or garbage. It cannot OCR.

Use markitdown when you have clean, text-layer PDFs and you only need the raw content without structure. For anything more complex, it falls short.
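Because a missing text layer fails quietly, it can help to check for one before choosing a tool. The sketch below is one way to do that, assuming PyMuPDF is installed and importable as `pymupdf`; the helper names and the thresholds (20 characters per page, half of the pages) are arbitrary placeholders to tune against your own documents.

```python
def has_text_layer(page_texts, min_chars_per_page=20, min_fraction=0.5):
    """Return True if enough pages yield a meaningful amount of text.

    page_texts is a list with one extracted-text string per page.
    Both thresholds are illustrative assumptions.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_fraction


def extract_page_texts(pdf_path):
    """Extract plain text per page (assumes PyMuPDF is installed)."""
    import pymupdf  # imported lazily so the check above has no dependency

    doc = pymupdf.open(pdf_path)
    try:
        return [page.get_text() for page in doc]
    finally:
        doc.close()
```

A call such as `has_text_layer(extract_page_texts("document.pdf"))` returning False suggests a scanned or image-based PDF that needs OCR rather than plain text extraction.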

pymupdf4llm — Structure Detection with OCR Fallback

pymupdf4llm uses PyMuPDF as its base and adds a heading-detection layer that reads font metrics from the PDF’s internal glyph data. Lines rendered in a larger font become headings, with the level derived from the font size. It also falls back to Tesseract OCR when no text layer is present, making it usable on scanned documents.

The OCR fallback has a significant limitation: once you are in OCR territory, font size information is gone — all headings land at ## because the OCR engine cannot infer visual weight from pixels reliably. You get a flat heading hierarchy rather than a true # / ## / ### structure.

For PDFs with a text layer, pymupdf4llm produces noticeably better output than markitdown: you get actual heading markers, and the structural reconstruction is solid. Install it in an isolated virtualenv, because many modern systems mark the system Python as externally managed and refuse direct pip installs:

uv venv && uv pip install pymupdf4llm

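uv run python -c "
import pymupdf4llm
# Convert the whole document to a single Markdown string
md = pymupdf4llm.to_markdown('document.pdf')
# Write UTF-8 explicitly, since the platform default encoding may not cover every character
with open('output.md', 'w', encoding='utf-8') as f:
    f.write(md)
"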

Claude Vision — Last Resort for Full Hierarchy

When OCR is involved and heading hierarchy matters, Claude Vision can read the rendered page as an image and infer heading levels from visual weight — something no text-extraction tool can do reliably. It produces the most structurally accurate output.

The cost is real: each page is sent as an image, and token consumption scales linearly with page count. A 50-page document through Claude Vision costs roughly 10–20× what pymupdf4llm costs. Reserve it for documents where structural fidelity is a hard requirement and page count is manageable.
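The linear scaling is easy to put into numbers. The sketch below uses the rule of thumb we recall from Anthropic's vision documentation, roughly width × height / 750 input tokens per image; treat that formula as an assumption to verify. The page dimensions and the price per million tokens are placeholders, not real quotes. It counts image input tokens only and ignores the output tokens spent generating the Markdown.

```python
def estimate_image_tokens(width_px, height_px):
    """Approximate input tokens for one page image (assumed rule of thumb)."""
    return (width_px * height_px) // 750


def estimate_document_cost(page_count, width_px, height_px,
                           usd_per_million_input_tokens):
    """Return (total_tokens, usd) for sending every page as an image.

    usd_per_million_input_tokens is a caller-supplied placeholder;
    look up the current price for the model you actually use.
    """
    total_tokens = page_count * estimate_image_tokens(width_px, height_px)
    usd = total_tokens / 1_000_000 * usd_per_million_input_tokens
    return total_tokens, usd


# Hypothetical inputs: 50 pages rendered at 750x1000 px, 3.0 USD per million tokens.
tokens, usd = estimate_document_cost(50, 750, 1000, 3.0)
```

Doubling the page count doubles both numbers, so the estimate is mainly useful for deciding where the page-count cutoff for this tier should sit.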

Tool Comparison

| Tool | Backend | Requires text layer? | Heading detection | Relative cost |
|---|---|---|---|---|
| `markitdown` | pdfminer.six | Yes | None (flat text) | Low |
| `pymupdf4llm` | PyMuPDF + Tesseract | No (OCR fallback) | From font size with a text layer; `##` only on OCR | Low–Medium |
| Claude Vision | Vision API | No | Full `#`/`##`/`###` hierarchy | High |

The Part Everyone Skips: Post-Conversion Cleaning

PDF footers — page numbers, company addresses, horizontal rules, legal disclaimers — are extracted as regular text. They appear once per page throughout your Markdown output. An agent reading that document will encounter the footer noise repeatedly, mixed into the content.

Cleaning is not optional. It is a required step in any production pipeline.

Standalone page numbers are the easy case:

# GNU sed syntax; on macOS/BSD use: sed -i '' -E '/^[[:space:]]*[0-9]+[[:space:]]*$/d' output.md
sed -i '/^\s*[0-9]\+\s*$/d' output.md
# verify:
grep -c '^\s*[0-9]\+\s*$' output.md   # must print 0
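If the cleaning step lives inside a Python pipeline rather than a shell script, the same rule can be sketched as a small function that also reports how many lines it removed. The name `strip_page_numbers` is our own; the pattern mirrors the sed expression above.

```python
import re

# A line consisting only of digits, optionally padded with whitespace.
PAGE_NUMBER_LINE = re.compile(r"^\s*[0-9]+\s*$")


def strip_page_numbers(markdown):
    """Drop digit-only lines; return (cleaned_text, removed_count)."""
    lines = markdown.split("\n")
    kept = [line for line in lines if not PAGE_NUMBER_LINE.match(line)]
    return "\n".join(kept), len(lines) - len(kept)


cleaned, removed = strip_page_numbers("Intro\n12\nBody 3 items\n  7  \nEnd")
```

Like the sed version, this also removes legitimate digit-only lines, for example a lone number inside a code sample, so it is worth reviewing the removed count against the page count.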

Complex footers, such as a separator line followed by a company name and URL, span several lines and need a multi-line regex. The approach is always the same: identify what the footer looks like in the raw output (grep for a distinctive string), then remove it:

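import re

with open('output.md', encoding='utf-8') as f:
    text = f.read()

# Example: footer pattern from a real project.
# The bounded gaps ({0,200}) stop a stray run of underscores from
# swallowing everything up to the next footer, which an unbounded
# .*? combined with re.DOTALL would do.
text, removed = re.subn(
    r'_{10,}.{0,200}?Company Name.{0,200}?company\.com\n?',
    '', text, flags=re.DOTALL)
print(f'removed {removed} footer blocks')

with open('output.md', 'w', encoding='utf-8') as f:
    f.write(text)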

The general pipeline is:

1. Convert PDF → raw Markdown
2. Grep for the repeating footer pattern
3. Strip it with sed or Python re.sub
4. Verify that grep -c <pattern> prints 0
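Finding the footer pattern by eye gets tedious on long documents. One way to surface candidates, sketched below under our own assumptions, is to count lines that repeat once digits are normalised, so that "Page 3 of 10" and "Page 4 of 10" count as the same line. The function name and both thresholds are illustrative.

```python
import re
from collections import Counter


def find_repeated_lines(markdown, min_repeats=3, min_length=4):
    """Return [(normalised_line, count)] for lines that repeat often.

    Digits are replaced with '#' so page-numbered footers collapse
    into one candidate. Thresholds are illustrative assumptions.
    """
    counts = Counter()
    for line in markdown.splitlines():
        normalised = re.sub(r"[0-9]+", "#", line.strip())
        if len(normalised) >= min_length:
            counts[normalised] += 1
    return [(line, n) for line, n in counts.most_common() if n >= min_repeats]


bodies = ["Revenue grew.", "Costs fell.", "Outlook is stable."]
sample = "\n".join(
    f"{body}\nAcme Corp | www.example.com\nPage {i} of 3"
    for i, body in enumerate(bodies, start=1)
)
candidates = find_repeated_lines(sample)
```

The output is a list of suspects, not a verdict: repeated table headers or boilerplate section titles will show up too, so a human (or a stricter rule) still decides what to strip.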

Choosing the Right Tool

The decision tree is short:

Does the PDF have a selectable text layer? (Open it in a viewer and try to select text.) If yes, use pymupdf4llm. If structural hierarchy doesn’t matter, markitdown is faster.
Is the PDF image-based (scanned)? Use pymupdf4llm for its OCR fallback. Accept flat heading hierarchy.
Is heading hierarchy critical and page count under ~30? Use Claude Vision.
Is heading hierarchy critical and page count high? Consider post-processing heuristics first: promote lines matching ^\d+\.\s to ##, ^\d+\.\d+\.\s to ###. It covers the most common cases at zero extra cost.
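The numbering heuristic from the last branch can be sketched in a few lines. The two patterns are the ones named above; the `max_length` guard is our own addition, on the assumption that long lines are more likely numbered list items or prose than headings.

```python
import re

SUBSECTION = re.compile(r"^\d+\.\d+\.\s")  # e.g. "1.1. Scope"
SECTION = re.compile(r"^\d+\.\s")          # e.g. "1. Introduction"


def promote_numbered_headings(markdown, max_length=80):
    """Prefix numbered lines with ## or ### based on their numbering depth."""
    out = []
    for line in markdown.split("\n"):
        if len(line) > max_length:
            out.append(line)  # too long to be a plausible heading
        elif SUBSECTION.match(line):
            out.append("### " + line)
        elif SECTION.match(line):
            out.append("## " + line)
        else:
            out.append(line)
    return "\n".join(out)


text = "1. Introduction\nSome prose.\n1.1. Scope\nMore prose.\n2. Methods"
promoted = promote_numbered_headings(text)
```

Short ordered-list items such as "1. Install the package" match the same pattern, so this works best on documents where numbered lines are known to be section titles.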

Each conversion is independent of the others, so multiple PDFs can be processed in parallel with no shared state to worry about.
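A minimal sketch of that fan-out, assuming pymupdf4llm is installed: `convert_one` and `convert_many` are hypothetical helper names, and the sketch defaults to processes rather than threads because, to our knowledge, PyMuPDF's documentation advises against multithreaded use.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def convert_one(pdf_path):
    """Convert one PDF and write the Markdown next to it as <name>.md."""
    import pymupdf4llm  # imported lazily, inside the worker process

    markdown = pymupdf4llm.to_markdown(str(pdf_path))
    out_path = Path(pdf_path).with_suffix(".md")
    out_path.write_text(markdown, encoding="utf-8")
    return str(out_path)


def convert_many(pdf_paths, convert=convert_one, max_workers=None,
                 executor_cls=ProcessPoolExecutor):
    """Run `convert` over all paths in parallel; results keep input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(convert, [str(path) for path in pdf_paths]))
```

When using the default process pool, call `convert_many` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The `convert` and `executor_cls` parameters exist mainly so the orchestration can be tested with a stub converter.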

We Packaged This Into a Skill

Everything above — tool selection, OCR fallback, footer cleaning, verification — is now packaged into a public agent skill: /utils:pdf.

Install it from the Trobz public skills repository and invoke it directly from your agent session. Point it at a PDF and it picks the right tool, runs the conversion, strips repeating footer patterns, and hands you clean Markdown ready to feed into your agent pipeline.

Next in the series: Part 2 — PowerPoint Decks.