AI · by star · Featured

Convert Everything to Markdown for AI Agents — Part 2: PowerPoint Decks

PPTX is structured XML, so it should be easy to read — and it still isn't. A deck's meaning is spatial: split across slides, hidden in speaker notes, locked in charts and image-only slides. Here's how to convert one to clean Markdown without dropping it.

Key Takeaways: Markdown is the input format agents read best, and a PowerPoint deck is harder to convert than it looks. PDF is hard because it has no structure; PPTX is hard for the opposite reason — it is structured XML, but a deck’s meaning is spatial and spread across slides, a separate speaker-notes stream, tables, charts, and image-only slides. A clean conversion preserves slide boundaries, pulls notes inline, renders tables and charts as data, and never silently drops an image. The right approach depends on the deck: text-heavy decks convert to Markdown directly; graphics-heavy decks need to be rendered to images for Vision. Everything below is packaged into the /utils:pptx skill.

Where This Fits

In Part 1 we covered converting PDFs to Markdown — what the tools are, where they fail, and why post-processing is a required step rather than optional cleanup. This series is about getting common file formats into the one shape agents read cleanly: plain text with unambiguous structure. (If you want the background on why we package these workflows as installable skills at all, see What Agent Skills Are.)

This part is PowerPoint. It is the format people reach for when they want to present an idea — which is exactly what makes it awkward to read back out as text.

PPTX Is Structured — and Still Hard to Read

PDF was a 30-year exercise in layout-first design: it describes glyphs at coordinates, not headings and paragraphs, so a converter has to reconstruct structure from visual heuristics. PowerPoint looks like the opposite problem solved. A .pptx file is a zip of OOXML — Office Open XML — and the structure is right there in the markup: slides, placeholders, bullet outline levels, tables, charts, speaker notes. You would expect it to fall straight out into Markdown.
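Because a .pptx is an ordinary zip archive, Python's standard library is enough to see that structure. The sketch below is illustrative only and is not the skill's code; the function name `deck_inventory` is hypothetical. It assumes the part names PowerPoint conventionally writes (`ppt/slides/slideN.xml`, `ppt/notesSlides/`, `ppt/media/`), and it orders slides by file number, which is a simplification: the authoritative slide order is recorded in `ppt/presentation.xml` and can differ once slides have been reordered.

```python
import re
import zipfile

SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def deck_inventory(pptx):
    """Summarise the parts inside a .pptx (a path or a binary file object)."""
    with zipfile.ZipFile(pptx) as archive:
        names = archive.namelist()

    slide_numbers = sorted(
        int(match.group(1))
        for match in (SLIDE_PART.fullmatch(name) for name in names)
        if match
    )
    return {
        # Ordered by file number; true deck order lives in ppt/presentation.xml.
        "slides": [f"ppt/slides/slide{n}.xml" for n in slide_numbers],
        # Notes are separate parts, which is why a slide-only pass misses them.
        "notes": sorted(n for n in names if n.startswith("ppt/notesSlides/notesSlide")),
        "media": sorted(n for n in names if n.startswith("ppt/media/")),
    }
```

Calling `deck_inventory("deck.pptx")` returns three lists. Comparing their lengths is a quick way to see whether a deck leans on notes or on embedded media before choosing how to convert it.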

It mostly doesn’t, because a deck’s meaning isn’t in the XML tree — it’s in the arrangement. Four things get lost in a naive dump:

  • Slide boundaries dissolve. Concatenate every slide’s text and you lose the single most important structural signal a deck has: where one idea ends and the next begins. An agent reasoning over the result can’t tell slide 3 from slide 4.
  • Speaker notes live in a separate stream. The real argument is often in the notes, not on the slide. They’re stored apart from the slide body, so a converter that only walks the visible placeholders throws away half the content.
  • Tables and charts are data, not prose. A chart is a series of numbers with a title; a table is rows and columns. Flatten either into a run of text and the relationships — which value belongs to which category — are gone.
  • Some slides are only images. A diagram, a screenshot, an architecture drawing — there is no text to extract at all. A text-only pass returns an empty slide and the agent never learns the slide existed.

So the work isn’t parsing the deck. The parser is easy. The work is deciding, slide by slide, what each one actually contains and emitting it in a form an agent can use.
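To show how little the parsing itself involves, here is a minimal sketch of the naive dump described above, using only the standard library. It collects every DrawingML text element (`a:t`) from one slide's XML. The function name `slide_text_runs` is hypothetical and the sample XML is heavily simplified; a real slide part nests these paragraphs inside shapes and carries far more markup.

```python
import xml.etree.ElementTree as ET

# DrawingML namespace used for text inside slide XML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def slide_text_runs(slide_xml):
    """Return the text of every <a:t> element in one slide's XML, in document order."""
    root = ET.fromstring(slide_xml)
    return [node.text or "" for node in root.iter(f"{{{A_NS}}}t")]


# Simplified stand-in for a slide part: a title and one indented bullet.
SAMPLE = f"""
<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
       xmlns:a="{A_NS}">
  <a:p><a:r><a:t>Quarterly results</a:t></a:r></a:p>
  <a:p><a:pPr lvl="1"/><a:r><a:t>Revenue up</a:t></a:r></a:p>
</p:sld>
""".strip()

print(slide_text_runs(SAMPLE))
```

The result is a flat list of two strings. The outline level (`lvl="1"`), the slide boundary, the notes, and any images are all absent from it, which is the gap the rest of the conversion has to close.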

What an Agent Actually Needs From a Deck

A good conversion produces a consistent, parseable block per slide:

  • An unambiguous slide anchor — every slide gets a <!-- Slide number: N --> marker, so position survives the conversion and the agent can reference “slide 7” precisely.
  • Speaker notes inline — pulled in under each slide’s content, behind a consistent ### Notes: header that’s emitted even when empty, so the structure never shifts.
  • Tables as GitHub-flavoured Markdown tables, and charts as data tables with the chart title as a header row and one column per series — the numbers preserved, not narrated.
  • Images referenced, never dropped — every embedded image is extracted and given an inline > [IMAGE: …] marker at the point it appeared. Even when extraction is turned off, the marker stays so the agent knows content is missing.
  • A Vision fallback for graphics-only slides: when a slide is pure illustration, the answer is to rasterise it to PNG and let the model read it as an image instead of pretending there’s text.
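The per-slide block described above can be sketched as a small formatter. This is an illustration under stated assumptions, not the skill's implementation. The function names, the input shapes, the `#` title heading, the two-space bullet indent, and the label written inside each `[IMAGE: …]` marker are all hypothetical. Placing the chart title in the first header cell is one possible reading of "chart title as a header row". For brevity the sketch lists images after the bullets rather than at the point where each one appeared.

```python
def chart_to_table(title, categories, series):
    """Render chart data as a GFM table: title in the header row, one column per series."""
    header = [title, *series.keys()]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for index, category in enumerate(categories):
        values = [str(points[index]) for points in series.values()]
        lines.append("| " + " | ".join([category, *values]) + " |")
    return "\n".join(lines)


def slide_to_markdown(number, title, bullets, images=(), notes=""):
    """Emit one slide block. `bullets` is a list of (outline_level, text) pairs."""
    lines = [f"<!-- Slide number: {number} -->", f"# {title}"]
    for level, text in bullets:
        # Indent by outline level so the bullet hierarchy survives.
        lines.append(f"{'  ' * level}- {text}")
    for label in images:
        lines.append(f"> [IMAGE: {label}]")
    # The notes header is always written so every block has the same shape.
    lines.append("### Notes:")
    if notes:
        lines.append(notes)
    return "\n".join(lines)


print(slide_to_markdown(7, "Pipeline", [(0, "Ingest"), (1, "Validate")], ["diagram.png"]))
print(chart_to_table("Revenue", ["Q1", "Q2"], {"2023": [10, 14], "2024": [12, 17]}))
```

Because every block starts with the same anchor and ends with the same notes header, an agent (or a plain regular expression) can split the output back into slides without guessing.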

Here is that in practice. A single slide — title, two bullet groups, two illustrations:

Figure 1: A source slide titled “Essential Communication Skills”, with a title, two bullet groups (Active Listening and Open-Mindedness), and two illustrations.

…becomes a clean Markdown block — the slide anchor, the bullet hierarchy by outline level, both images referenced rather than dropped, and the ### Notes: header emitted even when empty:

Figure 2: The same slide as Markdown, with a slide-number anchor (Slide number 61), nested bullets, both images referenced by IMAGE markers, and a Notes header.

The Vision fallback marks the key difference from PDF: PowerPoint conversion is not one pipeline. It’s a small set of operations, and you pick by what the deck is made of.

Five Operations

Before you run anything, check the prerequisites — most operations need only one tool:

  • uv — required for every operation; it installs the Python dependencies on first run.
  • libreoffice + poppler — only for render (the deck is converted to PDF, then rasterised).
  • tesseract — only when you pass --ocr (on markdown or images).
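A pre-flight check along these lines can be sketched with `shutil.which`. The mapping mirrors the prerequisite list above, but the executable names are assumptions rather than anything the skill documents: LibreOffice is commonly installed as `libreoffice` or `soffice`, and poppler's rasteriser is `pdftoppm`. The function name `missing_tools` is hypothetical.

```python
import shutil

# Tool -> candidate executable names (assumed; adjust for your platform).
REQUIRED = {"uv": ("uv",)}
RENDER_ONLY = {"libreoffice": ("libreoffice", "soffice"), "poppler": ("pdftoppm",)}
OCR_ONLY = {"tesseract": ("tesseract",)}


def missing_tools(operation, ocr=False, which=shutil.which):
    """Return the prerequisite tools that cannot be found on PATH."""
    needed = dict(REQUIRED)
    if operation == "render":
        needed.update(RENDER_ONLY)
    if ocr:
        needed.update(OCR_ONLY)
    return [
        tool
        for tool, executables in needed.items()
        if not any(which(name) for name in executables)
    ]


print(missing_tools("render"))
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default is enough.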

The skill exposes five focused operations:

  • markdown (the default) — convert the whole deck to Markdown: slide markers, bullet hierarchy by outline level, notes inline, tables and charts as data, images extracted and referenced. This is the right choice for any text-heavy deck.
  • outline — a fast table-of-contents scan: each slide’s number, title, whether it has body content, its image count, and whether it has notes. Use it to decide which slides are worth drilling into on a long deck before you extract everything.
Figure 3: The outline operation output, a numbered list with one line per slide, each tagged with content, image count, and notes status.
  • tables — pull just the tables out, as text or one CSV per table, when the deck is really a carrier for structured data.
  • images — dump the embedded images on their own, optionally OCR’d into sidecar text, when you want the illustrations separately from a conversion.
  • render: rasterise each slide to PNG (via LibreOffice headless, then poppler) for Vision processing. This is the path for graphics-heavy decks and for any slide that came back empty from a text pass. It is also the most expensive operation: each slide is sent to the model as an image, so token cost scales with slide count. Reserve it for graphics-heavy decks or the handful of slides a text pass left blank, not the whole deck by default.

One category reliably lands in that “came back empty” bucket: SmartArt and grouped shapes. The text pass finds no extractable text in them and emits nothing, so treat them as image-only slides and render. Embedded video, audio, and animation aren’t captured at all; the skill extracts text and structure, not media.

The decision tree the skill follows:

  • Text-heavy deck → markdown.
  • Want a quick TOC to decide where to look → outline.
  • Need the structured tabular data → tables.
  • Need the embedded illustrations or screenshot OCR → images.
  • Deck is mostly graphics and you need Vision → render.
  • Mixed deck (some text slides, some image-only) → run markdown first, then follow up with render --slides <those numbers> for the slides that came out empty.
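The decision tree above can be written down as a small function. This is a sketch of the logic as listed, not the skill's own selection code; the function name, the `goal` values, and the use of slide counts as inputs are hypothetical. It also simplifies "mostly graphics" to "no text slides at all", since the text does not give a threshold.

```python
def choose_operations(text_slides, image_only_slides, goal=None):
    """Pick operations for a deck from counts of text slides and image-only slides.

    `goal` covers the explicit requests: "toc", "tables", or "images".
    """
    explicit = {"toc": "outline", "tables": "tables", "images": "images"}
    if goal in explicit:
        return [explicit[goal]]
    if text_slides and image_only_slides:
        # Mixed deck: text pass first, then render only the slides it left empty.
        return ["markdown", "render"]
    if image_only_slides:
        return ["render"]
    return ["markdown"]


print(choose_operations(text_slides=18, image_only_slides=2))
```

Returning a list keeps the mixed-deck case explicit: the second step is a follow-up scoped with `--slides`, not a second full pass over the deck.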

A conversion run looks like this:

# Full deck → Markdown, images extracted alongside, notes included
uv run --project "${CLAUDE_PLUGIN_ROOT}/skills/pptx/scripts" \
  python "${CLAUDE_PLUGIN_ROOT}/skills/pptx/scripts/pptx-to-markdown.py" \
  deck.pptx --output deck.md

Flags let you scope to a slide range (--slides 1-5), drop notes (--no-notes), point the image directory somewhere specific (--images-dir ./imgs), or OCR every extracted image inline (--ocr).

The other operations follow the same shape — only the script name changes. Scan a long deck first:

# Slide-by-slide TOC: title, content/notes flags, image count
uv run --project "${CLAUDE_PLUGIN_ROOT}/skills/pptx/scripts" \
  python "${CLAUDE_PLUGIN_ROOT}/skills/pptx/scripts/pptx-outline.py" \
  deck.pptx

Then, for the graphics-heavy slides that the scan flags as empty, fall back to Vision:

# Rasterise slides 4 and 7 to PNG at 200 DPI for Vision
uv run --project "${CLAUDE_PLUGIN_ROOT}/skills/pptx/scripts" \
  python "${CLAUDE_PLUGIN_ROOT}/skills/pptx/scripts/render-slides.py" \
  deck.pptx --output-dir ./png --slides 4,7 --dpi 200
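Going from the scan to that `--slides` value is mechanical. The sketch below assumes the outline results have already been parsed into dictionaries; the keys `number` and `has_content` are hypothetical, because the outline's machine-readable format is not described here. It selects every slide without body content and joins the numbers in the comma-separated form used in the command above.

```python
def slides_to_render(outline):
    """Build a --slides value from outline rows that have no body content."""
    empty = [row["number"] for row in outline if not row["has_content"]]
    return ",".join(str(number) for number in sorted(empty))


outline = [
    {"number": 1, "has_content": True},
    {"number": 7, "has_content": False},
    {"number": 4, "has_content": False},
]
print(slides_to_render(outline))  # 4,7
```

A title-only divider slide would be selected as well, so on a long deck it may be worth filtering the list by hand before paying for the render.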

The Part Everyone Skips: Verification

In Part 1 the step everyone skips was stripping repeating PDF footers. PowerPoint has its own equivalent: presentation chrome and silently dropped content.

Page-number tokens (‹#›), date placeholders, and footer text are part of the slide master, not the content — they should never end up in the Markdown. And because images are easy to lose, you want to confirm every embedded one actually made it into the output. Two greps catch both:

# Page-number chrome should be fully stripped: prints 0 if clean (grep itself then exits with status 1)
grep -c '‹#›' deck.md

# Every embedded image should be referenced — compare against the deck's image count
grep -c '> \[IMAGE:' deck.md

If the first command prints anything but 0, master-slide chrome leaked into your content. If the second doesn’t match the number of images in the deck, something was dropped. Note that grep -c counts matching lines, not individual matches, so two markers on one line count once. Verification is part of the pipeline, not an afterthought: it is the same discipline that made PDF conversion reliable.
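The same two checks can be run from Python when verification needs to sit inside a script. This is a minimal sketch, with a hypothetical function name, and the expected image count is supplied by the caller (for example, summed from the per-slide image counts that the outline operation reports). Unlike `grep -c`, it counts individual occurrences, not matching lines.

```python
import re

# Same pattern as the grep above: the start of an inline image marker.
IMAGE_MARKER = re.compile(r"> \[IMAGE:")


def verify_markdown(markdown, expected_images):
    """Return a list of problems found in converted Markdown (empty means clean)."""
    problems = []
    chrome = markdown.count("‹#›")
    if chrome:
        problems.append(f"{chrome} page-number token(s) leaked from the slide master")
    found = len(IMAGE_MARKER.findall(markdown))
    if found != expected_images:
        problems.append(f"expected {expected_images} image marker(s), found {found}")
    return problems


sample = "<!-- Slide number: 1 -->\n# Title\n> [IMAGE: a.png]\n### Notes:\n"
print(verify_markdown(sample, expected_images=1))
```

An empty list means both checks passed; anything else names what to investigate before the Markdown goes to an agent.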

We Packaged This Into a Skill

Everything above — the five operations, the decision tree, slide anchors, notes inline, charts as data, image references, the verification greps — is packaged into a public agent skill: /utils:pptx.

Install it from the Trobz public skills repository and invoke it directly from your agent session. Point it at a .pptx and it picks the operation, runs the conversion, and hands you clean Markdown ready to feed into your agent pipeline. It needs uv for the Python scripts; the render operation additionally needs libreoffice and poppler, and --ocr needs tesseract — the skill tells you which is missing if one isn’t installed.

That’s Part 2. Part 1 covered PDF files; if you’re new to the idea of capturing a workflow as an installable, reviewable skill, start with What Agent Skills Are. More formats are on the way in this series — Word documents are next.
