PDFs don't store paragraphs. They store individual positioning commands that paint characters onto a coordinate grid. Understanding this is the key to building an editor.
When you see "Hello World" in a PDF, you might assume it's stored as a plain string. It's not. It's stored as a series of drawing commands in a content stream, a sequence of operands and operators written in a PostScript-like syntax.
Each operator updates the viewer's graphics state. Tf sets the font, Td moves the text position, and Tj paints a string at the current position. The PDF viewer replays these commands in order to reconstruct the page.
| Operator | Purpose | In our corpus |
|---|---|---|
| BT / ET | Begin / end a text object block | every text operation |
| Tf | Set font and size: /F1 12 Tf | once per text run |
| Tm | Set absolute text matrix (position + transform) | 917,874 |
| Td | Move text position relative to current | 8,848 |
| Tj | Show a text string | 925,805 |
| TJ | Show text with per-glyph kerning adjustments | 8,830 |
| Tc / Tw | Character spacing / word spacing | varies |
| cm | Concatenate transformation matrix (coordinate transform) | on 16.6% of pages |
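To make the operator table concrete, here is a minimal sketch that tallies those operators in a decompressed content stream. The sample stream and the `countOperators` helper are illustrative assumptions, not the tooling used for the corpus numbers, and the tokenizer deliberately ignores hex strings, arrays, inline images, and escaped or nested parentheses.

```typescript
// Hypothetical sample: a decompressed content stream that draws "Hello World".
const sampleStream = [
  "BT",
  "/F1 12 Tf",
  "1 0 0 1 72 720 Tm",
  "(Hello) Tj",
  "1 0 0 1 105.3 720 Tm",
  "(World) Tj",
  "ET",
].join("\n");

// The operators listed in the table above.
const TRACKED_OPERATORS = ["BT", "ET", "Tf", "Tm", "Td", "Tj", "TJ", "Tc", "Tw", "cm"];

// Count how often each tracked operator appears.
// Simplification: literal strings "(...)" are blanked out first so their
// contents are never mistaken for operators.
function countOperators(stream: string): Record<string, number> {
  const counts: Record<string, number> = {};
  const withoutStrings = stream.replace(/\([^)]*\)/g, " ");
  const tokens = withoutStrings.split(/\s+/);
  for (const token of tokens) {
    if (TRACKED_OPERATORS.indexOf(token) !== -1) {
      counts[token] = (counts[token] || 0) + 1;
    }
  }
  return counts;
}

const sampleCounts = countOperators(sampleStream);
```

On this sample the tally shows one Tm per Tj, which is the same shape the corpus counts in the table have.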
We analyzed 10 real-world PDFs: pitch decks, academic papers, company handbooks, financial reports, government proposals. Here's what the data says about building an editor.
Most PDFs embed the actual font files used in the document. But here's the catch: to keep file sizes small, PDF generators strip out every letter shape they don't need. If your document never uses the letter "z", the embedded font won't contain it. This is called subsetting.
Think of it like packing for a trip. You don't bring your entire wardrobe — you bring only what you'll wear. A font named ABCDEF+Arial (the random prefix is the giveaway) might only contain 40 of Arial's 3,000+ characters. The rest were left behind.
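The prefix check is easy to sketch. The subset tag is six uppercase letters followed by a plus sign; the `parseFontName` helper and its return shape below are illustrative names, not part of any library.

```typescript
// A subset font name starts with a tag of exactly six uppercase letters,
// then "+", then the original font name.
const SUBSET_TAG = /^[A-Z]{6}\+./;

interface FontNameInfo {
  isSubset: boolean;
  baseName: string; // font name with the subset tag removed
}

function parseFontName(name: string): FontNameInfo {
  if (SUBSET_TAG.test(name)) {
    // Six tag letters plus the "+" sign are the first seven characters.
    return { isSubset: true, baseName: name.slice(7) };
  }
  return { isSubset: false, baseName: name };
}
```

For example, `parseFontName("ABCDEF+Arial")` reports a subset with base name `Arial`, while a plain `Arial` is reported as not subsetted.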
This rules out the simplest editing trick: covering the old text with a white box and stamping new text on top with a different font. Since nearly every font in the document is a stripped-down subset, a replacement font's letter spacing, weight, and sizing would clash visibly. The text would look obviously tampered with.
The only way to edit cleanly is to reuse the exact font that's already embedded in the document. And if the user tries to type a character the font doesn't contain? We have to tell them it's not available.
The constraint: Our editor can only offer characters that already exist in the embedded font. This sounds limiting, but in practice most subsetted fonts contain the full Latin alphabet and common punctuation — more than enough for typical edits like fixing a typo or updating a name.
Here's something that trips up nearly everyone who tries to edit a PDF for the first time: the raw bytes that represent text are not guaranteed to be Unicode, ASCII, or anything human-readable. Each font carries its own private translation table, called an encoding, that maps raw byte values to letter shapes.
Some fonts use standard, well-known translation tables where byte 0x48 means "H" just like you'd expect. But others define completely custom mappings where byte 0x01 might mean "H", 0x02 means "e", and so on. The mapping is arbitrary — chosen by whichever software created the PDF.
This creates a two-way problem. To read existing text, we look up each byte in the font's translation table and get the letter. But to write new text back, we need to run the table in reverse: start with the letter the user typed and figure out which byte value the font uses to represent it. Reading is a dictionary lookup. Writing is searching the dictionary backwards.
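One way to avoid searching backwards on every keystroke is to invert the table once. The sketch below assumes a simple single-byte font whose byte-to-character table has already been extracted; `sampleEncoding`, `buildReverseEncoding`, and `encodeText` are hypothetical names.

```typescript
// Byte value -> character, as read from a font's encoding. Hypothetical data
// for a custom encoding where 0x01 means "H", 0x02 means "e", and so on.
const sampleEncoding: Record<number, string> = { 1: "H", 2: "e", 3: "l", 4: "o" };

// Invert the table once so that writing is also a direct lookup.
function buildReverseEncoding(encoding: Record<number, string>): Record<string, number> {
  const reverse: Record<string, number> = {};
  for (const key of Object.keys(encoding)) {
    const code = Number(key);
    const ch = encoding[code];
    // If several codes map to the same character, keep the lowest code.
    if (!Object.prototype.hasOwnProperty.call(reverse, ch)) {
      reverse[ch] = code;
    }
  }
  return reverse;
}

interface EncodeResult {
  bytes: number[];   // encoded byte values for the characters that were found
  missing: string[]; // distinct characters the font cannot represent
}

function encodeText(text: string, reverse: Record<string, number>): EncodeResult {
  const bytes: number[] = [];
  const missing: string[] = [];
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (Object.prototype.hasOwnProperty.call(reverse, ch)) {
      bytes.push(reverse[ch]);
    } else if (missing.indexOf(ch) === -1) {
      missing.push(ch);
    }
  }
  return { bytes, missing };
}
```

A non-empty `missing` list is the signal to tell the user that a character is not available in the embedded font.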
The good news: about 77% of fonts in our corpus use standard or simple translation tables that are well-documented and straightforward to reverse. The remaining 22.6% use a more complex scheme called CID encoding, common in East Asian languages and some modern PDF generators. CID fonts use multi-byte lookup tables that are significantly harder to reverse-engineer.
Decision: v1 handles the 77% of fonts with simple encodings. Text blocks using CID fonts are displayed normally but marked as non-editable. Expanding to CID support is a well-defined follow-up — hard, but scoped.
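The v1 rule can be sketched as a check on the font's subtype, since CID fonts are wrapped in the composite `Type0` font type. The `TextBlock` shape and `classifyBlocks` name are assumptions for illustration, and treating every non-`Type0` font as editable is a simplification of whatever rule a real implementation would apply.

```typescript
// Font subtypes defined by the PDF format. "Type0" is the composite type
// that wraps CID fonts; the others are simple single-byte fonts.
type FontSubtype = "Type0" | "Type1" | "MMType1" | "TrueType" | "Type3";

interface TextBlock {
  text: string;
  fontSubtype: FontSubtype;
}

interface ClassifiedBlock extends TextBlock {
  editable: boolean;
}

// Blocks set in CID (Type0) fonts stay visible but are flagged read-only.
function classifyBlocks(blocks: TextBlock[]): ClassifiedBlock[] {
  return blocks.map((block) => ({
    text: block.text,
    fontSubtype: block.fontSubtype,
    editable: block.fontSubtype !== "Type0",
  }));
}
```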
You might expect a PDF to store "The quick brown fox" as a single instruction: draw this text at this position. In practice, most PDF generators do something far more verbose. They issue a separate positioning command for every single word — sometimes for every letter — each placed at an absolute coordinate on the page.
Each word gets its own exact coordinate, measured in PDF user-space units (points by default) rather than screen pixels. This makes a four-word sentence consume eight instructions instead of two. An entire paragraph might be 50+ instructions. Why? Because PDF generators don't trust relative spacing: they compute the exact position of every word and hard-code it. This guarantees consistent output regardless of which PDF viewer renders it.
We counted every drawing instruction across 1,438 pages. The two most common instructions — "move to exact position" and "draw text" — appear in an almost perfect 1:1 ratio, confirming the pattern: one position, one word, repeat.
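The ratio can be checked directly against the operator counts in the table earlier in this article. This is plain arithmetic on those published numbers; the variable names are illustrative.

```typescript
// Operator counts copied from the operator table.
const corpusCounts = { Tm: 917874, Td: 8848, Tj: 925805, TJ: 8830 };

// Absolute positioning commands per plain "show text" command.
const positionToDrawRatio = corpusCounts.Tm / corpusCounts.Tj;

// Share of all text-showing operators that use the TJ form.
const tjShare = corpusCounts.TJ / (corpusCounts.Tj + corpusCounts.TJ);

const ratioLabel = positionToDrawRatio.toFixed(3);   // "0.991"
const shareLabel = (tjShare * 100).toFixed(2) + "%"; // "0.94%"
```

So there are roughly 99 absolute positions for every 100 plain draw commands, and the TJ form accounts for under one in a hundred text-showing operators.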
The PDF spec actually defines a more efficient format: a single "draw with spacing" command that renders an entire line of text with precise letter-by-letter positioning baked in. But almost no generators use it: it accounts for less than 1% of draw commands in our corpus. And 86% of text blocks contain multiple separate draw commands rather than a single one.
What this means for editing: When a user changes a paragraph, we can't just swap one string — we have to rip out the entire cluster of positioning-and-drawing commands and replace them. Our approach: collapse all those instructions down to a single position + a single draw command that uses the font's built-in width table to space letters correctly. Fewer instructions, cleaner file, same visual result.
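A minimal sketch of that consolidation for one line of text follows. It assumes every fragment shares one font, size, and baseline, that fragments are whole words separated by a single space, and that the text is already in the font's byte encoding. `Fragment`, `escapePdfString`, and `consolidateLine` are hypothetical names.

```typescript
interface Fragment {
  text: string; // text shown by one Tj
  x: number;    // absolute x from the preceding Tm
  y: number;    // absolute y from the preceding Tm
}

// Backslash and both parentheses must be escaped inside a PDF literal string.
function escapePdfString(s: string): string {
  return s.replace(/([\\()])/g, "\\$1");
}

// Collapse a line's position-and-draw pairs into one Tm and one Tj.
function consolidateLine(fragments: Fragment[], fontName: string, fontSize: number): string {
  if (fragments.length === 0) return "";
  // Reading order: left to right.
  const sorted = fragments.slice().sort((a, b) => a.x - b.x);
  const text = sorted.map((f) => f.text).join(" ");
  const first = sorted[0];
  return [
    "BT",
    "/" + fontName + " " + fontSize + " Tf",
    "1 0 0 1 " + first.x + " " + first.y + " Tm",
    "(" + escapePdfString(text) + ") Tj",
    "ET",
  ].join("\n");
}
```

With a single Tj, the viewer advances each glyph by its width from the font's width table, so the explicit per-word coordinates are no longer needed.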
Putting the research together: 77% of fonts use simple encodings we can reverse, 97% are subsetted so we must reuse the embedded font, and 86% of text blocks are fragmented into dozens of positioning commands we'll need to consolidate. The path to editing is clear — but there's a prerequisite.
A PDF has no concept of a "paragraph" or even a "word." What it gives us is a flat list of individually-positioned characters scattered across the page. To let a user click on a block of text and edit it, we first need to figure out which characters belong together.
Think of it like receiving a box of refrigerator magnets dumped on a table. Each letter has an exact position, but nobody labeled which letters form words, which words form sentences, or which sentences form paragraphs. We have to reconstruct that structure from the positions alone.
The algorithm is spatial clustering. We measure the gaps between characters and use simple distance thresholds to group them. Characters that are close horizontally form words. Words that share a baseline form lines. Lines with consistent spacing form blocks.
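Here is a rough sketch of the first two grouping steps, characters into words and words into lines. The `Glyph` shape, the `clusterIntoLines` name, and both thresholds are assumptions chosen for illustration; grouping lines into blocks would apply the same idea to vertical gaps.

```typescript
interface Glyph {
  ch: string;
  x: number;     // left edge
  y: number;     // baseline (PDF y grows upward)
  width: number; // advance width
}

// Returns one array of words per detected line, top to bottom.
function clusterIntoLines(glyphs: Glyph[], lineTolerance: number, wordGap: number): string[][] {
  // Step 1: glyphs whose baselines are within lineTolerance share a line.
  const byY = glyphs.slice().sort((a, b) => b.y - a.y);
  const lines: Glyph[][] = [];
  for (const g of byY) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(current[0].y - g.y) <= lineTolerance) {
      current.push(g);
    } else {
      lines.push([g]);
    }
  }
  // Step 2: within a line, a horizontal gap wider than wordGap starts a new word.
  return lines.map((line) => {
    line.sort((a, b) => a.x - b.x);
    const words: string[] = [];
    let word = "";
    let prevEnd = 0;
    line.forEach((g, i) => {
      if (i > 0 && g.x - prevEnd > wordGap) {
        words.push(word);
        word = "";
      }
      word += g.ch;
      prevEnd = g.x + g.width;
    });
    words.push(word);
    return words;
  });
}
```

Fixed thresholds are the simplest choice; scaling them by font size is a likely refinement for documents that mix headings and body text.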
Each detected block gives us everything we need to power an editor.
Once we have blocks, editing becomes mechanical: the user clicks a block, we show them the text in an editor, they change it, we re-encode the new text using the same font, calculate letter positions from the font's width table, and write a replacement set of drawing instructions back into the PDF. The hard part was always figuring out which characters go together.
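The "calculate letter positions" step can be sketched for a simple font as follows. In a simple font, `/Widths` is indexed from `/FirstChar` and expressed in thousandths of a text space unit, so each advance is the width divided by 1000 and multiplied by the font size. The `SimpleFontMetrics` and `layoutGlyphs` names are hypothetical, and character spacing (Tc), word spacing (Tw), and horizontal scaling are ignored for brevity.

```typescript
interface SimpleFontMetrics {
  firstChar: number; // value of /FirstChar
  widths: number[];  // /Widths entries, in 1/1000 of a text space unit
}

interface GlyphLayout {
  offsets: number[]; // x offset of each glyph from the start of the run
  total: number;     // total advance of the run
}

function layoutGlyphs(codes: number[], font: SimpleFontMetrics, fontSize: number): GlyphLayout {
  const offsets: number[] = [];
  let x = 0;
  for (const code of codes) {
    offsets.push(x);
    const w = font.widths[code - font.firstChar];
    if (w === undefined) {
      throw new Error("No width for character code " + code);
    }
    x += (w / 1000) * fontSize;
  }
  return { offsets, total: x };
}
```

The `total` value is also what an editor would compare against the original block width to decide whether the edited text still fits.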
Unlike commercial editors that ship multi-megabyte C++ engines compiled to WebAssembly, our constrained approach works as pure TypeScript in the browser. The key insight: we only need to rewrite the operators for the selected block, not re-render the entire document.
Canvas rendering (pdf.js) + block overlay + inline text editor
Content stream parser + text matrix tracker + spatial clustering
Encoding maps (/ToUnicode, /Differences) + /Widths table for glyph metrics
Decompress → locate block operators → encode new text → recalculate positions → recompress
Parse xref + read objects via pdf-lib → incremental save (append, don't rewrite)
The constraint that makes this possible: by only allowing edits with the document's own embedded font and only rewriting the selected block's operators, we sidestep font subsetting and document-wide reflow — the two hardest problems in PDF editing.