Research Notes

How PDFs actually store text

PDFs don't store paragraphs. They store individual positioning commands that paint characters onto a coordinate grid. Understanding this is the key to building an editor.

1,438
pages analyzed
482
unique fonts
97%
fonts subsetted

The Illusion

What you see vs. what's stored

When you see "Hello World" in a PDF, you might assume it's stored as a string. It's not. It's stored as a series of drawing commands, written in a PostScript-like operator syntax, inside what the PDF format calls a content stream.

Each operator changes the graphics state machine. Tf sets the font, Td moves the cursor, Tj paints characters at the current position. The PDF viewer replays these commands to reconstruct the page.
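To make the state machine concrete, here is a minimal sketch of how a viewer might replay a handful of text operators. The `Op` token shape and the `replay` function are hypothetical names chosen for illustration, and only BT, Tf, Td and Tj are modelled; a real parser also has to handle Tm, TJ, cm and glyph advances.

```typescript
// Hypothetical token shape: PDF is postfix, so operands precede the operator.
interface Op {
  operator: string;
  operands: Array<number | string>;
}

interface PaintedRun {
  text: string;
  font: string | null;
  size: number | null;
  x: number;
  y: number;
}

// Replays text operators and records where each Tj starts painting.
// Simplification: Tm, TJ, cm and glyph advances are not modelled.
function replay(ops: Op[]): PaintedRun[] {
  let font: string | null = null;
  let size: number | null = null;
  let x = 0;
  let y = 0;
  const runs: PaintedRun[] = [];

  for (const op of ops) {
    switch (op.operator) {
      case "BT": // a new text object starts at the origin
        x = 0;
        y = 0;
        break;
      case "Tf": // e.g. /F1 12 Tf
        font = String(op.operands[0]);
        size = Number(op.operands[1]);
        break;
      case "Td": // offset from the start of the current line
        x += Number(op.operands[0]);
        y += Number(op.operands[1]);
        break;
      case "Tj": // paint a string at the current position
        runs.push({ text: String(op.operands[0]), font, size, x, y });
        break;
      default: // ET and unmodelled operators leave the state untouched
        break;
    }
  }
  return runs;
}

// "Hello World" written as two runs on the same baseline.
const helloWorld: Op[] = [
  { operator: "BT", operands: [] },
  { operator: "Tf", operands: ["F1", 12] },
  { operator: "Td", operands: [72, 700] },
  { operator: "Tj", operands: ["Hello"] },
  { operator: "Td", operands: [40, 0] },
  { operator: "Tj", operands: ["World"] },
  { operator: "ET", operands: [] },
];
const runs = replay(helloWorld);
```

Here `runs` holds two entries, one per Tj, each tagged with the font, size and position that were current when it was painted.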

Operator Reference

Operator   Purpose                                                     In our corpus
BT / ET    Begin / end a text object block                             every text operation
Tf         Set font and size: /F1 12 Tf                                once per text run
Tm         Set absolute text matrix (position + transform)             917,874
Td         Move text position relative to current                      8,848
Tj         Show a text string                                          925,805
TJ         Show text with per-glyph kerning adjustments                8,830
Tc / Tw    Character spacing / word spacing                            varies
cm         Concatenate transformation matrix (coordinate transform)    on 16.6% of pages

Phase 1 Research

What we found in 1,438 pages

We analyzed 10 real-world PDFs: pitch decks, academic papers, company handbooks, financial reports, government proposals. Here's what the data says about building an editor.

The Font Problem

PDFs carry their own fonts — but only the parts they need

PDFs typically embed the actual font files used in the document. But here's the catch: to keep file sizes small, PDF generators strip out every letter shape they don't need. If your document never uses the letter "z", the embedded font won't contain it. This is called subsetting.

Think of it like packing for a trip. You don't bring your entire wardrobe — you bring only what you'll wear. A font named ABCDEF+Arial (the random prefix is the giveaway) might only contain 40 of Arial's 3,000+ characters. The rest were left behind.
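The prefix check is easy to automate. A small sketch, assuming the font name has already been read out of the font dictionary as a plain string; `isSubsetFontName` and `familyName` are illustrative helper names, not part of any library.

```typescript
// A subset tag is six uppercase letters followed by "+", e.g. "ABCDEF+Arial".
const SUBSET_TAG = /^[A-Z]{6}\+/;

// True when the font name carries a subset tag.
function isSubsetFontName(baseFont: string): boolean {
  return SUBSET_TAG.test(baseFont);
}

// Strips the tag so the underlying family name can be shown to the user.
function familyName(baseFont: string): string {
  return baseFont.replace(SUBSET_TAG, "");
}
```

The tag only says that the font is a subset. Which glyphs survived still has to be read from the embedded font itself.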

97% subsetted
Subsetted fonts — 468 of 482
Full fonts — 14 of 482

This rules out the simplest editing trick: covering the old text with a white box and stamping new text on top with a different font. Since nearly every font in the document is a stripped-down subset, a replacement font's letter spacing, weight, and sizing would clash visibly. The text would look obviously tampered with.

The only way to edit cleanly is to reuse the exact font that's already embedded in the document. And if the user tries to type a character the font doesn't contain? We have to tell them it's not available.

The constraint: Our editor can only offer characters that already exist in the embedded font. This sounds limiting, but in practice most subsetted fonts contain the full Latin alphabet and common punctuation — more than enough for typical edits like fixing a typo or updating a name.

The Translation Problem

The bytes in a PDF aren't the letters you see

Here's something that trips up nearly everyone who tries to edit a PDF for the first time: the raw bytes that represent text are not Unicode, not ASCII, and not human-readable. Each font carries its own private translation table — called an encoding — that maps raw byte values to letter shapes.

Some fonts use standard, well-known translation tables where byte 0x48 means "H" just like you'd expect. But others define completely custom mappings where byte 0x01 might mean "H", 0x02 means "e", and so on. The mapping is arbitrary — chosen by whichever software created the PDF.

% Standard encoding: predictable, easy to work with
byte 0x48 → H    byte 0x65 → e    byte 0x6C → l
% Custom encoding: same letters, totally different bytes
byte 0x01 → H    byte 0x02 → e    byte 0x03 → l

This creates a two-way problem. To read existing text, we look up each byte in the font's translation table and get the letter. But to write new text back, we need to run the table in reverse: start with the letter the user typed and figure out which byte value the font uses to represent it. Reading is a dictionary lookup. Writing is searching the dictionary backwards.
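The backwards lookup can be sketched by inverting the table once and then encoding character by character. This assumes a single-byte encoding that has already been decoded into a `Map` from byte value to character. The function names are illustrative, and the `0x04` entry for "o" is an invented extension of the custom-encoding example above.

```typescript
type EncodeResult =
  | { ok: true; bytes: number[] }
  | { ok: false; missing: string[] };

// Builds the reverse table (character -> byte). If two bytes map to the
// same character, the first one seen wins (an arbitrary choice).
function invertEncoding(table: Map<number, string>): Map<string, number> {
  const reverse = new Map<string, number>();
  table.forEach((ch, code) => {
    if (!reverse.has(ch)) reverse.set(ch, code);
  });
  return reverse;
}

// Encodes user text, or reports which characters the font cannot represent.
function encodeText(text: string, reverse: Map<string, number>): EncodeResult {
  const bytes: number[] = [];
  const missing: string[] = [];
  for (const ch of Array.from(text)) {
    const code = reverse.get(ch);
    if (code === undefined) {
      if (missing.indexOf(ch) === -1) missing.push(ch);
    } else {
      bytes.push(code);
    }
  }
  return missing.length === 0 ? { ok: true, bytes } : { ok: false, missing };
}

// Custom encoding from the example, plus an illustrative entry for "o".
const customTable = new Map<number, string>([
  [0x01, "H"],
  [0x02, "e"],
  [0x03, "l"],
  [0x04, "o"],
]);
const reverseTable = invertEncoding(customTable);
const hello = encodeText("Hello", reverseTable); // every letter is available
const help = encodeText("Help", reverseTable);   // "p" is not in the table
```

Returning the list of missing characters, rather than failing on the first one, lets the editor tell the user exactly which letters the embedded font lacks.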

Encoding            Share    Fonts
Standard tables     65.4%    315
Complex (CID)       22.6%    109
Windows standard    12.0%    58
TrueType            3.9%     19

The good news: 77% of fonts in our corpus (the standard and Windows standard tables combined) use simple translation tables that are well-documented and straightforward to reverse. The remaining 22.6% use a complex scheme called CID encoding, common for East Asian scripts and some modern PDF generators. CID fonts use multi-byte lookup tables that are significantly harder to reverse-engineer.

Decision: v1 handles the 77% of fonts with simple encodings. Text blocks using CID fonts are displayed normally but marked as non-editable. Expanding to CID support is a well-defined follow-up — hard, but scoped.
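One way to express that decision in code is a small classifier over the font dictionary. This is a sketch under stated assumptions: `FontInfo` is a hypothetical, already-parsed view of the dictionary, and the test relies on the PDF convention that composite (CID-keyed) fonts have /Subtype /Type0 and commonly use the Identity-H or Identity-V encoding.

```typescript
// Hypothetical, already-parsed view of a PDF font dictionary.
interface FontInfo {
  subtype: string;   // value of /Subtype, e.g. "Type1", "TrueType", "Type0"
  encoding?: string; // value of /Encoding when it is a name
}

type Editability =
  | { editable: true }
  | { editable: false; reason: string };

// v1 rule from the text: simple encodings are editable, CID fonts are not.
function classifyFont(font: FontInfo): Editability {
  const isCid =
    font.subtype === "Type0" ||
    font.encoding === "Identity-H" ||
    font.encoding === "Identity-V";
  return isCid
    ? { editable: false, reason: "CID font: shown normally, not editable in v1" }
    : { editable: true };
}

const winAnsiFont = classifyFont({ subtype: "TrueType", encoding: "WinAnsiEncoding" });
const cidFont = classifyFont({ subtype: "Type0", encoding: "Identity-H" });
```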

The Fragmentation Problem

A sentence is stored as dozens of separate commands

You might expect a PDF to store "The quick brown fox" as a single instruction: draw this text at this position. In practice, most PDF generators do something far more verbose. They issue a separate positioning command for every single word — sometimes for every letter — each placed at an absolute coordinate on the page.

What you'd expect
% start text
set font Helvetica, 12pt
move to x:50 y:700
draw "The quick brown fox"
% end text
% One position, one draw.
% Simple to find, simple to replace.
What PDF generators actually produce
% start text
set font Helvetica, 12pt
move to x:50 y:700
draw "The"
move to x:78 y:700
draw "quick"
move to x:122 y:700
draw "brown"
move to x:168 y:700
draw "fox"
% end text

Each word gets its own exact page coordinate (PDF measures in points, not pixels). This makes a four-word sentence consume eight position-and-draw instructions instead of two. An entire paragraph might be 50+ instructions. Why? Because PDF generators don't trust relative spacing: they compute the exact position of every word and hard-code it. This guarantees identical layout regardless of which PDF viewer renders it.

We counted every drawing instruction across 1,438 pages. The two most common instructions — "move to exact position" and "draw text" — appear in an almost perfect 1:1 ratio, confirming the pattern: one position, one word, repeat.
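The tally behind that ratio is simple to reproduce. A minimal sketch, assuming the content streams have already been tokenized into a flat list of operator names; "Tj" and "Tm" are the operators that the plain-language labels "draw text" and "move to exact position" refer to in the operator reference.

```typescript
// Counts how often each operator name occurs.
function tallyOperators(operators: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const name of operators) {
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  return counts;
}

// Ratio of "draw text" (Tj) to "set exact position" (Tm).
function drawToPositionRatio(counts: Map<string, number>): number {
  const draws = counts.get("Tj") ?? 0;
  const moves = counts.get("Tm") ?? 0;
  return moves === 0 ? Infinity : draws / moves;
}

// A three-word fragment: one Tm per Tj.
const sampleCounts = tallyOperators(["BT", "Tf", "Tm", "Tj", "Tm", "Tj", "Tm", "Tj", "ET"]);
const sampleRatio = drawToPositionRatio(sampleCounts);

// The corpus totals reported in these notes.
const corpusCounts = new Map<string, number>([
  ["Tj", 925805],
  ["Tm", 917874],
]);
const corpusRatio = drawToPositionRatio(corpusCounts); // roughly 1.009
```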

Instruction           Count      Share
Draw text             925,805    99%
Set exact position    917,874    98%
Draw with spacing     8,830      <1%
Relative move         8,848      <1%

The PDF spec actually defines a more efficient format: a single "draw with spacing" command that renders an entire line of text with precise letter-by-letter positioning baked in. But almost no generators use it — less than 1% in our corpus. And 86% of text blocks contain multiple separate draw commands rather than a single one:

86.3% of blocks (31,889) contain multiple draw commands
13.7% of blocks (5,069) contain a single draw command

What this means for editing: When a user changes a paragraph, we can't just swap one string — we have to rip out the entire cluster of positioning-and-drawing commands and replace them. Our approach: collapse all those instructions down to a single position + a single draw command that uses the font's built-in width table to space letters correctly. Fewer instructions, cleaner file, same visual result.
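A rough sketch of that consolidation for one line of text. It assumes the fragments are whole words on a single baseline, that the font contains a space glyph, and that the text can be written as a literal PDF string; `Fragment` and `collapseLine` are illustrative names. Note that the rebuilt line takes its word spacing from the font's space width, which may differ slightly from the original hard-coded positions.

```typescript
interface Fragment {
  text: string;
  x: number;
  y: number;
}

// Escapes the characters that are special inside a PDF literal string.
function escapePdfString(s: string): string {
  return s.replace(/[\\()]/g, (c) => "\\" + c);
}

// Collapses word fragments on one line into a single position + draw.
function collapseLine(fragments: Fragment[], fontName: string, fontSize: number): string {
  const sorted = fragments.slice().sort((a, b) => a.x - b.x);
  const first = sorted[0];
  if (!first) return ""; // nothing to draw
  const text = sorted.map((f) => f.text).join(" ");
  return `BT /${fontName} ${fontSize} Tf ${first.x} ${first.y} Td (${escapePdfString(text)}) Tj ET`;
}

// The four fragments from the example above.
const collapsed = collapseLine(
  [
    { text: "The", x: 50, y: 700 },
    { text: "quick", x: 78, y: 700 },
    { text: "brown", x: 122, y: 700 },
    { text: "fox", x: 168, y: 700 },
  ],
  "F1",
  12,
);
// collapsed === "BT /F1 12 Tf 50 700 Td (The quick brown fox) Tj ET"
```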

So What Can We Actually Edit?

Most of the text. But we need to solve one more problem first.

Putting the research together: 77% of fonts use simple encodings we can reverse, 97% are subsetted so we must reuse the embedded font, and 86% of text blocks are fragmented into dozens of positioning commands we'll need to consolidate. The path to editing is clear — but there's a prerequisite.

A PDF has no concept of a "paragraph" or even a "word." What it gives us is a flat list of individually positioned characters scattered across the page. To let a user click on a block of text and edit it, we first need to figure out which characters belong together.

Think of it like receiving a box of refrigerator magnets dumped on a table. Each letter has an exact position, but nobody labeled which letters form words, which words form sentences, or which sentences form paragraphs. We have to reconstruct that structure from the positions alone.

Reconstructing Structure

Characters → Words → Lines → Editable Blocks

The algorithm is spatial clustering. We measure the gaps between characters and use simple distance thresholds to group them. Characters that are close horizontally form words. Words at the same vertical position form lines. Lines with consistent spacing form blocks. The first step extracts each character with its exact (x, y) position on the page.

The Distance Rules

// Are two characters part of the same word?
if horizontal gap < font size * 0.3 → same word
// Are two words on the same line?
if they overlap vertically > 50% AND horizontal gap < font size * 2 → same line
// Are two lines part of the same block?
if vertical gap matches the line spacing AND they overlap horizontally > 50% → same block
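The rules above translate fairly directly into predicates. In this sketch the `Box` shape, the choice to measure overlap against the shorter of the two extents, and the 10% tolerance for "matches the line spacing" are all assumptions made for illustration; the thresholds 0.3, 2 and 50% come from the rules themselves.

```typescript
interface Box {
  x: number;      // left edge
  y: number;      // bottom edge
  width: number;
  height: number;
  fontSize: number;
}

// Overlap of two 1-D intervals as a fraction of the shorter one.
function overlapFraction(a0: number, a1: number, b0: number, b1: number): number {
  const shared = Math.min(a1, b1) - Math.max(a0, b0);
  const shorter = Math.min(a1 - a0, b1 - b0);
  return shorter > 0 ? Math.max(0, shared) / shorter : 0;
}

function gapBetween(left: Box, right: Box): number {
  return right.x - (left.x + left.width);
}

function sameWord(left: Box, right: Box): boolean {
  return gapBetween(left, right) < left.fontSize * 0.3;
}

function sameLine(left: Box, right: Box): boolean {
  const vertical = overlapFraction(left.y, left.y + left.height, right.y, right.y + right.height);
  return vertical > 0.5 && gapBetween(left, right) < left.fontSize * 2;
}

function sameBlock(upper: Box, lower: Box, lineSpacing: number, tolerance = 0.1 * lineSpacing): boolean {
  const horizontal = overlapFraction(upper.x, upper.x + upper.width, lower.x, lower.x + lower.width);
  const verticalGap = Math.abs(upper.y - lower.y);
  return Math.abs(verticalGap - lineSpacing) <= tolerance && horizontal > 0.5;
}

// Groups characters (assumed sorted left to right on one line) into words.
function groupIntoWords(chars: Box[]): Box[][] {
  const words: Box[][] = [];
  let current: Box[] = [];
  for (const ch of chars) {
    const prev = current[current.length - 1];
    if (prev && !sameWord(prev, ch)) {
      words.push(current);
      current = [];
    }
    current.push(ch);
  }
  if (current.length > 0) words.push(current);
  return words;
}

// Three touching glyphs, then a 7pt gap before the fourth (threshold is 3.6pt).
const glyph = (x: number): Box => ({ x, y: 700, width: 7, height: 12, fontSize: 12 });
const words = groupIntoWords([glyph(50), glyph(57), glyph(64), glyph(78)]);
```

With these inputs `words` contains two groups: the first three glyphs, then the fourth on its own.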

What We Get Out

Each detected block gives us everything we need to power an editor:

text: "The quick brown fox"
position: { x: 50, y: 700 }
size: { width: 340, height: 48 }
font: ABCDEF+Helvetica, 12pt
encoding: WinAnsi (reversible)
editable: true
characters: 19 chars with positions
instructions: operators 42-58 in stream
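As a data structure, that record might look roughly like the interface below. The field names and nesting are assumptions made for illustration, and the character positions in the example are evenly spaced placeholders rather than measured values.

```typescript
interface CharPosition {
  char: string;
  x: number;
  y: number;
}

interface TextBlock {
  text: string;
  position: { x: number; y: number };
  size: { width: number; height: number };
  font: { name: string; sizePt: number };
  encoding: { name: string; reversible: boolean };
  editable: boolean;
  characters: CharPosition[];
  // Inclusive operator indices into the page's content stream.
  operatorRange: { start: number; end: number };
}

// Basic sanity checks an editor might run before offering a block for editing.
function isConsistent(block: TextBlock): boolean {
  return (
    block.characters.length === Array.from(block.text).length &&
    block.operatorRange.start <= block.operatorRange.end &&
    (!block.editable || block.encoding.reversible)
  );
}

const sampleText = "The quick brown fox";
const sampleBlock: TextBlock = {
  text: sampleText,
  position: { x: 50, y: 700 },
  size: { width: 340, height: 48 },
  font: { name: "ABCDEF+Helvetica", sizePt: 12 },
  encoding: { name: "WinAnsi", reversible: true },
  editable: true,
  // Placeholder positions: evenly spaced, not real glyph metrics.
  characters: Array.from(sampleText).map((char, i) => ({ char, x: 50 + i * 6, y: 700 })),
  operatorRange: { start: 42, end: 58 },
};
```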

Once we have blocks, editing becomes mechanical: the user clicks a block, we show them the text in an editor, they change it, we re-encode the new text using the same font, calculate letter positions from the font's width table, and write a replacement set of drawing instructions back into the PDF. The hard part was always figuring out which characters go together.
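The "calculate letter positions from the font's width table" step can be sketched as follows. For simple fonts, PDF stores glyph widths in thousandths of a text-space unit, indexed from /FirstChar, so the advance is width × font size ÷ 1000. The `SimpleFontMetrics` shape and the width values here are hypothetical, and character spacing, word spacing and horizontal scaling are ignored.

```typescript
// Hypothetical view of a simple font's /FirstChar and /Widths entries.
interface SimpleFontMetrics {
  firstChar: number;
  widths: number[]; // in 1/1000 of a text-space unit
}

// Horizontal advance of one glyph, in points.
function glyphAdvance(code: number, metrics: SimpleFontMetrics, fontSize: number): number {
  const width = metrics.widths[code - metrics.firstChar];
  if (width === undefined) {
    throw new RangeError(`code ${code} has no entry in the width table`);
  }
  return (width * fontSize) / 1000;
}

// Start position of every glyph, plus where the run ends.
function layoutGlyphs(
  codes: number[],
  startX: number,
  metrics: SimpleFontMetrics,
  fontSize: number,
): { xs: number[]; endX: number } {
  const xs: number[] = [];
  let x = startX;
  for (const code of codes) {
    xs.push(x);
    x += glyphAdvance(code, metrics, fontSize);
  }
  return { xs, endX: x };
}

// Illustrative widths for codes 65, 66 and 67 at 10pt.
const demoMetrics: SimpleFontMetrics = { firstChar: 65, widths: [600, 500, 700] };
const layout = layoutGlyphs([65, 66, 67], 50, demoMetrics, 10);
// Advances are 6, 5 and 7 points, so glyphs start at 50, 56, 61 and the run ends at 68.
```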

System Design

Five layers, no WASM needed

Unlike commercial editors that ship multi-megabyte C++ engines compiled to WebAssembly, our constrained approach works as pure TypeScript in the browser. The key insight: we only need to rewrite the operators for the selected block, not re-render the entire document.

Browser UI

Canvas rendering (pdf.js) + block overlay + inline text editor

Block Detector

Content stream parser + text matrix tracker + spatial clustering

Font Decoder

Encoding maps (/ToUnicode, /Differences) + /Widths table for glyph metrics

Content Stream Rewriter

Decompress → locate block operators → encode new text → recalculate positions → recompress

PDF I/O

Parse xref + read objects via pdf-lib → incremental save (append, don't rewrite)
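The "locate block operators, then replace them" step of the rewriter comes down to a splice over the parsed operator list. A minimal sketch, assuming operators are held in an array and a block is identified by an inclusive index range, as in "operators 42-58"; `replaceOperatorRange` is an illustrative name, and decompression and recompression of the stream are out of scope here.

```typescript
// Returns a new operator list with [start, end] (inclusive) swapped out.
function replaceOperatorRange<T>(
  ops: readonly T[],
  start: number,
  end: number,
  replacement: readonly T[],
): T[] {
  if (start < 0 || end >= ops.length || start > end) {
    throw new RangeError(`invalid operator range ${start}-${end}`);
  }
  return [...ops.slice(0, start), ...replacement, ...ops.slice(end + 1)];
}

// Two position + draw pairs collapse into one; everything else is untouched.
const before = ["q", "BT", "Tm", "Tj", "Tm", "Tj", "ET", "Q"];
const after = replaceOperatorRange(before, 2, 5, ["Td", "Tj"]);
// after is ["q", "BT", "Td", "Tj", "ET", "Q"]
```

Because the function returns a new array and leaves the rest of the list alone, operators outside the selected block are written back exactly as they were read.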

v1 Editability

Supported in v1:
Standard-encoded text (77%)
Axis-aligned text
Existing font glyphs only
Single-block edits

Not supported in v1:
Identity-H / CID fonts (22.6%)
Rotated or skewed text
Form fields
Images & vector graphics
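The axis-aligned restriction can be tested from the matrices alone. A PDF matrix is written as six numbers [a b c d e f]; with no rotation or skew, b and c are zero. The sketch below assumes the text matrix (Tm) and the current transformation matrix (cm) have already been tracked by the parser, and it treats mirrored text (negative a or d) as not axis-aligned, which is a simplifying choice.

```typescript
// [a, b, c, d, e, f] as written before a Tm or cm operator.
type Matrix = [number, number, number, number, number, number];

// Product m1 x m2 under PDF's row-vector convention.
function multiply(m1: Matrix, m2: Matrix): Matrix {
  const [a1, b1, c1, d1, e1, f1] = m1;
  const [a2, b2, c2, d2, e2, f2] = m2;
  return [
    a1 * a2 + b1 * c2,
    a1 * b2 + b1 * d2,
    c1 * a2 + d1 * c2,
    c1 * b2 + d1 * d2,
    e1 * a2 + f1 * c2 + e2,
    e1 * b2 + f1 * d2 + f2,
  ];
}

// True when the matrix only scales and translates (no rotation, skew or flip).
function isAxisAligned(m: Matrix, epsilon = 1e-6): boolean {
  const [a, b, c, d] = m;
  return Math.abs(b) <= epsilon && Math.abs(c) <= epsilon && a > 0 && d > 0;
}

const plainText: Matrix = [12, 0, 0, 12, 50, 700];
const rotate90: Matrix = [0, 1, -1, 0, 0, 0];
// Text that looks upright in its own matrix but sits on a rotated page transform.
const effective = multiply([1, 0, 0, 1, 50, 700], rotate90);
```

Checking the product matters because 16.6% of pages in the corpus use cm, so text with an innocuous-looking Tm can still end up rotated on the page.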

The constraint that makes this possible: by only allowing edits with the document's own embedded font and only rewriting the selected block's operators, we sidestep font subsetting and document-wide reflow — the two hardest problems in PDF editing.