How PDF Remediation Works

Modified on: Wed, 24 Jun, 2026 at 2:00 PM

Summary

PDF remediation in Siteimprove works by automatically analyzing documents for accessibility issues and applying fixes (such as adding tags, structure, and headings) to make them compliant with standards like WCAG and PDF/UA. The process reduces manual effort by combining automation with optional human validation to ensure accuracy and usability.

Overview

PDF remediation is the process of transforming a PDF into an accessible, machine-readable document that can be interpreted by assistive technologies like screen readers. In Siteimprove, this process is handled through an automated workflow that identifies accessibility issues, applies structural fixes (such as tagging, reading order, and metadata), and outputs a remediated document aligned with accessibility standards.

This approach streamlines what is traditionally a manual and time-intensive task by enabling organizations to remediate PDFs at scale while maintaining compliance and improving usability for all users.

Choose how the agent remediates your PDF

When you start a remediation, you choose how Siteimprove should approach your PDF. You can pick one of two modes: Fix gaps or Start fresh. Both modes run the same underlying fixes for things like metadata, language, fonts, structure repair, and bookmarks. The difference is in how the agent treats the tags that are already in your PDF.

Fix gaps

Use Fix gaps when your PDF is already tagged and you want to keep what’s there. The agent will leave your existing tags in place and fill the gaps. If a heading is already correctly tagged, the agent will preserve it. If a figure already has alt text, the agent will preserve it. If the document already has a title or a language defined, the agent will preserve it. Fix gaps is the right choice for PDFs that were authored with accessibility in mind (for example, exported from a properly structured Word document or InDesign file) but still have a few accessibility issues to address.

Start fresh

Use Start fresh when your PDF is untagged, badly tagged, or when you want a clean start. The agent will auto-tag the document from scratch and overwrite the existing tag tree. Headings are inferred from font style and renumbered into a clean H1–H6 hierarchy. Lists, links, and tables are re-tagged based on their visual structure. The reading order is set from the document’s structure rather than the existing tags. Start fresh is the right choice for PDFs that have no tags, or that are tagged so poorly that Fix gaps would leave too many problems behind.

Which mode should I choose?

If you’re not sure, start with Fix gaps. The agent will preserve any good tagging that’s already there and only fix what’s broken. If you’re unhappy with the result (for example, headings are missing or the reading order is wrong), run the document again with Start fresh for a clean re-tag.
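To make the difference between the two modes concrete, here is a minimal sketch in Python. It is an illustration only, not the agent's actual logic: the function name choose_tag and the mode strings "fix_gaps" and "start_fresh" are hypothetical, and the sketch treats any existing tag as worth keeping, whereas the agent also repairs tags that are broken.

```python
from typing import Optional


def choose_tag(existing: Optional[str], inferred: str, mode: str) -> str:
    """Pick the tag for one piece of content under the given mode (hypothetical helper)."""
    if mode == "fix_gaps":
        # Keep whatever tag is already there; only fill in when none exists.
        return existing if existing else inferred
    if mode == "start_fresh":
        # Ignore the existing tag and use the freshly inferred one.
        return inferred
    raise ValueError(f"unknown mode: {mode}")
```

With an existing H2 tag and an inferred P tag, the sketch keeps H2 under Fix gaps and returns P under Start fresh.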

What the agent will do for you

Whichever mode you choose, the agent runs a set of fixes that cover document properties, the reading structure that assistive technology uses, the content and its alternatives, how users navigate the document, and the under-the-hood quality and compliance checks.

Here’s what the agent will do.

Document properties

The information that identifies, locates, and describes the PDF. This is the first thing assistive technology reads when a document opens.

Document title

The agent will set the document’s title, the human-readable name a screen reader announces when the document opens, and the name shown in the browser tab or PDF viewer’s title bar.

If your source PDF didn’t have a title set, the agent will pick one from the document content using a cascading fallback: first H1 heading, then H2, then H3, then any heading tag, then the first paragraph, then the filename. If your source already has a title, the agent will preserve it. The agent will also turn on the “Display Title in title bar” flag so PDF viewers show the title rather than the filename.
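The fallback order described above can be sketched in a few lines of Python. This is a simplified illustration, not the agent's code: the Element class and the pick_title function are hypothetical names, and the document is assumed to be available as a flat list of tagged elements in reading order.

```python
from dataclasses import dataclass
from typing import Callable, Optional

HEADING_TAGS = {"H", "H1", "H2", "H3", "H4", "H5", "H6"}


@dataclass
class Element:
    tag: str   # structure tag such as "H1", "H2", or "P"
    text: str  # text content of the element


def pick_title(existing_title: Optional[str], elements: list[Element], filename: str) -> str:
    """Return the existing title if one is set, otherwise walk the fallback cascade."""
    if existing_title and existing_title.strip():
        return existing_title.strip()

    def first_text(matches: Callable[[Element], bool]) -> Optional[str]:
        # First non-empty element (in reading order) that satisfies the rule.
        for el in elements:
            if matches(el) and el.text.strip():
                return el.text.strip()
        return None

    cascade: list[Callable[[Element], bool]] = [
        lambda el: el.tag == "H1",
        lambda el: el.tag == "H2",
        lambda el: el.tag == "H3",
        lambda el: el.tag in HEADING_TAGS,  # any heading tag
        lambda el: el.tag == "P",           # first paragraph
    ]
    for rule in cascade:
        found = first_text(rule)
        if found:
            return found
    return filename  # last resort
```

For a document with no H1 but an H2 reading "Budget overview", the sketch returns "Budget overview". For a document with no usable content at all, it returns the filename.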

Document language

The agent will set the document’s primary language and add inline language attributes on individual content elements where they’re missing. The document language tells assistive technology which pronunciation engine to use; without it, screen readers default to the user’s system language, which may pronounce content incorrectly.

PDF/UA-1 standard

The agent will mark the PDF as targeting PDF/UA-1, the international standard for accessible PDFs. This flag signals to anyone receiving your PDF that it was built with accessibility in mind, making it easier to meet procurement requirements, internal policies, or legal obligations that reference PDF/UA-1 by name.
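For readers curious about what this flag looks like, PDF/UA-1 conformance is declared in the document's XMP metadata through a pdfuaid:part property with the value 1. The Python sketch below builds that fragment with the standard library. It only illustrates the shape of the declaration. The function name pdfua_identifier_xml is hypothetical, and nothing here is taken from the agent's implementation.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFUAID_NS = "http://www.aiim.org/pdfua/ns/id/"  # PDF/UA identification schema


def pdfua_identifier_xml(part: int = 1) -> str:
    """Build the RDF fragment that declares the PDF/UA part a file targets."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("pdfuaid", PDFUAID_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""}
    )
    part_el = ET.SubElement(description, f"{{{PDFUAID_NS}}}part")
    part_el.text = str(part)  # "1" means PDF/UA-1
    return ET.tostring(rdf, encoding="unicode")
```

Validators look for this property when deciding whether to check a file against PDF/UA-1.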

Reading structure

The tag tree that screen readers use to read your document. Without good tagging, a PDF is just shapes on a page to assistive technology.

Headings

The agent will detect headings from font style (size, weight) and from leading numbering patterns, then assign the correct level (H1 through H6) and renumber the sequence so there are no skipped levels. Skipped heading levels are one of the most common reasons screen reader users get lost in a document.
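One way to close skipped levels is sketched below in Python. This is an assumption about how such renumbering could work, not a description of the agent's algorithm. The function renumber_headings is a hypothetical name, and the input is assumed to be the detected heading levels in reading order.

```python
def renumber_headings(levels: list[int]) -> list[int]:
    """Close gaps so no heading is more than one level deeper than its parent."""
    result: list[int] = []
    stack: list[int] = []  # original levels of the currently open ancestors
    for level in levels:
        # Close every open heading at the same depth or deeper than this one.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        # The new level is the nesting depth, capped at H6.
        result.append(min(len(stack), 6))
    return result
```

Given the detected levels 1, 3, 3, 5, 2, 4, the sketch produces 1, 2, 2, 3, 2, 3: each jump of more than one level is pulled up so the hierarchy is continuous.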

Tables

Tables in PDFs need to be structured so screen readers know which cells are column or row headers, and which are data. The agent repairs table structure, fixes row and column spans, and identifies and sets header cells where the layout makes them clearly identifiable. Tables where headers could not be reliably determined are flagged for review.

Reading order

Every PDF has an internal index that links each piece of content to its structural tag. When this index is broken or out of sync, screen readers can lose their place, skip content, or read sections in the wrong order. The agent rebuilds this index from scratch, running the repair once at the start of remediation and once again after all other changes have been applied, so the final reading order is consistent and correct.

Content and alternatives

The non-text content (figures, form fields, annotations) and the descriptions that make them usable for screen-reader users.

Alternative text (coming in August 2026)

Applying and generating alternative text for figures directly in the platform is coming in a future release. For now, images and figures that are missing descriptive alt text are listed in the Issues section so your team can add descriptions where needed.

Annotations

The agent will tag previously untagged annotations (form fields, buttons, comments, links) and set a Contents key on each one so screen readers can announce what it is (for example, “Submit button,” “Email address field”). Popup annotations are excluded because they typically duplicate other metadata.

Navigation

How users, particularly keyboard and screen-reader users, move through the document.

Bookmarks

The agent will build a bookmark hierarchy (H1 through H6) from the document’s headings, so even a long document becomes navigable in seconds from the bookmarks panel. If your source already has a curated set of bookmarks, the agent will preserve them.
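Turning a flat list of headings into a nested bookmark tree can be sketched as follows. This is a minimal Python illustration under stated assumptions: the Bookmark class and build_bookmarks function are hypothetical, headings arrive as (level, title) pairs in reading order, and the case where existing bookmarks are preserved is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    title: str
    level: int
    children: list["Bookmark"] = field(default_factory=list)


def build_bookmarks(headings: list[tuple[int, str]]) -> list[Bookmark]:
    """Nest each heading under the nearest preceding heading of a higher level."""
    roots: list[Bookmark] = []
    stack: list[Bookmark] = []  # chain of currently open ancestors
    for level, title in headings:
        node = Bookmark(title, level)
        # Close any open bookmark at the same level or deeper.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots
```

Each H2 ends up as a child of the H1 that precedes it, each H3 as a child of its H2, and so on, which mirrors the collapsible outline shown in a viewer's bookmarks panel.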

Tab order

The agent will set the tab order to follow the document’s structural reading order rather than the elements’ physical position on the page. This is a PDF/UA requirement and prevents bugs where pressing Tab jumps the user to the wrong part of the page.

Quality and compliance

The under-the-hood repairs that make your PDF render cleanly, validate against accessibility standards, and behave consistently across viewers.

Fonts

The agent will embed all fonts used in the document. The file is slightly larger as a result, but text extraction becomes reliable across every viewer. When fonts aren’t embedded, screen readers can produce garbled or missing text.

Metadata

Every PDF carries a block of background information: the date it was created, the software that produced it, and compliance flags. When this information is corrupted, PDF viewers and platforms that index or share documents may fail to open or display the file correctly. The agent repairs this block without overwriting existing valid information. It also removes an internal flag that marks the tagging as unverified, signaling that the document’s structure has been intentionally reviewed.

Required tag IDs

Footnotes and table header cells are required by the PDF/UA-1 standard to have unique reference codes. These codes allow screen readers to reliably locate and navigate back to specific footnotes or table headers within the document. The agent assigns these codes where they are missing.
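Assigning IDs only where they are missing, without colliding with IDs that already exist, can be sketched like this in Python. The StructElem class, the assign_missing_ids function, and the ID naming pattern are hypothetical, chosen for illustration. Note and TH are the standard PDF structure types for footnotes and table header cells.

```python
from dataclasses import dataclass
from typing import Optional

NEEDS_ID = {"Note", "TH"}  # footnotes and table header cells


@dataclass
class StructElem:
    tag: str                  # structure type, e.g. "Note", "TH", "P"
    id: Optional[str] = None  # existing ID, if any


def assign_missing_ids(elems: list[StructElem]) -> None:
    """Give every Note and TH element a unique ID, leaving existing IDs untouched."""
    used = {e.id for e in elems if e.id}
    counter = 0
    for e in elems:
        if e.tag in NEEDS_ID and not e.id:
            # Advance the counter until the candidate is not already taken.
            while True:
                counter += 1
                candidate = f"{e.tag.lower()}-{counter}"
                if candidate not in used:
                    break
            e.id = candidate
            used.add(candidate)
```

Elements that already carry an ID keep it, and elements of other types are left alone.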

Reflow attributes

The agent records the physical position of each image, figure, form field, and table on the page. This positional information is needed for reflow (the ability to rearrange a PDF’s content on a small screen or when the reader is zoomed in significantly) without losing the correct reading order. It also prepares your PDF for the next generation of accessibility standards.

Cleanup

PDFs often accumulate structural clutter: empty elements with no content, technical markers left over from print production workflows (including TrapNet and PrinterMark annotations), and decorative content that carries no meaning for someone using a screen reader. The agent removes the empty and print-production clutter, marks decorative content as invisible to assistive technology so it is skipped during navigation, and fixes missing spaces between words that can cause screen readers to run words together.
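Removing empty elements from a tag tree can be sketched as a short recursive pass. The Python below is a simplified illustration, not the agent's code: the Node class and prune_empty function are hypothetical, and an element counts as empty here when it has no text and no non-empty children. A real implementation would also need to treat images and other non-text content as content.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def prune_empty(node: Node) -> bool:
    """Remove empty descendants in place; return True if the node itself has content."""
    # Prune bottom-up so a container holding only empty elements is removed too.
    node.children = [child for child in node.children if prune_empty(child)]
    return bool(node.text.strip()) or bool(node.children)
```

A section that contains only a whitespace paragraph and an empty span is removed as a whole, while a sibling paragraph with real text is kept.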

What still needs your review

After the agent finishes, a Review section shows the changes it made to key document properties so you can confirm they look right before downloading. Reviewing is optional; you can download the remediated PDF at any time.

The Review section shows:

Title: the document title set in the PDF properties, which is what screen readers announce when the file is opened.
Language: the primary language of the document, which tells screen readers how to pronounce the content correctly.

If either value looks wrong, you can correct it before downloading. Once you are satisfied, download the remediated PDF and re-publish it to your site.

Environment / Applicability

Platform: Siteimprove
Feature: PDF Remediation Agent
Use Case: Automated PDF accessibility remediation
