Why are some PDFs not being checked by Siteimprove?
Customers who subscribe to the Quality Assurance and/or the Accessibility module can have PDFs checked for broken links and accessibility issues.
Info: The number of PDFs listed in Accessibility is normally lower than the number of PDFs found in the QA PDF Inventory. The QA Inventory lists all PDFs identified during the crawl process, but Accessibility lists only internal (locally hosted) PDFs that are not affected by any of the limitations listed below.
If you find that some PDFs on your website are not being checked by Siteimprove, then we recommend you consider the following:
Note: Only Administrators and Account Owners can edit crawl settings.
Site content settings
PDFs may also be excluded from the check through Site Content Settings. Review the URL matches that have been added and compare them with the PDF URLs.
- Go to Quality Assurance > Inventory > Pages
- Select the site that you are investigating
- To edit these settings, please refer to the article “Site Content Settings”
Note: Changing these settings can result in a variation in the number of pages being checked on your website.
There can be other reasons why PDFs are not checked, for example:
- PDF checking must be included in your subscription, and you must be within the number of documents that your subscription allows. This information is available in your Siteimprove agreement.
- PDFs over 20 MB will not be checked
- PDFs need to be the correct MIME type. PDFs that do not identify themselves with the value 'application/pdf' will not be checked
- Your website's robots.txt file may direct our crawler not to check a section of the website
- PDFs that are blocked by a firewall or placed behind authentication will not be checked
- PDFs that are images will not be checked
- PDFs need to be machine-readable
- A PDF will not be crawled if the link to the PDF is broken (e.g. it returns an HTTP 404 error response)
- If your site is set up to crawl your XML-sitemap only, and if PDFs are not shown in that sitemap as links, then the PDFs will not be checked.
- If the link to the PDF is only found via a page that is inserted into your site using a Single Page Check, then the PDF will not be crawled. This is because the Single Page Check analyzes the HTML of the page and checks all available links on it, but it does not crawl any further than that. Single Page Checks are typically inserted via the CMS Plugin, Siteimprove Ads, an Integration (Marketing Automation), or directly via the Siteimprove platform.
- If the PDF level exceeds the max page level setting within the website structure, it will not be checked. The default is 50 levels. Contact Siteimprove to have this increased if required.
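Some of the items in the list above (broken link, authentication, MIME type, file size, robots.txt) can be spot-checked for a single PDF before contacting support. The following is a minimal Python sketch under stated assumptions, not Siteimprove's implementation: the function name `pdf_check_blockers` is hypothetical, 20 MB is read as 20 × 1024 × 1024 bytes, and the wildcard user agent `*` stands in for the crawler's actual user agent, which is not given here.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the 20 MB limit is interpreted as 20 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 20 * 1024 * 1024


def pdf_check_blockers(status, content_type, content_length,
                       robots_txt="", url="", user_agent="*"):
    """Return a list of likely reasons a PDF would not be checked.

    status         -- HTTP status code returned for the PDF URL
    content_type   -- value of the Content-Type response header (or None)
    content_length -- size in bytes (or None if unknown)
    robots_txt     -- text of the site's robots.txt (optional)
    url            -- the PDF URL, only needed for the robots.txt test
    user_agent     -- assumption: "*" is used as a stand-in user agent
    """
    if status in (401, 403):
        return ["PDF is behind authentication or blocked"]
    if status >= 400:
        return ["link to the PDF is broken (HTTP %d)" % status]

    reasons = []

    # Ignore parameters such as "; charset=binary" when comparing.
    mime = (content_type or "").split(";")[0].strip().lower()
    if mime != "application/pdf":
        reasons.append("MIME type is %r, not 'application/pdf'" % mime)

    if content_length is not None and content_length > MAX_PDF_BYTES:
        reasons.append("PDF is larger than 20 MB")

    if robots_txt and url:
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        if not parser.can_fetch(user_agent, url):
            reasons.append("robots.txt disallows this URL")

    return reasons


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    for reason in pdf_check_blockers(
            200, "application/octet-stream", 25 * 1024 * 1024,
            robots, "https://example.com/private/report.pdf"):
        print("-", reason)
```

The status code, Content-Type, and Content-Length values can be taken from the response headers shown in your browser's developer tools for the PDF URL. An empty result only means none of these particular checks failed; the other items in the list (subscription limits, image-only PDFs, sitemap-only crawls, Single Page Checks, page level) still need to be reviewed separately.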
If you have any questions regarding this, please contact Siteimprove technical support.