Update to the robots.txt parser
Note: If you have previously added Siteimprove user agents with extra allow rules to your robots.txt so that we could crawl your website, you will need to add either "SiteimproveBot" or "SiteimproveBot-Crawler" to control what the Siteimprove crawler can access during crawls.
What is happening?
We are updating the robots.txt parser used by the Siteimprove crawler to one that mirrors the parser behavior of search engines such as Google. A robots.txt file tells crawlers which URLs the crawler can access on your site.
When will the changes take effect?
The changes will be rolled out on May 2, 2022.
After this date crawls will use the new parser.
What is the robots.txt parser?
The Siteimprove crawler uses a robots.txt parser to determine which URLs/files it is allowed or disallowed to crawl on a specific domain. The robots.txt file (found at yourwebsite.com/robots.txt) is downloaded and examined for each domain. URLs are then checked against the domain's rules and either included in or excluded from the crawl.
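To make this concrete, here is a minimal sketch of the same allow/disallow check using Python's standard-library `urllib.robotparser`. This is for illustration only: Python's built-in parser is not the parser Siteimprove uses, and its matching behavior differs in details from Google-style parsers. The robots.txt content and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt; a real file would be fetched from
# yourwebsite.com/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# For each URL found during a crawl, the parser answers:
# "may this user agent fetch this URL?"
print(parser.can_fetch("SiteimproveBot", "https://yourwebsite.com/private/page.html"))  # disallowed
print(parser.can_fetch("SiteimproveBot", "https://yourwebsite.com/index.html"))         # allowed
```

URLs matching a `Disallow` rule are discarded; everything else is included in the crawl.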
Why is Siteimprove updating the parser?
Over the past few years, it has become clear that the parsing of robots.txt files varies greatly across different libraries and technologies.
As most sites use the robots.txt file to accommodate search engine crawlers, Siteimprove has decided to move to a mechanism that better suits this scenario and the changing technologies.
What difference will this make to how my robots.txt is interpreted?
There are three main differences:
- The user agent tokens used for checking against robots.txt rules will change to "SiteimproveBot" and "SiteimproveBot-Crawler". (This does not change the user agent string used when fetching customer pages.)
- The parser will now support the wildcard characters "*" (matching any sequence of characters) and "$" (anchoring a rule to the end of a URL), but will not support regular expressions.
- The agent matching will change from “substring matching” to “exact matching”, meaning a rule group applies only when its User-agent value exactly equals the crawler's agent token.
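As an illustration, the following robots.txt fragment uses both wildcard characters (the paths are hypothetical):

```
User-agent: SiteimproveBot
Disallow: /drafts/*     # "*" matches any sequence of characters
Disallow: /*.pdf$       # "$" anchors the rule to the end of the URL
```

Because matching is now exact, these rules apply only when the User-agent line reads exactly "SiteimproveBot" (or "SiteimproveBot-Crawler" for that token), not when it merely contains that text.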
How might this update affect my crawled data?
In most cases, you should not be affected by this update.
If your website does not use disallow rules in the robots.txt file, then you will not experience any changes in the crawl.
If, however, you have specific rules allowing Siteimprove to access your sites, the crawler may be disallowed due to the update of the user agent tokens. See “What do I need to do?” later in this article.
If you have rules containing wildcard patterns, the resulting number of crawled pages may change.
Customers can inspect their website's robots.txt by entering their domain appended by /robots.txt (e.g. yourwebsite.com/robots.txt). If in doubt, please contact your website's administrator.
What do I need to do?
If your website does not use disallow rules in the robots.txt file, then you do not need to do anything.
If your website has strict disallow rules in robots.txt files, please ensure your robots.txt file contains one of the following user agent tokens so we can continue to crawl your website:
- SiteimproveBot
- SiteimproveBot-Crawler
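For example, a robots.txt file that blocks crawlers in general could still grant the Siteimprove crawler full access like this (a sketch; adjust the rules to your own site):

```
# Block all other crawlers from the whole site
User-agent: *
Disallow: /

# Allow the Siteimprove crawler everywhere
User-agent: SiteimproveBot
Disallow:
```

An empty `Disallow:` rule means nothing is disallowed for that user agent.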
Your website's administrator will know how to update your robots.txt file.
Where can I see when this update occurred in the platform?
Graph annotations will be added within the Siteimprove platform indicating when sites were switched to the new robots.txt parser. This will allow you to see when the change happened along with any changes experienced.
The annotations will be visible in the history graphs in Quality Assurance, Accessibility, SEO, and Policy on the day of the robots.txt update.