
Siteimprove's Crawler: Frequently Asked Questions

Modified on: Thu, 31 Oct, 2024 at 7:58 PM

What is the Siteimprove crawler?

Web crawlers are computer programs that scan the web, ‘reading’ everything they find. A crawler starts by visiting your website, systematically identifies all hyperlinks on all pages, and then follows each one to its destination.
Our crawlers scan your website using Siteimprove servers from specific IP addresses with identifiable user agents. Our crawlers use HTTP (Hypertext Transfer Protocol) requests to collect the HTML code on which to carry out error checks.
The data harvested by the crawler is stored in Siteimprove's databases. Based on the content found on each page, information is reported to Siteimprove's online platform, for example accessibility issues, misspellings, and broken links.
Learn more about the Siteimprove crawler and how it identifies broken links.
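
As a rough illustration of the link-identification step described above, the sketch below collects every hyperlink from one page of HTML. It is not Siteimprove's implementation; the names LinkCollector and extract_links and the example.com URLs are assumptions made for this example, and it uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    """Return all hyperlinks found in one page, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample_html = '<p><a href="/about">About</a> <a href="https://example.org/x#top">X</a></p>'
print(extract_links("https://example.com/", sample_html))
# ['https://example.com/about', 'https://example.org/x']
```

A real crawler would repeat this for every newly discovered page and then request each link over HTTP to check whether it is broken.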

Where can I find more information on my website crawl status?

You can find the most recent scan dates and the scan times for your sites in Crawler Management. 

Go to Settings > Crawler Management.


Please note that only Account Owners and Administrators have access to Crawler Management.

Why does Crawler Management show that a crawl is finished but I still can’t see it in QA Check history?

Crawler Management shows a crawl as finished as soon as the crawl itself is complete. The QA check history, however, is only updated once the full scan, including data processing (link checking, accessibility, etc.), is complete.

Under Settings > Crawler Management > Scan History, we show each stage of the scan and its status. If any stage in the scan history table says “Pending”, that scan is not complete.

The QA check history, along with all the data in the platform, will only update when a full scan is complete.

The screenshot below shows a scan where the crawl finished but processing of the data found in the crawl did not. Therefore, the QA check history won’t update.

Scan_history.png

You can read more about the scan stages in the scan process description.

When crawling a site, we analyze (parse) all the URLs. Afterward, we process the data, which includes removing links/pages based on exclusions, aliases, deduplication rules, etc. configured for your website.

  • Crawler Management shows all the pages and links found during a crawl.
  • QA Check History shows the pages and links stored after the crawl data has been processed, meaning those we found minus any pages and links excluded by site content settings, deduplication rules, etc.

See Site Content Settings for more information.
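
To make the difference between "found" and "stored" concrete, here is a minimal sketch of a processing step that drops excluded URLs and duplicates. The function name stored_pages, the wildcard exclusion pattern, and the deduplication rule (ignoring case and a trailing slash) are all assumptions for illustration; the actual rules depend on what is configured for your website.

```python
from fnmatch import fnmatchcase


def stored_pages(found_urls, exclude_patterns):
    """Return the URLs that remain after exclusions and deduplication.

    Both rules are illustrative assumptions:
    - exclusions are shell-style wildcard patterns,
    - two URLs count as duplicates if they differ only by case or a trailing slash.
    """
    seen = set()
    stored = []
    for url in found_urls:
        if any(fnmatchcase(url, pattern) for pattern in exclude_patterns):
            continue  # excluded by a (hypothetical) site content setting
        key = url.rstrip("/").lower()
        if key in seen:
            continue  # duplicate of a page that is already stored
        seen.add(key)
        stored.append(url)
    return stored


found = [
    "https://example.com/",
    "https://example.com",
    "https://example.com/news",
    "https://example.com/archive/2019",
]
stored = stored_pages(found, ["https://example.com/archive/*"])
print(len(found), "found;", len(stored), "stored")  # 4 found; 2 stored
```

In this example, Crawler Management would correspond to the list of 4 found URLs and QA Check History to the 2 stored ones.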

Why does Crawler Management show 0 pages for a site but the products (QA, Accessibility, Policy, SEO, Data Privacy) show all pages?

If we find 0 pages in a crawl, then Crawler Management will show 0 pages, but QA still stores all the pages from the last successful scan. This state will remain until there is a new successful scan that completes all three stages (queue, crawl, processing).

The crawl may find 0 pages due to a site being down temporarily, but this mechanism means users can still work on the results of the last successful scan until the next scan completes. See also "Typical Reasons for Crawl Problems".
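
The fallback described above can be sketched as follows. This is only an illustration under stated assumptions: the Scan class, the "Finished" status label, and the rule that a scan with 0 pages does not count as successful are chosen for the example; only the three stage names and the “Pending” status come from this article.

```python
from dataclasses import dataclass

STAGES = ("queue", "crawl", "processing")


@dataclass
class Scan:
    stage_status: dict  # e.g. {"queue": "Finished", "crawl": "Finished", "processing": "Pending"}
    pages_found: int


def is_successful(scan):
    """A scan counts as successful here if all three stages finished and pages were found."""
    all_done = all(scan.stage_status.get(stage) == "Finished" for stage in STAGES)
    return all_done and scan.pages_found > 0


def scan_shown_in_products(history):
    """Return the most recent successful scan from a history ordered oldest to newest."""
    for scan in reversed(history):
        if is_successful(scan):
            return scan
    return None


finished = {stage: "Finished" for stage in STAGES}
last_good = Scan(dict(finished), pages_found=120)
site_was_down = Scan(dict(finished), pages_found=0)

shown = scan_shown_in_products([last_good, site_was_down])
print(shown.pages_found)  # 120
```

Here the newest scan found 0 pages, so the products keep showing the 120 pages from the previous successful scan.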

How often is my website crawled?

By default, our servers crawl your website every 5 days. This means that 5 days after a scan has finished processing, your site enters the crawl queue again and the scan process restarts.

Between the 5-day crawls, we carry out periodic re-checks of broken links and of pages with broken links, if the content has changed.

It is possible to change this schedule. If you would like to change the frequency of your crawl, please contact Siteimprove.
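
As a small worked example of the default schedule, the sketch below computes when a site would re-enter the crawl queue. The function name next_queue_entry and the sample date are assumptions for illustration; time spent waiting in the queue is not modelled.

```python
from datetime import datetime, timedelta

DEFAULT_CRAWL_FREQUENCY = timedelta(days=5)


def next_queue_entry(processing_completed_at, frequency=DEFAULT_CRAWL_FREQUENCY):
    """Return when the site re-enters the crawl queue after a completed scan."""
    return processing_completed_at + frequency


completed = datetime(2024, 10, 31, 19, 58)
print(next_queue_entry(completed))  # 2024-11-05 19:58:00

# A custom schedule (available on request) would simply use a different frequency.
print(next_queue_entry(completed, timedelta(days=10)))  # 2024-11-10 19:58:00
```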

How would switching from a non-JS crawl to a JS crawl affect my data?

See the article "How would switching from a non-JS crawl to a JS crawl affect my data?"

Where can I see the last and next crawl date?

You can see the last scan and next crawl dates in Settings > Crawler Management > Site overview.

You'll also see both the last and next crawl dates on the Crawl details widget on the Quality Assurance (QA), Accessibility, and SEO Overview pages.

What content can Siteimprove's Crawlers crawl?

  • HTML
  • XML
  • All non-scripted content
  • Scripted content (such as JavaScript & AJAX)*
  • Dynamically loaded content (written text in images, videos, etc.)*

*This type of content is scanned if the JavaScript crawler is enabled.

What content can Siteimprove's Crawlers not crawl?

  • Online shops (such as Shopping Carts)
  • Payment verification
  • Content requiring interaction to be available, such as pages only available if searched for, and forms depending on fields being filled in.
  • Software products/apps

For more information, see the article "Can Siteimprove crawl Single Page Applications (SPAs) and forms?"

Why does “Next crawl scheduled” show a date in the past on the Crawl details widget?

The Crawl details widget can show a “Next crawl scheduled” date in the past when the site is in a queue waiting for a crawl slot. See Crawler queuing for more information.

This can occur if your account allows too few maximum simultaneous crawls. It usually resolves itself; if it does not, feel free to contact Siteimprove Technical Support. Read more about Maximum simultaneous crawls.

If your site has been set up with a custom crawl schedule, this may also cause the “Next crawl scheduled” date to be out of sync.

Instead of relying on the Crawl details widget, use Settings > Crawler Management > Scan History for details on crawl run times.

This can be found here: https://my2.siteimprove.com/Settings/CrawlerManagement/History


Where can I see when a specific page was crawled?

At the top of the Page Report menu, you can see the date and time that specific page was last checked.
date_and_time_on_page_report

Can I recheck my site or pages outside of the normal crawl schedule?

Yes, it is possible to initiate a recheck at the following levels:

  • Single page
  • Multiple pages
  • Group of pages
  • Entire site

Learn more about how to re-crawl your pages, groups, and sites.


Note: Crawl duration varies depending on the number of pages on your site and the number of sites on your account crawling simultaneously.

Can I prevent specific sections of my site from being crawled?

Yes, you can set up site content settings to include and exclude content and to remove links from your site's index.

Which products are impacted by Site Content Settings?

Site Content Settings affect data in your content site. Content sites are used by the QA, Accessibility, SEO, and Policy products. 

Site Content Settings do not affect data in any other Siteimprove products, including Analytics, Ads, or Performance.

Does the Siteimprove crawler consider "noindex" or "nofollow" when deciding what pages to include?

No, our crawler does not consider "noindex" or "nofollow" when determining what content to crawl. 

How do I cancel the crawl of a website?

To cancel or stop a crawl on a website, please contact the Siteimprove technical support team with details of the site account and URL.


What steps can be taken to reduce unnecessary load on the webserver during crawling?

  • Siteimprove uses intelligent algorithms and looks at several parameters to determine when and what to re-check. For example, we use an MD5 key to determine if the page has changed; if the page has not changed there is no need for a recheck.
  • The default delay between HTTP requests is 200 milliseconds. Pauses of up to 20,000 milliseconds (20 seconds) between requests are added automatically if we suspect the crawler is affecting the site's performance.
  • If necessary, pauses between HTTP requests can be added manually by Siteimprove.
  • We automatically stop crawling the site if we get several time-outs or if we notice internal errors from the website server.
  • You can change the Site Content Settings to remove links from being checked or to exclude content from the site.
  • The crawl can be configured to start at a particular time/day by request.
  • Siteimprove can exclude parts of the site from a crawl.
  • By request, we can check the site less frequently than every 5 days.
  • By default, we limit the number of simultaneous crawls running on one account to two at a time.


If you would like any of the above settings changed for a crawl on your website, please contact Siteimprove Support.
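
Two of the mechanisms listed above, MD5-based change detection and automatic pauses between requests, can be sketched as follows. This is an illustration only: the function names and the doubling back-off rule are assumptions, not Siteimprove's actual algorithm. Only the MD5 comparison, the 200 millisecond default, and the 20,000 millisecond upper limit come from this article.

```python
import hashlib

DEFAULT_DELAY_MS = 200   # default pause between HTTP requests
MAX_DELAY_MS = 20_000    # upper limit for automatically added pauses


def page_fingerprint(html):
    """Return the MD5 hex digest of a page's HTML."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def needs_recheck(previous_fingerprint, html):
    """A page only needs re-checking if its content has changed since the last scan."""
    return previous_fingerprint != page_fingerprint(html)


def next_delay_ms(current_delay_ms, server_struggling):
    """Pick the pause before the next request.

    Assumption for illustration: double the pause while the server seems to
    struggle (capped at MAX_DELAY_MS) and return to the default otherwise.
    """
    if server_struggling:
        return min(max(current_delay_ms, DEFAULT_DELAY_MS) * 2, MAX_DELAY_MS)
    return DEFAULT_DELAY_MS


old = page_fingerprint("<h1>Welcome</h1>")
print(needs_recheck(old, "<h1>Welcome</h1>"))        # False
print(needs_recheck(old, "<h1>Welcome back</h1>"))   # True
print(next_delay_ms(200, server_struggling=True))    # 400
print(next_delay_ms(16_000, server_struggling=True)) # 20000
```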


Are all checks performed during the crawl?

No. Many checks are performed after the crawl is complete. The image below can be used as a rough guide to illustrate checks that will typically continue after the crawl has ended.
Crawl_and_check_sequence

What are some typical reasons for a problem with a website crawl?

For more information, see the article "Typical reasons for crawl problems".


Where can I find the IP address and User-agent string of the crawler?

The crawler IP addresses and User-agent strings can be found in the article "What IP addresses and user agents are used by Siteimprove?"
