Storing XML-sitemaps used as index-URLs
Modified on: Mon, 22 Feb, 2021 at 4:45 PM
What is an Index-URL?
Sites that are added to an account need to be given a starting point to be crawled. The results of the crawl are used for the content products (e.g. Quality Assurance, Accessibility, SEO Basic). The starting point for the crawl is called the index-URL. This is the page from which the crawler follows links to further pages. By doing this, the crawler builds the page index for a site.
The index-URL for most of our customers’ sites is the homepage of the website that should be crawled. For some sites, it makes more sense to use an XML-file, such as an XML-sitemap, as the index-URL. This allows for easier indexing of the website's pages.
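To make the idea of an index-URL concrete, here is a minimal sketch of how a crawler could build a page index by following links outward from a starting point. It is only an illustration, not the Siteimprove crawler: the function name `build_page_index`, the in-memory `SITE_LINKS` map, and the example.com URLs are all assumptions made up for this example, and no real HTTP requests are made.

```python
from collections import deque


def build_page_index(index_url, links):
    """Breadth-first walk over a link graph, starting at index_url.

    `links` is a hypothetical stand-in for fetching a page and
    extracting its links: it maps each URL to the URLs it links to.
    Returns the discovered URLs in the order they were found.
    """
    page_index = [index_url]
    visited = {index_url}
    queue = deque([index_url])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in visited:  # skip pages already in the index
                visited.add(target)
                page_index.append(target)
                queue.append(target)
    return page_index


# Hypothetical site structure, used only for illustration.
SITE_LINKS = {
    "https://example.com/": [
        "https://example.com/about",
        "https://example.com/products",
    ],
    "https://example.com/about": ["https://example.com/"],
    "https://example.com/products": ["https://example.com/products/widget"],
}

page_index = build_page_index("https://example.com/", SITE_LINKS)
```

In this sketch a page that no other page links to would never be reached from the homepage, which is one reason an XML-sitemap listing every URL can be a more convenient starting point.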
How has Siteimprove treated XML-files used as index-URLs so far?
The old Siteimprove crawler, which was phased out in 2020, stored XML-files used as index-URLs as “pages”, treating them the same way as HTML-pages. This led to some misleading information in the Siteimprove products, such as accessibility issues being reported on an XML-file that had been checked as if it were an HTML-page.
Until now, sites on the new crawler have not stored these XML-files; they were discarded once the crawl was complete. This avoided the misleading information, because no accessibility or SEO checks were run on the XML-files.
What is changing?
The Siteimprove crawler will again store XML-files used as index-URLs, but not as “pages”. Instead, we will store them as resources, which means they are not checked for HTML-page issues in Accessibility and SEO. This will improve the data shown in the Siteimprove products.
Why are XML-files used as index-URLs now stored again?
By storing XML-files used as index-URLs as resources, we will meet three needs our customers have voiced:
Running link checks on the XML-files
Crawling nested XML-sitemaps
Allowing PDFs added to XML-sitemaps to be crawled and checked for accessibility issues
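The three needs above all depend on reading the entries inside the XML-file. The following is a rough sketch of how entries in a sitemap could be sorted into nested sitemaps, PDFs, and ordinary pages. It assumes the files follow the public sitemaps.org format and uses Python's standard `xml.etree.ElementTree`; the function name `classify_sitemap`, the sample XML, and the ".pdf" file-extension check are assumptions for illustration and do not describe how Siteimprove implements this.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def classify_sitemap(xml_text):
    """Sort the <loc> entries of a sitemap into three groups."""
    root = ET.fromstring(xml_text)
    result = {"nested_sitemaps": [], "pdfs": [], "pages": []}

    # <sitemap><loc> entries point at further (nested) sitemaps.
    for loc in root.findall("sm:sitemap/sm:loc", NS):
        result["nested_sitemaps"].append(loc.text.strip())

    # <url><loc> entries point at content. Treating a ".pdf" suffix
    # as a PDF is a simplification made for this sketch.
    for loc in root.findall("sm:url/sm:loc", NS):
        url = loc.text.strip()
        path = url.split("?", 1)[0].lower()
        group = "pdfs" if path.endswith(".pdf") else "pages"
        result[group].append(url)

    return result


# Hypothetical sample files, used only for illustration.
SITEMAP_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-news.xml</loc></sitemap>
</sitemapindex>"""

URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/files/report.pdf</loc></url>
</urlset>"""

index_entries = classify_sitemap(SITEMAP_INDEX)
content_entries = classify_sitemap(URLSET)
```

A crawler working along these lines would fetch each URL in `nested_sitemaps` and classify it in turn, while every listed URL could also be link-checked regardless of its group.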
How will this affect Siteimprove customers?
This change will only affect sites that use XML-files as index-URLs. If your sites use an HTML-page as the index-URL, nothing will change for you.
If you are using an XML-file as the index-URL for a site, you may notice:
An increase in broken links shown in Quality Assurance.
An increase in the number of pages if your XML-sitemap contains nested sitemaps.
An increase in PDFs being checked if they were linked from the XML-sitemap.
When will this change happen?
The new way of storing XML-files used as index-URLs, instead of discarding them, will be in place from 22nd February 2021.
EDIT: This change is now in place. The impact described above should be visible on a site from its first crawl after 22nd February 2021.