Site Content Settings
- What does Siteimprove check?
- What does Siteimprove not check?
- What does Siteimprove crawl?
- Define what is part of the site
- General information on adding URLs in site content settings
- Testing your URL-matches
- URL-matches need no placeholders to match URLs.
- When do changes on the site crawl settings take effect?
- Best practices for adding URLs in site content settings
- Use cases for including content
- Use cases for excluding content
- Use cases for removing links
- Note on old “global exclusions”
Where do I find the Site Content Settings?
You can find the Site Content Settings section under Settings -> Crawler Management -> Site Content Settings
What does Siteimprove check?
Siteimprove checks all content that is crawled and defined as part of your site. The following content is checked for issues within the Siteimprove products:
- Content on internal HTML pages. Internal HTML pages are pages that are inside your site as defined in the “Include content” tab in the site content settings.
- Content on internal PDFs. Internal PDFs are PDFs that are inside your site as defined in the “Include content” tab in the site content settings.
- All links within a site. Both internal and external links are checked.
What does Siteimprove not check?
Siteimprove does not check content that is outside your site, including:
- Content on external HTML pages. External HTML pages are pages that are outside of your site. Pages are considered to be outside of your site if they are not included in the “Include content” tab or if they are excluded, either via the “Exclude content” tab or the “Remove links” tab.
- Content in external PDFs. External PDFs are PDFs located outside of your site. PDFs are "outside of your site" if they are not in your “Include content” tab or if they are excluded via the “Exclude content” tab or the “Remove links” tab.
- Links found on external HTML pages or in external PDFs.
What does Siteimprove crawl?
Siteimprove crawls by following links (URLs) from one HTML page to another. Siteimprove starts a crawl at the index URL that is set when the site is added.
The crawler sees the links on the first page of your site, i.e. the index URL.
Links from that page either lead to internal pages and PDFs that should be checked or they lead to external pages and PDFs that are not your responsibility and should not be checked.
The crawler will then either:
- Follow all links that are defined as internal content inside the site.
- OR not follow any links that are defined as external, as those go to content outside your site.
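The crawl behaviour described above can be sketched as a breadth-first traversal that records every link it sees but only follows the internal ones. This is a minimal illustration, not Siteimprove's actual crawler: the `crawl` function, the `LINKS` graph, the example.com URLs, and the prefix-based `is_internal` check are all assumptions made up for this sketch.

```python
from collections import deque

def crawl(index_url, links_on_page, is_internal):
    """Breadth-first crawl that follows only internal links.

    links_on_page: dict mapping a page URL to the URLs linked from it.
    is_internal:   function deciding whether a URL is part of the site.
    """
    crawled = []          # internal pages and PDFs, in visiting order
    seen_links = set()    # every link encountered, internal or external
    queue = deque([index_url])
    queued = {index_url}
    while queue:
        page = queue.popleft()
        crawled.append(page)
        for link in links_on_page.get(page, []):
            seen_links.add(link)  # the link itself is still recorded
            if is_internal(link) and link not in queued:
                queued.add(link)
                queue.append(link)  # only internal links are followed
    return crawled, seen_links

# Hypothetical link graph, for illustration only.
LINKS = {
    "www.example.com": ["www.example.com/about", "other.example.org/partner"],
    "www.example.com/about": ["www.example.com", "www.example.com/docs/guide.pdf"],
    "other.example.org/partner": ["other.example.org/deep"],
}

pages, links = crawl(
    "www.example.com", LINKS, lambda url: url.startswith("www.example.com")
)
```

In this sketch the external link other.example.org/partner is recorded as a link, but the page behind it is never opened, so the links on that external page are never seen.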
Illustration of your internal content site within the external content in the world wide web:
Define what is part of the site
Site content settings can be used to define how the crawler should behave. This means identifying which HTML pages, PDFs, and other documents or assets should end up as part of your internal site content and what is external content. In this process, we recommend that you:
- Include content in the site to turn external content into internal content. This process allows you to check more PDFs or HTML pages on the site.
- Exclude content from the site to turn internal content into external content. In this process, you'll be telling the system which PDFs and HTML pages on the site it should not check.
- Remove links to tell Siteimprove to avoid specific links. A link matching a URL in the “removed links” section will be avoided during the crawl and no HTTP status check will be done for it. This means that, if the link is broken but also removed via “Remove links,” Siteimprove will not report it as broken.
- Removing internal links from the site will also lead to them not being crawled, resulting in a reduced overall page count. This is because removing those links prevents the crawler from finding the HTML pages, PDFs, and other content on the other side of those links.
- Removing external links from the site will generally not cause a reduced page count. It may, however, result in fewer indexed PDFs getting in the Documents inventory in Quality Assurance. This can lead to fewer broken links found on the site.
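The three rule types can be pictured as a small classification function. The sketch below is an assumption-laden illustration: it treats a URL-match as plain substring containment, and it assumes that removed links take precedence over exclusions, which in turn take precedence over inclusions. The `classify` function, the rule lists, and the tracker.example.net host are hypothetical and do not describe Siteimprove's internal implementation.

```python
def classify(url, include, exclude, remove):
    """Return 'removed', 'external', or 'internal' for a link URL.

    Assumed precedence (not confirmed by the article):
    remove links > exclude content > include content.
    """
    def matches(rules):
        # A URL-match is assumed to match when it appears anywhere in the URL.
        return any(rule in url for rule in rules)

    if matches(remove):
        return "removed"   # link is avoided; no HTTP status check
    if matches(exclude):
        return "external"  # content is not checked
    if matches(include):
        return "internal"  # content is crawled and checked
    return "external"      # anything not included is outside the site

# Hypothetical rules for illustration.
include = ["www.mydomain.com", "blog.mydomain.com"]
exclude = ["www.mydomain.com/archive"]
remove = ["tracker.example.net"]

results = {
    url: classify(url, include, exclude, remove)
    for url in [
        "www.mydomain.com/products",
        "www.mydomain.com/archive/2019",
        "tracker.example.net/pixel.gif",
        "othersite.org/page",
    ]
}
```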
Adding URLs in site content settings
Testing your URL-matches
To see which links will be affected by a URL-match in any of the site content settings, search for the URL-match in the link index in Quality Assurance > Inventory > Links.
To see which pages will be affected by a URL-match in any of the site content settings, search for the URL-match in the page index in Quality Assurance > Inventory > Pages.
Removing links or excluding content for specific URL matches can affect pages and links living deeper within the site.
The following example shows how this is possible, looking at how the crawler would find a certain link.
Index-URL: Page A (domain.com)
-> Link 1 seen on Page A (domain.com/product/feature1)
-> Page B (domain.com/product/feature1)
-> Link 2 seen on Page B (domain.com/legal/legal-info-about-feature1)
-> Page C (domain.com/legal/legal-info-about-feature1)
Configuring the exclude content rule
Here you set up a rule to exclude content with the URL-match domain.com/product/feature1 because you don’t want that page to be crawled and you want it removed from your index. In the next crawl, we will not find this page (Page B) or the link on Page B that leads to Page C. Page C will therefore also be removed.
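The knock-on effect in the Page A, B, C example can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions: every link is treated as internal unless it contains an excluded URL-match, and the `reachable_pages` helper is a hypothetical name introduced for illustration.

```python
def reachable_pages(index_url, links_on_page, excluded_matches=()):
    """Pages found by following links from index_url, skipping excluded URLs."""
    found, stack = [], [index_url]
    while stack:
        page = stack.pop()
        if page in found:
            continue
        found.append(page)
        for link in links_on_page.get(page, []):
            # A link containing an excluded URL-match is not followed.
            if not any(match in link for match in excluded_matches):
                stack.append(link)
    return found

# The link structure from the example above.
LINKS = {
    "domain.com": ["domain.com/product/feature1"],  # Page A, Link 1
    "domain.com/product/feature1": [
        "domain.com/legal/legal-info-about-feature1"  # Page B, Link 2
    ],
    "domain.com/legal/legal-info-about-feature1": [],  # Page C
}

before = reachable_pages("domain.com", LINKS)
after = reachable_pages("domain.com", LINKS, ["domain.com/product/feature1"])
```

Here `before` contains Pages A, B, and C, while `after` contains only Page A: once Page B is excluded, the only link leading to Page C is never seen.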
URL-matches do not need placeholders to match URLs.
- (/news will match the URL yourdomain.com/news/article1)
Excluding or including content means that you are changing existing links on your site to either external or internal. It does not mean that you are creating an additional index URL or Single Page Check.
- Using the include content functionality by adding a rule to include a full URL will not guarantee that the URL will be crawled. If there is an external link on the website that matches the URL in the inclusion rule, Siteimprove will see the external link as an internal link.
If you would like to check a specific page, please go to Quality Assurance > Summary > Single Page Check.
- Example (successful include content rule): You have 100 links on your site with index-URL www.yourdomain.com. 50 of them are leading to external pages, a subdomain of your domain (blog.yourdomain.com). You want the subdomain to be internal, so you set up an include content rule for blog.yourdomain.com. The site will then see the 50 links to blog.yourdomain.com as internal and the crawler will follow those links and index them as internal pages within your site.
- Example (unsuccessful include content rule setup): Using the above example, say you want another subdomain to be included, i.e. news.yourdomain.com, and you set up an include content rule for it. None of the 100 links on your site link to news.yourdomain.com. The next crawl would follow such links if they were located on pages on www.yourdomain.com or on other internal pages that you have added as included content, like blog.yourdomain.com. Since there is no link to news.yourdomain.com on the site, the crawler cannot follow it.
When do changes on the site crawl settings take effect?
Rules to include content will be applied immediately, but newly included pages will only be found during a new crawl
- Without a new crawl, Siteimprove cannot follow links to the newly included content. Start a new crawl to find newly included content. If new content is found using the include content rule, you can expect the page count to increase.
Rules to exclude content will be applied shortly
- It may take up to a couple of minutes for the exclusion to take effect, depending on the size of your site.
- Pages matching your excluded URLs will be removed from the page index (QA > Inventory > Pages) immediately.
- Links matching the excluded URLs will remain in the links index (QA > Inventory > Links). The status of the matching links will change from internal to external when the next crawl has run.
Rules to remove links will be applied shortly
- Link removal may take up to a couple of minutes, depending on the size of the site.
- Pages matching your excluded URLs will be removed from the page index (QA > Inventory > Pages) immediately.
- Links matching the excluded URLs will be removed from the link index (QA > Inventory > Links) immediately.
Best practices for adding URLs in site content settings
Use the full URL where possible. Don’t use just the subdirectory.
The longer the URL-match, the more specific your target will be. We have seen short URL-matches added that led to far more content, i.e. more pages, being included than was intended. An increased page count also means your scans will take longer, and this can lead to increased subscription costs.
- Example: A customer added the URL-match /news as included content, and pages from multiple different domains became part of the site. This happened because every external link containing the URL-match /news was then seen as internal. The crawler crawled not only domainA.com/news but also domainB.com/news and domainC.com/news. Domain B and domain C were not owned by our customer, so the customer's page count increased massively.
- For very specific inclusion or exclusion rules that you might have, you can contact our customer support team, who can add rules using regular expressions.
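To see why a short URL-match can pull in far more than intended, the sketch below compares a short and a full URL-match against a small list of links. It assumes that a URL-match applies to any link that contains it, as the /news example suggests. The link list and the `newly_internal` helper are hypothetical.

```python
# Hypothetical links found on the site, for illustration only.
LINKS_SEEN = [
    "domainA.com/news/launch",
    "domainA.com/about",
    "domainB.com/news/today",
    "domainC.com/news/weekly",
]

def newly_internal(url_match, links):
    """Links that an include rule with this URL-match would treat as internal."""
    return [link for link in links if url_match in link]

broad = newly_internal("/news", LINKS_SEEN)
specific = newly_internal("domainA.com/news", LINKS_SEEN)
```

In this sketch the short match /news captures three links across three domains, while the full match domainA.com/news captures only the one link on the intended domain.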
Use cases for including content
Include content in the site to turn external content into internal content to check more PDFs or HTML pages in the site.
If your index-URL is www.mydomain.com and you also want to include your subdomains in the site:
- You can include the specific subdomains using the URL-match for each full subdomain (https://news.mydomain.com; https://blog.mydomain.com; https://products.mydomain.com)
- Alternatively, you can use a broader URL-match to include all of your subdomains (.mydomain.com)
To ensure that PDFs on a specific location are seen as internal:
- If your index-URL is https://www.mydomain.com but your PDFs live on another domain or subdomain https://assets.mydomain.com/documents/pdfs/ then you will need to ensure a URL-match is set up for the PDFs that you want to see as internal.
Use cases for excluding content
Exclude content from the site to turn internal content into external content to check fewer PDFs and HTML pages in the site.
This is mostly used to reduce the page count of sites in Siteimprove. This might apply if you have 5000 pages in your site but you only want to check 3000 pages. To do this, you find a common URL-match for the remaining 2000 pages to prevent checking them.
Use cases for removing links
Remove specific links to prevent indexing and checking them. A link matching a URL in the “removed links” section will be discarded within the crawl and no HTTP status check will be done. Removing links from the crawl will lead to fewer indexed links and, as a result, to fewer HTML pages and PDFs in the site.
- Mostly used to remove links from the link check to speed up the link check process.
- We recommend removing tracking pixel links and ad links, because they otherwise take up time in the link check process. Popular examples include…
There might also be other ads-links on your website that are not relevant for a link checker to check. It is recommended to exclude links that don’t contain any content and that are not of any use to your users (such as tracking pixel links).
Note on old “global exclusions”
If the site you are working with has existed since before September 2020, then you might find rules to remove links with the note “Global exclusion matched by old crawler added as site exclusion” in your site. “Exclusion” was the term previously used for “removing links”.
These site exclusions have been created to allow for a smooth transition from the old Siteimprove crawler to the new Siteimprove crawler. The old Siteimprove crawler was in use up to September 2020 and since then all customer sites are crawled with the new Siteimprove crawler.
The old crawler had a setting to globally remove links for certain URL-matches on all sites across Siteimprove. These URL-matches were created by Siteimprove before September 2020. With the new crawler, it was decided to discontinue global exclusions and to allow the account owners and site administrators to make their own decisions on which links they want to remove from their site instead.
The global exclusions that matched URLs on a given site were created as site exclusion on that site in order to not disrupt the experience with Siteimprove and to not impact the page count of sites globally.
If your site contains such rules to remove links, we recommend that account owners and administrators check if they should still be used, and delete the rules if they should not be applied any longer.