Aliases and exclusions: How to add and remove content from a crawl
Aliases and exclusions let you specify which domains, folders, or pages are included in or excluded from your website crawl. Excluding content as described below does not affect the data shown in Siteimprove Analytics. For that, see "How to exclude traffic from Analytics".
Note: Only Administrators or Account owners can add/edit aliases and exclusions.
Changes related to aliases and exclusions can be configured under Settings > Content > Crawl settings.
- What is an exclusion?
- How do I add an exclusion on my site?
- What is an alias?
- How do I add an alias on my website?
- Is it possible to use regular expressions for exclusions and aliases?
What is an exclusion?
An exclusion specifies, by URL match, which pages should not be crawled. For example, an exclusion of /archive/ tells the crawler to skip any page whose URL contains "/archive/".
Matching pages will not be checked for broken links, misspellings, accessibility or SEO issues. They will not be included in the site Inventory.
Example reasons to add an exclusion:
- The URLs (pages/links) should not be checked, e.g. an archive.
- Your website has duplicate versions of pages that are already being checked, e.g. URLs containing ?sort=ascen.
- Your website contains anchor links, such as domain.com/page1/#contact, domain.com/page2/#contact and domain.com/page3/#contact, that are not real pages but are seen as duplicates. In this case the exclusion would be /#contact.
- You have a large number of links leading to URLs with the same pattern (for example, different intranet pages) that our crawler cannot access and that are therefore reported as broken links (403 Forbidden).
Note: When setting up exclusions only a partial match on the link is needed. A match of "/archive/" will apply to all links and pages containing "/archive/".
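The partial-match rule in the note above can be pictured as a plain substring test. The following is only a minimal sketch under that assumption: the function name `is_excluded` and the sample URLs are hypothetical, and the crawler's actual matching (for example, its case sensitivity) is not described in this article.

```python
from __future__ import annotations


def is_excluded(url: str, exclusions: list[str]) -> bool:
    """Return True if any exclusion string occurs anywhere in the URL.

    Assumption: a simple, case-sensitive substring match.
    """
    return any(pattern in url for pattern in exclusions)


# Exclusion matches taken from the examples in this article.
exclusions = ["/archive/", "?sort=ascen", "/#contact"]

# Hypothetical URLs, used only for illustration.
urls = [
    "https://domain.com/archive/2019/report",
    "https://domain.com/products?sort=ascen",
    "https://domain.com/page1/#contact",
    "https://domain.com/archived-news",
]

for url in urls:
    status = "skipped" if is_excluded(url, exclusions) else "crawled"
    print(f"{url} -> {status}")
```

In this sketch the last URL is still crawled, because "/archived-news" does not contain the full string "/archive/" with its trailing slash.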
If you exclude a page, you also exclude the links on that page, unless the crawler can reach those links via a different page. For example, in a site structure where each letter represents a page and page C links to pages E, F, and G, excluding page C means that the crawler will never find E, F, and G unless they are also linked from another page.
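The effect of excluding a page on the pages behind it can be sketched as a small breadth-first crawl. The link structure below is an assumption made for illustration (A links to B and C, B links to D, and C links to E, F, and G); the `crawl` function is a hypothetical model and not the actual crawler.

```python
from collections import deque

# Assumed link structure: each key is a page, each value lists the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E", "F", "G"],
    "D": [],
    "E": [],
    "F": [],
    "G": [],
}


def crawl(start, links, excluded):
    """Return the set of pages reachable from `start` without visiting excluded pages."""
    seen = set()
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in seen or page in excluded:
            continue  # excluded pages are never opened, so their links are never followed
        seen.add(page)
        queue.extend(links.get(page, []))
    return seen


print(sorted(crawl("A", LINKS, excluded={"C"})))  # ['A', 'B', 'D']
```

If page D also linked to E, then E would be found again even with C excluded, which corresponds to the "linked from another page" case.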
How do I add an exclusion on my site?
- In the left-hand menu bar, select Settings > Content > Crawl Settings.
- Select the site for which you would like to add the exclusion.
- Click Exclude.
- Type in the URL of the exclusion match and click "Create exclusion".
- These setting changes will take effect after your next website crawl.
What is an alias?
An alias helps our crawler better determine what content is considered "internal" or "external" to your website using a URL match.
Note: An alias will not override exclusion settings.
For example, an alias can be used to specify whether pages on a subdomain should be included in your website crawl results (internal - will be checked) or factored out (external - will not be checked).
Reasons to add an alias include:
- You have recently become responsible for a new subdomain (e.g. news.example.com) of your website domain (e.g. www.example.com) and you'd like it to be checked as part of the original site.
- You want to stop a section (e.g. /calendar/) from being crawled, but you'd still like any links from your main site to that section to be reported if they are broken.
Internal content
An internal page is considered a part of your site and will be checked for broken links, misspellings, accessibility issues, etc. Content is treated as internal unless you select the "Crawl as external content" option when adding an alias.
Note: A link to the aliased domain must exist on the website for our crawler to index it. If no such link is available, contact Technical Support, who can add an 'extra index URL' to achieve the same purpose. For example, if you want myothersite.demosite.com to be considered part of your site demosite.com, then in addition to adding an alias you will need a link to myothersite.demosite.com on at least one page of demosite.com.
External content
External content is not considered part of your site and will not be checked for broken links, misspellings, accessibility issues, etc. Content is treated as external if you select the "Crawl as external content" option when adding an alias.
If www.demosite.com/calendar/ is added as an alias with "Crawl as external content" selected, any URL containing www.demosite.com/calendar/ will still be listed in the Inventory under Links (not pages). However, the content on the pages at those URLs will not be evaluated.
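How exclusions, internal aliases, and external aliases interact can be sketched as a small classification function. This is only an illustrative model: the `Alias` class, the `classify` function, the "first matching alias wins" rule, and the sample URLs are all assumptions. The only behavior taken from this article is that exclusions are not overridden by aliases and that matching is partial.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Alias:
    match: str              # domain or URL match for the alias
    external: bool = False  # True models the "Crawl as external content" option


def classify(url: str, site_domain: str, exclusions: list[str], aliases: list[Alias]) -> str:
    """Classify a URL as 'excluded', 'internal', or 'external' (hypothetical model)."""
    # An alias does not override an exclusion, so exclusions are checked first.
    if any(exclusion in url for exclusion in exclusions):
        return "excluded"
    # Assumption: the first alias whose match string occurs in the URL decides.
    for alias in aliases:
        if alias.match in url:
            return "external" if alias.external else "internal"
    # Without a matching alias, only URLs on the site's own domain are internal.
    return "internal" if site_domain in url else "external"


aliases = [
    Alias("www.demosite.com/calendar/", external=True),
    Alias("myothersite.demosite.com"),
]
exclusions = ["/archive/"]

for url in [
    "https://www.demosite.com/about",
    "https://www.demosite.com/calendar/2024/may",
    "https://myothersite.demosite.com/news",
    "https://myothersite.demosite.com/archive/old",
]:
    print(url, "->", classify(url, "www.demosite.com", exclusions, aliases))
```

In this model, a calendar URL is still recognised as a link (external) while its content is not evaluated, and an archive URL on the aliased subdomain stays excluded despite the internal alias.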
How do I add an alias on my website?
Note: When setting up an alias only a partial match on the link is needed. A match of "/calendar/" will apply to all links and pages containing "/calendar/".
- Select Settings > Content > Crawl Settings.
- Select the site for which you would like to add the Alias.
- Add the domain or URL match for the Alias you are adding.
- Select "Crawl as external content" if you are creating an external Alias. Do not select this option for an internal Alias.
- Click on "Create alias".
- These setting changes will take effect after your next website crawl.
Note: If you are setting up a domain alias, only a domain name is required. Typing in "example.com" automatically ensures that all subdomains are included; i.e. example.com, news.example.com, and any other subdomains that you may have. Conversely, if you identify a subdomain by typing in the alias news.example.com, only this subdomain will be included.
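The difference between a domain alias and a subdomain alias described in the note above can be sketched as follows. The function `matches_domain_alias` and the host shop.example.com are hypothetical, and the suffix comparison on the hostname is an assumption chosen to reproduce the examples in the note; the crawler may implement the match differently.

```python
from urllib.parse import urlsplit


def matches_domain_alias(url: str, alias: str) -> bool:
    """Return True if the URL's host is the alias domain or one of its subdomains.

    Assumption: the alias is compared against the hostname only, ignoring case.
    """
    host = (urlsplit(url).hostname or "").lower()
    alias = alias.lower()
    return host == alias or host.endswith("." + alias)


urls = [
    "https://example.com/",
    "https://news.example.com/article",
    "https://shop.example.com/cart",  # hypothetical subdomain
]

for alias in ["example.com", "news.example.com"]:
    covered = [url for url in urls if matches_domain_alias(url, alias)]
    print(alias, "covers", covered)
```

Under these assumptions the alias example.com covers all three URLs, while news.example.com covers only the URL on that subdomain.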
Is it possible to use regular expressions for exclusions and aliases?
Yes. Please contact our Technical Support team, who will help you set up exclusions/aliases with regular expressions.