
Why is my site not crawling? Understanding crawl errors

Modified on: Mon, 23 Jun, 2025 at 9:56 PM

At Siteimprove, we help you keep your websites healthy by regularly crawling them and checking for issues. This ensures your content is always up to date and optimized. However, sometimes a site crawl fails, and no pages are found.

This can happen for several reasons—maybe the site no longer exists, your server is blocking our crawler, or the site settings are preventing access.

To help you quickly understand and resolve these issues, we’ve listed the most common crawl errors below with simple explanations and what you can do about them.

Common Crawl Errors Explained

  1. Final URL is not internal

    • What it means: The starting URL (index URL) redirects to a URL that doesn’t match your site’s configured internal content rules.
    • Example: Your crawl might expect to start at http://www.example.com/ but end up at https://example.com:443/about-us.
    • What to do: Make sure the start URL matches your internal content and doesn’t redirect elsewhere. The internal domain is defined by the index URL together with any include-content rules in your site content settings. Everything up to and including the last slash of the index URL defines the site’s internal content, and both http and https are treated as internal based on that pattern. In the example above, the internal content is any URL beginning with http:// or https:// followed by www.example.com/. https://example.com:443/about-us falls outside that pattern, both because www. is missing and because of the explicit :443 port. The sketch below illustrates this matching rule.
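      For illustration, here is a minimal Python sketch of the matching rule described above. It mirrors the documented behavior, not Siteimprove’s actual implementation:

      # Illustrative sketch only -- not Siteimprove's actual matching logic.
      from urllib.parse import urlparse

      def internal_prefix(index_url):
          # Everything up to and including the last slash of the index URL's path.
          parsed = urlparse(index_url)
          path = parsed.path[: parsed.path.rfind("/") + 1] or "/"
          return parsed.netloc + path  # scheme is ignored: http and https both count

      def is_internal(url, index_url):
          parsed = urlparse(url)
          if parsed.scheme not in ("http", "https"):
              return False
          return (parsed.netloc + parsed.path).startswith(internal_prefix(index_url))

      print(is_internal("https://example.com:443/about-us", "http://www.example.com/"))  # False
      print(is_internal("https://www.example.com/about-us", "http://www.example.com/"))  # True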
  2. Request URL is external

    • What it means: The index URL is blocked by an “exclude-content” rule in your site settings.
    • Example: Your crawl might expect to start at http://www.example.com/about but the site content settings contain a rule to exclude content containing /about.
    • What to do: Check your site’s exclude content rules and adjust them if needed. Remove the rule that turns the index URL into an external page.
  3. 404 Not Found

    • What it means: Your web server has responded with an HTTP status of 404, which indicates the requested page isn’t available.
    • What to do: Double-check the URL. If the page was removed, either delete the site or change the index URL to crawl something else. When crawling an entirely different domain, Siteimprove recommends creating a new site and deleting the old site instead of changing the old site’s index URL. You can verify the status code yourself, as in the sketch below.
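      As a quick sanity check, you can confirm the status code yourself. A minimal sketch using only Python’s standard library (your browser’s developer tools or any HTTP client works just as well):

      # Minimal status-code check using the Python standard library.
      import urllib.error
      import urllib.request

      url = "http://www.example.com/"  # replace with your index URL
      try:
          with urllib.request.urlopen(url) as response:
              print(response.status)  # 200 means the page is reachable
      except urllib.error.HTTPError as e:
          print(e.code)  # e.g. 404 if the page was removed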
  4. Request URL is excluded

    • What it means: The index URL is blocked by a remove link rule in your site settings.
    • Example: Your crawl might expect to start at http://www.example.com/about, but the site content settings contain a rule to remove links containing /about, .com, or example.
    • What to do: Check your site’s remove link rules and adjust them if needed. Delete the rule that removes the index URL from your site.
  5. 403 Forbidden

    • What it means: Our request is being blocked. This could be because the site is behind a login or your web server’s application firewall is blocking the request.
    • What to do: Contact your IT department or web agency to understand why the Siteimprove crawler is not allowed in, and check if they can grant the Siteimprove crawler access to the index URL. View our help center for the IP addresses and user agents used by Siteimprove. The sketch below shows how to test whether the block is user-agent based.
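      To see whether the block depends on the user agent, you can compare a request sent with a browser-like user agent against one sent with a crawler user agent. This is a sketch; the crawler string below is a placeholder, so substitute the actual Siteimprove user agents listed in the help center:

      # Compare responses for different user agents (sketch; the crawler
      # user-agent string is a placeholder -- see the help center for real ones).
      import urllib.error
      import urllib.request

      url = "http://www.example.com/"  # replace with your index URL
      for ua in ("Mozilla/5.0", "SiteimproveBot (placeholder)"):
          req = urllib.request.Request(url, headers={"User-Agent": ua})
          try:
              with urllib.request.urlopen(req) as response:
                  print(ua, "->", response.status)
          except urllib.error.HTTPError as e:
              print(ua, "->", e.code)  # 403 here suggests user-agent-based blocking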
  6. No handlers able to process resource

    • What it means: The URL points to a file that we can’t crawl (like .json).
    • What to do: Use a standard web page (HTML/XML) as your start URL.
  7. HTTP 429 Too Many Requests

    • What it means: Your server is rate-limiting requests: it does not want Siteimprove to send requests as often as it does.
    • What to do: Check with your hosting provider why Siteimprove is blocked with a 429 error when requesting the index URL. Check whether the robots.txt file of your domain specifies a crawl-delay (see the sketch below), and check with Siteimprove Support whether that crawl delay is respected.
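      You can read the crawl-delay directive yourself with Python’s built-in robots.txt parser. A minimal sketch; the user-agent token is an assumption, taken from the robots.txt example in error 20 below:

      # Read the crawl-delay for a given user agent from robots.txt (sketch).
      import urllib.robotparser

      rp = urllib.robotparser.RobotFileParser()
      rp.set_url("http://www.example.com/robots.txt")  # replace with your domain
      rp.read()
      # Returns the delay in seconds, or None if no crawl-delay is set.
      print(rp.crawl_delay("SiteimproveBot"))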
  8. HTTP 401 Unauthorized

    • What it means: The site requires a login or is blocking our crawler.
    • What to do: Test the index URL in an incognito browser window to see where it lands. If a login is required, ensure that authentication for the site crawl is configured. If it is already configured, ensure your login credentials are up to date.
  9. Final URL is excluded

    • What it means: The index URL redirects to a URL that is blocked by a remove-link rule in your site settings.
    • Example: Your crawl might expect to start at http://www.example.com/about. This URL gets redirected to http://www.example.com/index.html. However, the site content settings contain a remove-link rule that removes links containing /index.
    • What to do: Check your site’s remove link rules and adjust them if needed. Delete the rule that removes the final URL from your site.
  10. Redirect chain contains excluded URL

    • What it means: The redirect path includes a blocked page. The index URL redirects through a number of URLs before reaching the final URL, and one or more URLs in that redirect chain are blocked by a remove link rule in your site settings.
    • What to do: Check your site’s remove link rules and adjust them if needed. Delete the rule that removes a URL in the redirect chain from your site. The sketch below shows how to inspect a redirect chain yourself.
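      To see every URL in a redirect chain, you can follow the redirects yourself. A minimal sketch using the third-party requests library (pip install requests), which records each hop in response.history:

      # Print every hop in a redirect chain (sketch; requires `pip install requests`).
      import requests

      response = requests.get("http://www.example.com/about", allow_redirects=True)
      for hop in response.history:
          print(hop.status_code, hop.url)        # each intermediate URL
      print(response.status_code, response.url)  # the final URL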
  11. HTTP 400 Bad Request

    • What it means: The server couldn’t understand the request.
    • What to do: Check the URL for typos or formatting issues.
  12. HTTP 410 Gone

    • What it means: The page has been permanently removed.
    • What to do: Remove or update the site entry.
  13. Authentication failure

    • What it means: We couldn’t log in to your site.
    • What to do: Make sure that the login credentials are correct and consider whether the method for logging in to the site has changed since it was first configured. If the credentials have changed, please update these in Site summary in Manage sites. If the way of logging into the site has changed, please contact Technical Support to have them revisit these settings. You may need to take one or both actions.
  14. Proxy error

    • What it means: A network or proxy issue blocked the crawl.
    • What to do: Check your proxy settings or contact Technical Support.
  15. Rendering error

    • What it means: The page was too complex to load.
    • What to do: Simplify the page or contact Technical support for help.
  16. Connection error

    • What it means: We couldn’t connect to the site.
    • What to do: Check your server or network settings, or contact Technical Support for help.
  17. Max tries reached (Other)

    • What it means: We tried multiple times but couldn’t reach the site.
    • What to do: Log a ticket with Technical Support to have this issue investigated further.
  18. 500 Internal Server Error

    • What it means: The server had an unexpected issue.
    • What to do: Check with your hosting provider.
  19. HTTP 503 Service Unavailable

    • What it means: The server is down or overloaded.
    • What to do: Try again later or contact your provider.
  20. Index-URL blocked by robots.txt

    • What it means: The index URL you would like us to crawl is blocked by the domain's robots.txt file. One or more rules in the robots.txt file disallow the Siteimprove bot from accessing this site’s index URL.
    • What to do: Avoid disallowing the Siteimprove bot in your robots.txt. First, verify this in your robots.txt file: type your domain into your browser and append /robots.txt (example: www.example.com/robots.txt). Siteimprove interprets robots.txt files the same way Google does. Read more about how the robots.txt gets interpreted by a bot. A sketch for testing a robots.txt file programmatically follows the example below.

      Your robots.txt may restrict all bots from crawling the site. If that is the case, your robots.txt file would look like this:
      User-agent: * 
      Disallow: / 
      To ensure Siteimprove can crawl the pages on the domain, but still keep other bots out, you could change the robots.txt file to the following:
      User-agent: * 
      Disallow: / 
      
      User-agent: SiteimproveBot 
      Disallow:
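
      You can test a robots.txt file programmatically with Python’s built-in parser, checking whether a given user agent may fetch the index URL. A minimal sketch, assuming the robots.txt shown above:

      # Check whether a user agent may fetch the index URL (sketch).
      import urllib.robotparser

      rp = urllib.robotparser.RobotFileParser()
      rp.set_url("http://www.example.com/robots.txt")  # replace with your domain
      rp.read()
      print(rp.can_fetch("SiteimproveBot", "http://www.example.com/"))  # True with the file above
      print(rp.can_fetch("SomeOtherBot", "http://www.example.com/"))    # False with the file above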
  21. Unidentified error

    • What it means: We couldn’t crawl the site and don’t know why.
    • What to do: Log a ticket with Technical Support to have this issue investigated further.
