Crawler Options
The crawl screen contains an expanding set of crawl options, revealed by clicking the cog icon near the crawl controls.
Thread Model. The Single thread model shares one thread between documents and assets. Gentle multithreading uses 1 document thread out of 4 total threads (1/4). Standard multithreading doubles that to 2 document threads out of 8 total (2/8). JavaScript rendering is limited to Gentle. Default is Standard.
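For a concrete picture of that document/asset split, the sketch below uses Python's ThreadPoolExecutor with pool sizes chosen to mirror the published ratios. It is an illustration of the concept only, not InterroBot's actual scheduler, and the assumption that the non-document threads all handle assets is ours.

    # Illustrative only: pool sizes assumed to mirror the 1/4 and 2/8 ratios,
    # with the remaining threads handling assets. Not InterroBot internals.
    from concurrent.futures import ThreadPoolExecutor

    THREAD_MODELS = {
        "single":   {"documents": 1, "assets": 0},  # one shared thread
        "gentle":   {"documents": 1, "assets": 3},  # 1 document thread, 4 total
        "standard": {"documents": 2, "assets": 6},  # 2 document threads, 8 total
    }

    def build_pools(model: str):
        sizes = THREAD_MODELS[model]
        doc_pool = ThreadPoolExecutor(max_workers=sizes["documents"])
        # The Single model reuses the document pool for assets.
        asset_pool = doc_pool if sizes["assets"] == 0 else ThreadPoolExecutor(max_workers=sizes["assets"])
        return doc_pool, asset_pool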
External Pages. GET collects externally linked HTML for search indexing. Ignore and NoRobots skip external page requests; NoRobots generates a database entry for the resource, whereas Ignore does not. Default is GET for Site/Directory and Ignore for Custom Lists.
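The sketch below illustrates how the three policies differ when the crawler meets an external link. The function and record shapes are hypothetical, not InterroBot's internals.

    # Hypothetical handler contrasting GET, Ignore, and NoRobots for external links.
    import requests

    def handle_external(url: str, policy: str, db: list):
        if policy == "ignore":
            return None                                   # no request, no record
        if policy == "norobots":
            db.append({"url": url, "fetched": False})     # record only, no request
            return None
        if policy == "get":
            response = requests.get(url, timeout=15)      # fetch HTML for indexing
            db.append({"url": url, "status": response.status_code})
            return response.text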
Page Assets. GET downloads CSS, JavaScript, and other text content, and collects HTTP headers for everything else. GET+ will additionally download binary content lacking a Content-Length header in order to collect accurate file sizes. Ignore and NoRobots skip asset downloads. Default is GET for Site/Directory and Ignore for Custom Lists.
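As a rough approximation of that policy (the content types and record shapes are assumptions, not InterroBot's exact rules):

    # Sketch: GET text assets, collect headers for binaries, and have GET+ fetch
    # bodies only when Content-Length is missing so the true size can be measured.
    import requests

    TEXT_TYPES = ("text/css", "text/javascript", "application/javascript", "image/svg+xml")

    def fetch_asset(url: str, plus: bool = False):
        head = requests.head(url, timeout=15, allow_redirects=True)
        content_type = head.headers.get("Content-Type", "").split(";")[0].strip()
        if content_type in TEXT_TYPES:
            return {"body": requests.get(url, timeout=15).text}        # GET
        if plus and "Content-Length" not in head.headers:
            body = requests.get(url, timeout=15).content               # GET+
            return {"size": len(body), "headers": dict(head.headers)}
        return {"headers": dict(head.headers)}                         # headers only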
Crawl Delay. Delay in seconds before resuming the crawl after processing a page. A lower value allows for a faster crawl, at the expense of placing more stress on the server. Increasing the delay is necessary if you encounter 429 Too Many Requests response codes, for example. Default is 0.
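A minimal sketch of what a per-page delay looks like in practice, including a simple backoff when the server answers 429 (illustrative only, not InterroBot's retry logic):

    import time
    import requests

    def polite_get(url: str, crawl_delay: float = 0.0) -> requests.Response:
        response = requests.get(url, timeout=15)
        if response.status_code == 429:
            # Honor Retry-After when it is a number of seconds; otherwise back off briefly.
            try:
                wait = float(response.headers.get("Retry-After", ""))
            except ValueError:
                wait = max(crawl_delay, 1.0)
            time.sleep(wait)
            response = requests.get(url, timeout=15)
        time.sleep(crawl_delay)   # pause before the next page is processed
        return response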
Network Timeout. Seconds to wait for a network response before giving up on the request. Increasing the timeout allows for capturing slow responses, but can significantly slow the crawler. Default is 15.
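The trade-off is easiest to see in code; this sketch simply abandons a request once the timeout elapses, a simplification of whatever retry behavior the crawler actually applies.

    import requests

    def fetch_with_timeout(url: str, timeout_seconds: float = 15.0):
        try:
            return requests.get(url, timeout=timeout_seconds)
        except requests.Timeout:
            return None   # give up on this URL so the crawl can move on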
JavaScript Rendering. Beta (Windows/Linux only). Loads pages with JavaScript, providing a more comprehensive index of content at the expense of slower, more resource-hungry crawling. Webpages external to the project URL will continue to use the classic crawler. This option is necessary for websites built on JavaScript frameworks that don't provide server-side rendering. Default is Off.
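The difference between the classic crawler and a rendering crawler can be illustrated with a headless browser. Playwright is used below purely as an example of rendering; it is not necessarily what InterroBot uses internally.

    import requests
    from playwright.sync_api import sync_playwright

    def fetch_raw(url: str) -> str:
        return requests.get(url, timeout=15).text          # HTML exactly as served

    def fetch_rendered(url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")        # let client-side JS run
            html = page.content()                           # DOM after rendering
            browser.close()
            return html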
Download Assets Without Content-Length. Whether to GET media BLOBs (video, audio, and other non-image binaries) served without a Content-Length header. Enabling this allows for accurate resource size data, but can significantly slow the crawler. Default is Off.
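When the header is missing, the only way to learn the size is to download and count the bytes, roughly as below (a sketch, not the exact implementation):

    import requests

    def resource_size(url: str) -> int:
        head = requests.head(url, timeout=15, allow_redirects=True)
        if "Content-Length" in head.headers:
            return int(head.headers["Content-Length"])       # cheap: header only
        size = 0
        with requests.get(url, stream=True, timeout=15) as response:
            for chunk in response.iter_content(chunk_size=65536):
                size += len(chunk)                           # count bytes, discard data
        return size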
HTTP Request Headers. Override options for a handful of authentication and identity HTTP headers. Authentication headers only apply to Site/Directory requests within the project domain. Changing the User-Agent header will not override robots.txt, whose rules are still applied to InterroBot.
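Conceptually, the overrides behave like the sketch below, where the project host, token, and header values are placeholders.

    import requests
    from urllib.parse import urlparse

    PROJECT_HOST = "example.com"                       # placeholder project domain
    OVERRIDES = {
        "User-Agent": "InterroBot",                    # identity header, sent everywhere
        "Authorization": "Bearer YOUR_TOKEN_HERE",     # auth header, project domain only
    }

    def build_headers(url: str) -> dict:
        headers = {"User-Agent": OVERRIDES["User-Agent"]}
        if urlparse(url).hostname == PROJECT_HOST:
            headers["Authorization"] = OVERRIDES["Authorization"]
        return headers

    # requests.get(url, headers=build_headers(url), timeout=15)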
Directory/Path Exclusions. You can treat certain URLs as ignored paths; they will not be downloaded or indexed. Useful for avoiding known time-sink directories. Limit is 10 additional rules.
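An exclusion rule amounts to a path check before a URL is queued, along these lines (the prefix-style syntax and example rules are illustrative):

    from urllib.parse import urlparse

    EXCLUDED_PATHS = ["/calendar/", "/tag/", "/print/"]   # hypothetical rules, max 10

    def is_excluded(url: str) -> bool:
        path = urlparse(url).path
        return any(path.startswith(prefix) for prefix in EXCLUDED_PATHS)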
URL Rewrites. Remove search, marketing, and session arguments from resolved URLs to collapse alike and near-alike pages into a single crawled page. Add your arguments as a comma-separated list.
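The effect of a rewrite rule is to drop the listed query arguments before the URL is stored, roughly as follows (the argument names are examples):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    STRIP_ARGS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "fbclid"}

    def rewrite(url: str) -> str:
        parts = urlsplit(url)
        kept = [(key, value) for key, value in parse_qsl(parts.query, keep_blank_values=True)
                if key not in STRIP_ARGS]
        return urlunsplit(parts._replace(query=urlencode(kept)))

    # rewrite("https://example.com/page?id=7&utm_source=news")
    # -> "https://example.com/page?id=7"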
Follow All. In some cases, following rel="nofollow" links reveals intentionally SEO-devalued pages. In others, nofollow is protecting you from navigation traps (e.g. endless calendars). Default is Off.
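In crawler terms, the option decides whether links marked rel="nofollow" are queued at all. A minimal sketch of that decision, not InterroBot's link extractor:

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collects hrefs, skipping rel="nofollow" links unless follow_all is set."""
        def __init__(self, follow_all: bool = False):
            super().__init__()
            self.follow_all = follow_all
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            attrs = dict(attrs)
            nofollow = "nofollow" in (attrs.get("rel") or "").lower()
            if attrs.get("href") and (self.follow_all or not nofollow):
                self.links.append(attrs["href"])

    parser = LinkCollector(follow_all=False)
    parser.feed('<a href="/archive" rel="nofollow">Archive</a><a href="/about">About</a>')
    # parser.links == ["/about"]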