Crawler Options
The crawl screen contains an expanding set of crawl options, revealed by clicking the cog icon near the crawl controls.
Thread Model. The Single thread model shares one thread between documents and assets. Gentle multithreading uses 1 document thread out of 4 total threads (1/4). Standard multithreading doubles that to 2 document threads out of 8 total (2/8). JavaScript rendering is limited to Gentle. Default is Standard.
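For a concrete picture of that document/asset split, the sketch below uses Python's ThreadPoolExecutor with pool sizes chosen to mirror the published ratios. It is an illustration of the concept only, not InterroBot's actual scheduler, and the assumption that the non-document threads all handle assets is ours.

    # Illustrative only: pool sizes assumed to mirror the 1/4 and 2/8 ratios,
    # with the remaining threads handling assets. Not InterroBot internals.
    from concurrent.futures import ThreadPoolExecutor

    THREAD_MODELS = {
        "single":   {"documents": 1, "assets": 0},  # one shared thread
        "gentle":   {"documents": 1, "assets": 3},  # 1 document thread, 4 total
        "standard": {"documents": 2, "assets": 6},  # 2 document threads, 8 total
    }

    def build_pools(model: str):
        sizes = THREAD_MODELS[model]
        doc_pool = ThreadPoolExecutor(max_workers=sizes["documents"])
        # The Single model reuses the document pool for assets.
        asset_pool = doc_pool if sizes["assets"] == 0 else ThreadPoolExecutor(max_workers=sizes["assets"])
        return doc_pool, asset_pool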
External Pages. GET collects externally linked HTML for search indexing. Ignore and NoRobots skip external page requests; NoRobots generates a database entry for the resource, whereas Ignore does not. Default is GET for Site/Directory and Ignore for Custom Lists.
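The sketch below illustrates how the three policies differ when the crawler meets an external link. The function and record shapes are hypothetical, not InterroBot's internals.

    # Hypothetical handler contrasting GET, Ignore, and NoRobots for external links.
    import requests

    def handle_external(url: str, policy: str, db: list):
        if policy == "ignore":
            return None                                   # no request, no record
        if policy == "norobots":
            db.append({"url": url, "fetched": False})     # record only, no request
            return None
        if policy == "get":
            response = requests.get(url, timeout=15)      # fetch HTML for indexing
            db.append({"url": url, "status": response.status_code})
            return response.text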
Page Assets. GET downloads CSS, JavaScript, and other text content, and collects HTTP headers for everything else. GET+ will additionally download binary content lacking a Content-Length header in order to collect accurate file sizes. Ignore and NoRobots skip asset downloads. Default is GET for Site/Directory and Ignore for Custom Lists.
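As a rough approximation of that policy (the content types and record shapes are assumptions, not InterroBot's exact rules):

    # Sketch: GET text assets, collect headers for binaries, and have GET+ fetch
    # bodies only when Content-Length is missing so the true size can be measured.
    import requests

    TEXT_TYPES = ("text/css", "text/javascript", "application/javascript", "image/svg+xml")

    def fetch_asset(url: str, plus: bool = False):
        head = requests.head(url, timeout=15, allow_redirects=True)
        content_type = head.headers.get("Content-Type", "").split(";")[0].strip()
        if content_type in TEXT_TYPES:
            return {"body": requests.get(url, timeout=15).text}        # GET
        if plus and "Content-Length" not in head.headers:
            body = requests.get(url, timeout=15).content               # GET+
            return {"size": len(body), "headers": dict(head.headers)}
        return {"headers": dict(head.headers)}                         # headers only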
Crawl Delay. Delay in seconds before resuming the crawl after processing a page. A lower value allows for a faster crawl, at the expense of placing more stress on the server. Increasing the delay is necessary if you encounter 429 Too Many Requests response codes, for example. Default is 0.
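A minimal sketch of what a per-page delay looks like in practice, including a simple backoff when the server answers 429 (illustrative only, not InterroBot's retry logic):

    import time
    import requests

    def polite_get(url: str, crawl_delay: float = 0.0) -> requests.Response:
        response = requests.get(url, timeout=15)
        if response.status_code == 429:
            # Honor Retry-After when it is a number of seconds; otherwise back off briefly.
            try:
                wait = float(response.headers.get("Retry-After", ""))
            except ValueError:
                wait = max(crawl_delay, 1.0)
            time.sleep(wait)
            response = requests.get(url, timeout=15)
        time.sleep(crawl_delay)   # pause before the next page is processed
        return response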
Network Timeout. Seconds to wait for a network response before giving up on the request. Increasing the timeout allows for capturing slow responses, but can significantly slow the crawler. Default is 15.
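The trade-off is easiest to see in code; this sketch simply abandons a request once the timeout elapses, a simplification of whatever retry behavior the crawler actually applies.

    import requests

    def fetch_with_timeout(url: str, timeout_seconds: float = 15.0):
        try:
            return requests.get(url, timeout=timeout_seconds)
        except requests.Timeout:
            return None   # give up on this URL so the crawl can move on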
JavaScript Rendering. Beta (Windows/Linux only). Loads pages with JavaScript, providing a more comprehensive index of content at the expense of slower, more resource-hungry crawling. Webpages external to the project URL will continue to use the classic crawler. This option is necessary for websites built on JavaScript frameworks that don't provide server-side rendering. Default is Off.
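The difference between the classic crawler and a rendering crawler can be illustrated with a headless browser. Playwright is used below purely as an example of rendering; it is not necessarily what InterroBot uses internally.

    import requests
    from playwright.sync_api import sync_playwright

    def fetch_raw(url: str) -> str:
        return requests.get(url, timeout=15).text          # HTML exactly as served

    def fetch_rendered(url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")        # let client-side JS run
            html = page.content()                           # DOM after rendering
            browser.close()
            return html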
Download Assets Without Content-Length. Whether to GET media BLOBs (video, audio, and other non-image binaries) served without a Content-Length header. Enabling this allows for accurate resource size data, but can significantly slow the crawler. Default is Off.
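When the header is missing, the only way to learn the size is to download and count the bytes, roughly as below (a sketch, not the exact implementation):

    import requests

    def resource_size(url: str) -> int:
        head = requests.head(url, timeout=15, allow_redirects=True)
        if "Content-Length" in head.headers:
            return int(head.headers["Content-Length"])       # cheap: header only
        size = 0
        with requests.get(url, stream=True, timeout=15) as response:
            for chunk in response.iter_content(chunk_size=65536):
                size += len(chunk)                           # count bytes, discard data
        return size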
HTTP Request Headers. Override options for a handful of authentication and identity HTTP headers. Authentication headers only apply to Site/Directory requests within the project domain. Changing the User-Agent header will not override robots.txt, whose rules are still applied to InterroBot.
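Conceptually, the overrides behave like the sketch below, where the project host, token, and header values are placeholders.

    import requests
    from urllib.parse import urlparse

    PROJECT_HOST = "example.com"                       # placeholder project domain
    OVERRIDES = {
        "User-Agent": "InterroBot",                    # identity header, sent everywhere
        "Authorization": "Bearer YOUR_TOKEN_HERE",     # auth header, project domain only
    }

    def build_headers(url: str) -> dict:
        headers = {"User-Agent": OVERRIDES["User-Agent"]}
        if urlparse(url).hostname == PROJECT_HOST:
            headers["Authorization"] = OVERRIDES["Authorization"]
        return headers

    # requests.get(url, headers=build_headers(url), timeout=15)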
Directory/Path Exclusions. You can treat certain URLs as ignored paths; they will not be downloaded or indexed. Useful for avoiding known time-sink directories. Limit is 10 additional rules.
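An exclusion rule amounts to a path check before a URL is queued, along these lines (the prefix-style syntax and example rules are illustrative):

    from urllib.parse import urlparse

    EXCLUDED_PATHS = ["/calendar/", "/tag/", "/print/"]   # hypothetical rules, max 10

    def is_excluded(url: str) -> bool:
        path = urlparse(url).path
        return any(path.startswith(prefix) for prefix in EXCLUDED_PATHS)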
URL Rewrites. Remove search, marketing, and session arguments from resolved URLs to collapse alike and near-alike pages into a single crawled page. Add your arguments as a comma-separated list.
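The effect of a rewrite rule is to drop the listed query arguments before the URL is stored, roughly as follows (the argument names are examples):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    STRIP_ARGS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "fbclid"}

    def rewrite(url: str) -> str:
        parts = urlsplit(url)
        kept = [(key, value) for key, value in parse_qsl(parts.query, keep_blank_values=True)
                if key not in STRIP_ARGS]
        return urlunsplit(parts._replace(query=urlencode(kept)))

    # rewrite("https://example.com/page?id=7&utm_source=news")
    # -> "https://example.com/page?id=7"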
Follow All. In some cases, following rel="nofollow" links reveals intentionally SEO-devalued pages. In others, nofollow is protecting you from navigation traps (e.g. endless calendars). Default is Off.
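In crawler terms, the option decides whether links marked rel="nofollow" are queued at all. A minimal sketch of that decision, not InterroBot's link extractor:

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collects hrefs, skipping rel="nofollow" links unless follow_all is set."""
        def __init__(self, follow_all: bool = False):
            super().__init__()
            self.follow_all = follow_all
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            attrs = dict(attrs)
            nofollow = "nofollow" in (attrs.get("rel") or "").lower()
            if attrs.get("href") and (self.follow_all or not nofollow):
                self.links.append(attrs["href"])

    parser = LinkCollector(follow_all=False)
    parser.feed('<a href="/archive" rel="nofollow">Archive</a><a href="/about">About</a>')
    # parser.links == ["/about"]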