An open-source, extensible web crawler project built on the Apache stack for archiving periodic captures of content from the web and large intranets.
Heritrix is an open-source web crawler software project that was originally developed by the Internet Archive. It is designed to systematically browse and archive web pages by recursively following hyperlinks and storing the content in the WARC file format.
Some key features of Heritrix include:
Heritrix is well-suited for building specialty search engine indexes, archiving online content for preservation purposes, and offline browsing of websites. Its focused crawling features allow users to customize crawl scopes and avoid unrelated content.
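To make the idea of a scoped, recursive crawl concrete, here is a minimal Python sketch. It is not Heritrix code and does not use Heritrix's configuration or APIs. It only illustrates the general technique of following hyperlinks breadth-first while discarding links that fall outside an allowed set of hosts. The names `crawl`, `in_scope`, `LinkExtractor`, and `FAKE_WEB` are hypothetical, and pages come from an in-memory dictionary rather than the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_scope(url, allowed_hosts):
    """Hypothetical scope rule: keep only URLs on an allowed host."""
    return urlparse(url).hostname in allowed_hosts


def crawl(seeds, fetch, allowed_hosts, max_pages=100):
    """Breadth-first crawl; `fetch` returns a page body or None."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid repeats
    captured = {}             # URL -> page body
    while frontier and len(captured) < max_pages:
        url = frontier.popleft()
        body = fetch(url)
        if body is None:
            continue
        captured[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and in_scope(link, allowed_hosts):
                seen.add(link)
                frontier.append(link)
    return captured


# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="http://other.test/">X</a>',
    "http://example.org/a": '<a href="/">home</a> <a href="b#top">B</a>',
    "http://example.org/b": "<p>leaf page</p>",
    "http://other.test/": "<p>out of scope</p>",
}

captured = crawl(["http://example.org/"], FAKE_WEB.get, {"example.org"})
print(sorted(captured))
```

In this sketch the page on `other.test` is linked from the seed but never fetched, because the scope rule rejects it before it reaches the frontier. A real archival crawler would also write each response to a WARC file, which is omitted here.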
While Heritrix excels at archival-quality crawls, it has higher overhead than some other crawler software. It prioritizes quality, completeness and politeness over crawling speed. Heritrix is typically run on powerful servers and can handle complex, large-scale web crawling projects.
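The trade-off between politeness and speed can be illustrated with a small Python sketch. It assumes a simple fixed minimum delay between requests to the same host, which is an illustrative policy and not a description of how Heritrix itself computes delays. The class name `PolitenessScheduler` and its parameters are hypothetical, and the clock is passed in explicitly so the example runs without sleeping.

```python
from urllib.parse import urlparse


class PolitenessScheduler:
    """Assigns each URL the earliest time it may be fetched.

    Assumed policy: at least `min_delay_seconds` between two
    requests to the same host. Different hosts do not block
    each other.
    """

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest allowed fetch time

    def schedule(self, url, now):
        host = urlparse(url).hostname
        # Wait until the host's slot is free, but never go back in time.
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start


scheduler = PolitenessScheduler(min_delay_seconds=2.0)
urls = [
    "http://example.org/1",
    "http://example.org/2",
    "http://other.test/1",
    "http://example.org/3",
]
times = [scheduler.schedule(u, now=0.0) for u in urls]
print(times)  # [0.0, 2.0, 0.0, 4.0]
```

Three pages on one host take four seconds to schedule under this policy, while the page on a second host starts immediately. This is why a polite crawler is slow against a single site but can still make good progress when the crawl spans many hosts.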
Here are some alternatives to Heritrix: