A scalable web crawler framework for efficient downloading and change detection, built on Scala and Apache Spark.
Arachne is an open-source web crawler framework developed by the Internet Archive in 2018. It is designed for building high-performance, scalable web crawlers that download web pages efficiently and detect changes between periodic crawls.
Some key features and capabilities of Arachne include:
Arachne is well suited to building archive-quality web crawlers and focused topical crawlers, and to running custom analytics over a corpus of web pages. Its distributed design lets it scale to very large crawls.