A scalable web crawler framework for efficient downloading and change detection, built on Scala and Apache Spark.
Arachne is an open-source web crawler framework developed by the Internet Archive in 2018. It is designed for building high-performance, scalable web crawlers that download web pages efficiently and detect changes across periodic crawls.
Some key features and capabilities of Arachne include:
Arachne is well suited to building archive-quality web crawlers and focused topical crawlers, and to running custom analytics over a corpus of web pages. Its distributed design lets crawls scale out across a cluster to handle very large workloads.
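To make the idea of change detection across periodic crawls concrete, here is a minimal sketch in plain Scala. It is not Arachne's API: the object `ChangeDetectionSketch`, the `Change` type, and the functions `fingerprint` and `diff` are hypothetical names chosen for illustration. It also assumes that a page counts as changed whenever the SHA-256 digest of its body differs between two crawls, which is only one possible definition of "change".

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical illustration only; none of these names come from Arachne.
object ChangeDetectionSketch {
  sealed trait Change
  case object Added extends Change
  case object Removed extends Change
  case object Modified extends Change
  case object Unchanged extends Change

  // Hex-encoded SHA-256 digest of a page body (assumed change signal).
  def fingerprint(body: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(body.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b))
      .mkString

  // Classify every URL seen in either crawl.
  // Both maps go from URL to the fingerprint recorded in that crawl.
  def diff(previous: Map[String, String],
           current: Map[String, String]): Map[String, Change] =
    (previous.keySet ++ current.keySet).map { url =>
      val change: Change = (previous.get(url), current.get(url)) match {
        case (None, Some(_))              => Added
        case (Some(_), None)              => Removed
        case (Some(a), Some(b)) if a == b => Unchanged
        case _                            => Modified
      }
      url -> change
    }.toMap
}
```

Storing only fingerprints from the previous crawl keeps the comparison cheap, since page bodies do not need to be retained just to detect changes. On a cluster, the same comparison could plausibly be expressed as a join of two crawl datasets keyed by URL, though how Arachne itself does this is not described here.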