Effortlessly crawl large websites with StormCrawler on Apache Storm, with fault tolerance and seamless integration with other Storm components.
StormCrawler is an open-source distributed web crawler designed to crawl very large websites quickly by scaling horizontally. It is built on top of Apache Storm, a distributed real-time computation system, from which it inherits its scalability and fault tolerance.
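Because StormCrawler components are ordinary Storm spouts and bolts, a crawl is assembled as a Storm topology. The sketch below is loosely modelled on the example topology shipped with StormCrawler; treat the class names and the com.digitalpebble.stormcrawler package (used by older releases, with newer Apache releases moving to org.apache.stormcrawler) as assumptions to check against the version you run.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Seed URLs kept in memory; a production crawl would use a spout
        // backed by a persistent status index instead.
        builder.setSpout("spout", new MemorySpout(
                new String[] { "https://example.org/" }));

        // Group URLs by host so politeness limits apply per target server.
        builder.setBolt("partitioner", new URLPartitionerBolt())
                .shuffleGrouping("spout");

        // Fetches pages, honouring robots.txt and per-queue throttling.
        builder.setBolt("fetch", new FetcherBolt())
                .fieldsGrouping("partitioner", new Fields("key"));

        // Parses HTML and extracts outlinks for further crawling.
        builder.setBolt("parse", new JSoupParserBolt())
                .localOrShuffleGrouping("fetch");

        // Consumes the status stream (discovered, fetched, failed URLs);
        // the stdout updater is for demos only.
        Fields byUrl = new Fields("url");
        builder.setBolt("status", new StdOutStatusUpdater())
                .fieldsGrouping("fetch", Constants.StatusStreamName, byUrl)
                .fieldsGrouping("parse", Constants.StatusStreamName, byUrl);

        return submit("crawl", conf, builder);
    }
}
```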
Key features of StormCrawler include built-in politeness (it respects robots.txt and throttles requests to each host), efficient handling of complex sites and sitemaps without overloading the target servers, and the horizontal scalability and fault tolerance it inherits from Storm. Typical use cases include search engine indexing, building machine learning datasets, and web archiving. Together these make StormCrawler a robust platform for large-scale web crawling.
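Politeness and fetch behaviour are driven by the crawler configuration, a YAML file (typically crawler-conf.yaml) merged over the defaults shipped with StormCrawler. The excerpt below illustrates the kind of settings involved; the exact key names and values are assumptions to verify against the default configuration of your release.

```yaml
# crawler-conf.yaml (illustrative excerpt; key names to be verified)
config:
  # Identify the crawler to remote servers; an agent name is required by default.
  http.agent.name: "my-crawler"
  http.agent.email: "crawler-admin@example.org"

  # Politeness: minimum delay (seconds) between requests to the same host queue,
  # and a cap on the Crawl-delay a site may request via robots.txt.
  fetcher.server.delay: 1.0
  fetcher.max.crawl.delay: 30

  # Parallelism: total fetch threads, and how many may hit one queue at a time.
  fetcher.threads.number: 50
  fetcher.threads.per.queue: 1
```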