
StormCrawler: Open Source Web Crawler

Crawl large websites effortlessly on Apache Storm, with fault tolerance and seamless integration with other Storm components.

What is StormCrawler?

StormCrawler is an open source distributed web crawler designed to crawl very large websites quickly by scaling horizontally. It is built on top of Apache Storm, a distributed real-time computation system, which makes it highly scalable and fault-tolerant.
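
To make the Storm connection concrete, below is a minimal crawl topology sketch along the lines of the one generated by StormCrawler's Maven archetype. The wiring is standard Storm; the StormCrawler class names and the com.digitalpebble.stormcrawler package reflect one release of the project and should be treated as assumptions, so check the archetype for your version.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // In-memory seed list; production crawls use a persistent status backend
        builder.setSpout("spout", new MemorySpout("https://example.com/"));

        // Group URLs by host so per-site politeness limits can be enforced
        builder.setBolt("partitioner", new URLPartitionerBolt())
               .shuffleGrouping("spout");

        // Fetch pages; URLs for the same host land on the same fetcher task
        builder.setBolt("fetcher", new FetcherBolt())
               .fieldsGrouping("partitioner", new Fields("key"));

        // Parse HTML, extracting text and outlinks
        builder.setBolt("parser", new JSoupParserBolt())
               .localOrShuffleGrouping("fetcher");

        // Dump parsed documents to stdout; swap in an Elasticsearch indexer in practice
        builder.setBolt("indexer", new StdOutIndexer())
               .localOrShuffleGrouping("parser");

        return submit("crawl", conf, builder);
    }
}
```

Each stage is an ordinary Storm component, which is what makes it straightforward to splice custom processing bolts into the stream.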

Some key features of StormCrawler include:

  • Horizontal scaling - By leveraging Apache Storm, StormCrawler can scale to very large websites by adding more workers and crawl instances.
  • Fault tolerance - Storm provides guaranteed message processing, so if a crawl instance goes down, its in-flight URLs are replayed rather than lost (see the acking sketch after this list).
  • Extensibility - StormCrawler exposes clear extension points and abstraction layers, allowing custom implementations for fetching, parsing, indexing, and more.
  • Ease of configuration - Simple YAML config files define the crawl scope, scheduling, and output targets such as Elasticsearch.
  • Real-time processing - Crawl results can be processed in real time by wiring in other Storm components for tasks like machine learning or NLP.
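
The fault-tolerance bullet rests on Storm's guaranteed message processing: a bolt anchors each tuple it emits to the input tuple it came from and acks the input only once processing succeeds, so any failure in the tuple tree triggers a replay from the spout. The hypothetical bolt below (not a StormCrawler class) shows that anchor/ack/fail pattern using the plain Storm API:

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt: lowercases a URL field, illustrating anchoring and acking
public class UrlNormalizerBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String url = input.getStringByField("url");
            // Anchor the output to the input so Storm tracks the full tuple tree
            collector.emit(input, new Values(url.toLowerCase()));
            // Ack only after the work succeeded ...
            collector.ack(input);
        } catch (Exception e) {
            // ... otherwise fail, and Storm replays the tuple from the spout
            collector.fail(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url"));
    }
}
```

Because the spout replays failed tuples, a crashed fetcher loses no URLs; they are simply retried once the topology recovers.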

StormCrawler is designed to crawl complex sites and sitemaps efficiently without overloading targets: it respects robots.txt and has built-in throttling. Typical use cases include search engine indexing, building machine learning datasets, and archiving. With horizontal scalability and fault tolerance, StormCrawler provides a robust platform for large-scale web crawling.
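
Politeness and crawl scope are driven by that YAML configuration; Storm hands it to the topology as a plain key-value map, so the same settings can also be applied programmatically. The key names below mirror StormCrawler's default crawler-conf.yaml as far as I recall it; treat the exact names and values as assumptions and check the file shipped with your version.

```java
import org.apache.storm.Config;

public class CrawlConfig {
    public static void main(String[] args) {
        Config conf = new Config(); // Config is a Map<String, Object> under the hood

        // Identify the crawler to webmasters before fetching anything
        conf.put("http.agent.name", "my-crawler");        // hypothetical agent name
        conf.put("http.agent.email", "ops@example.com");  // hypothetical contact address

        // Throttling: one fetch queue per host, with a fixed delay between requests
        conf.put("fetcher.queue.mode", "byHost");
        conf.put("fetcher.server.delay", 1.0);   // seconds between requests to the same host
        conf.put("fetcher.threads.number", 50);  // fetch threads per worker

        System.out.println(conf);
    }
}
```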

Features

  1. Distributed web crawling
  2. Fault tolerant
  3. Horizontally scalable
  4. Integrates with other Apache Storm components
  5. Configurable politeness policies
  6. Supports parsing and indexing
  7. APIs for feed injection (see the spout sketch below)
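
On feed injection (item 7): new URLs enter a running topology through a spout. StormCrawler's real spouts read seeds from files or a status backend and emit a metadata field alongside each URL; the hypothetical SeedSpout below strips that down to the bare mechanics of emitting seeds with a message id so that failed URLs are replayed.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Hypothetical seed spout: a stand-in for StormCrawler's file- or index-backed spouts
public class SeedSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Queue<String> seeds = new ConcurrentLinkedQueue<>();

    public SeedSpout(String... urls) {
        for (String url : urls) {
            seeds.add(url);
        }
    }

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String url = seeds.poll();
        if (url != null) {
            // The URL doubles as the message id, enabling replay on failure
            collector.emit(new Values(url), url);
        }
    }

    @Override
    public void fail(Object msgId) {
        // Storm calls this when a downstream bolt failed the tuple; retry the URL
        seeds.add((String) msgId);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url"));
    }
}
```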

Pricing

  • Open Source

Pros

  • Highly scalable
  • Resilient to failures
  • Easy integration with other data pipelines
  • Open source with active community

Cons

  • Complex setup and configuration
  • Requires running Apache Storm cluster
  • No out-of-the-box UI for monitoring
  • Limited documentation and examples


The Best StormCrawler Alternatives

Top development, web crawling & scraping, and other similar apps like StormCrawler

Here are some alternatives to StormCrawler:



Scrapy

Scrapy is a fast, powerful and extensible open source web crawling framework for extracting data from websites, written in Python. Key features and uses of Scrapy include scraping data such as titles, links, and images from HTML/XML web pages, and recursively following links to scrape data from multiple...

Crawlbase

Crawlbase is a powerful yet easy-to-use website crawler and web scraper. It allows you to efficiently crawl websites and extract targeted data or content into a structured format like CSV files or databases. Key features of Crawlbase include an intuitive visual interface for creating, managing and scheduling crawlers; support for crawl depths, politeness...

Apache Nutch

Apache Nutch is an open source web crawler software project written in Java. It provides a highly extensible, fully featured web crawler engine for building search indexes and archiving web content. Nutch can crawl websites by following links and indexing page content and metadata. It supports flexible customization and pluggable parsing,...

Lookyloo

Lookyloo is an open source web crawling and website analysis platform. It provides an extensible framework for developers and security researchers to build custom scrapers, analyzers, and visualizers to explore and monitor websites. Key capabilities of Lookyloo include flexible crawling with support for depth-first, breadth-first, and manual/custom strategies; a plugin architecture...

Mixnode

Mixnode is a privacy-focused web browser developed by Mixnode Technologies Inc. Its main goal is to prevent user tracking and protect personal data when browsing the internet. Key features of Mixnode include blocking online ads and trackers by default to limit data collection, and offering encrypted proxy connections to hide user IP addresses...

Heritrix

Heritrix is an open-source web crawler software project originally developed by the Internet Archive. It is designed to systematically browse and archive web pages by recursively following hyperlinks and storing the content in the WARC file format. Key features of Heritrix include an extensible and modular architecture based on Apache...

ACHE Crawler

ACHE Crawler is an open-source web crawler written in Java. It provides a framework for building customized crawlers to systematically browse websites and collect useful information from them. Key features of ACHE Crawler include a scalable architecture based on distributed computing to crawl large sites quickly; a flexible plugin system to add customized data...