What is Disco MapReduce?
Disco is an open-source MapReduce framework, originally developed at Nokia, for distributed processing of very large data sets across clusters of commodity hardware. Its core is written in Erlang, and jobs are typically written in Python. It is designed to be scalable, fault-tolerant and easy to use.
Some key features of Disco MapReduce include:
- Automatic parallelization and distribution of MapReduce jobs
- Fault tolerance: failed tasks are retried automatically, so one failing node need not fail the whole job
- Support for different storage systems such as HDFS and Amazon S3, alongside Disco's own distributed filesystem (DDFS)
- Web-based job monitoring and control interface
- Lightweight Python programming interface
- Batch-oriented and stream-oriented MapReduce interfaces
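As a rough illustration of the programming model behind the Python interface listed above, the sketch below runs a word count in a single process. It does not import or call Disco itself: `map_fn`, `reduce_fn` and `run_local` are hypothetical names chosen for this example, and the in-memory sort only stands in for the shuffle step that a real cluster would perform across nodes.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line, params):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(pairs, params):
    """Sum the counts per word. Expects pairs already sorted by key."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines, params=None):
    """Hypothetical single-process stand-in for a cluster run (not Disco's API)."""
    # Map phase: on a cluster, each input chunk would be mapped on a separate worker.
    mapped = [pair for line in lines for pair in map_fn(line, params)]
    # Sorting by key stands in for the shuffle phase between map and reduce.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: collapse each run of identical keys into a single total.
    return dict(reduce_fn(mapped, params))


if __name__ == "__main__":
    print(run_local(["the quick brown fox", "The lazy dog"]))
```

On a real cluster the framework would take over the role of `run_local`: it would run many copies of the map function in parallel on different input chunks and route the intermediate pairs to the reducers, so only the two small functions would need to be written by the user.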
Disco is designed to handle very large data sets, on the order of petabytes, and to scale to thousands of nodes. It has been used at Nokia for data-intensive workloads such as clickstream analysis, data mining and machine learning.
Overall, Disco MapReduce is an open-source alternative to hosted commercial services such as Amazon EMR, with the added flexibility of running on private cloud infrastructure.