Advantages and Disadvantages of Hadoop and Spark

Apache Hadoop is a platform that got its start as a Yahoo project in 2006 and later became a top-level Apache open-source project. The framework handles large datasets in a distributed fashion. The Hadoop ecosystem is highly fault-tolerant: rather than relying on hardware to achieve high availability, it is designed to detect and handle failures at the application layer.


It’s a general-purpose form of distributed processing that has several components: 

  • Hadoop Distributed File System (HDFS): Stores files in a Hadoop-native format and parallelizes them across the cluster, managing large sets of data across a Hadoop cluster. It can handle both structured and unstructured data. 
  • YARN: Short for Yet Another Resource Negotiator, it is a scheduler that coordinates application runtimes.
  • MapReduce: The programming model that actually processes the data in parallel: map tasks transform the data, and reduce tasks combine the pieces into the desired result. 
  • Hadoop Common: Also known as Hadoop Core, it provides a set of common libraries and utilities that all the other modules depend on.

Hadoop is built in Java and is accessible through many programming languages for writing MapReduce code, including Python (commonly via the Hadoop Streaming utility). It's available either open-source through the Apache distribution, or through vendors such as Cloudera (the largest Hadoop vendor by size and scope), MapR, or Hortonworks. 
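To make the MapReduce model above concrete, here is a minimal sketch in pure Python that mimics the mapper/shuffle/reducer contract (function names are mine, not Hadoop's API; a real job would run the same three phases distributed across a cluster):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: combine all values emitted for one key.
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle phase: group mapper output by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

print(map_reduce(["big data big cluster", "big data"]))
# {'big': 3, 'cluster': 1, 'data': 2}
```

The point of the model is that `mapper` and `reducer` only ever see one line or one key at a time, which is what lets Hadoop run thousands of copies of them in parallel on different nodes.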


Advantages and Disadvantages of Hadoop

Advantages of Hadoop:

  1. Cost-effective: it runs on inexpensive commodity hardware. 
  2. Processes large volumes of data quickly by working in parallel.
  3. Well suited when a company has a diversity of data to process.
  4. Creates multiple copies (replicas) of the data for fault tolerance.
  5. Saves time and can derive value from data in any form.


Disadvantages of Hadoop:

  1. Performs poorly in small-data environments; it is not a fit for small datasets or many small files
  2. Built entirely on Java, a language that has frequently been targeted by exploits
  3. Lack of preventive security measures; encryption, for example, is not enabled by default
  4. Potential stability issues


What is Spark?

Apache Spark is an open-source tool. It is a newer project, initially developed in 2009 at UC Berkeley's AMPLab and later donated to the Apache Software Foundation. Like MapReduce, it processes data in parallel across a cluster, but the biggest difference is that it works in memory: it is designed to use RAM for caching and processing the data. Spark performs different types of big data workloads, such as:

  • Batch processing.
  • Real-time stream processing. 
  • Machine learning.
  • Graph computation.
  • Interactive queries. 

There are five main components of Apache Spark:

  • Apache Spark Core: Responsible for essential functions such as scheduling, task dispatching, and input and output operations.
  • Spark SQL: A module for working with structured data; it lets applications query that data using SQL.
  • Spark Streaming: Enables the processing of live data streams. 
  • MLlib (Machine Learning Library): Provides scalable machine-learning algorithms, with the goal of making machine learning more accessible.
  • GraphX: A set of APIs for facilitating graph analytics tasks.
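Spark Core's key ideas, lazy transformations and in-memory caching, can be sketched with a toy class in plain Python. This is an illustration only (the name `ToyRDD` and its methods are invented for this sketch, not Spark's actual API): transformations are merely recorded, work happens when an action is called, and `cache()` keeps the computed result in RAM so later actions reuse it.

```python
class ToyRDD:
    """Toy stand-in for Spark's RDD: transformations are recorded
    lazily and only executed when an action (collect) is called."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []   # pending transformations, not yet run
        self._cache = None      # in-memory cache of a computed result

    def map(self, fn):
        # Transformation: returns a new ToyRDD; nothing is computed yet.
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, fn):
        return ToyRDD(self._data, self._ops + [("filter", fn)])

    def cache(self):
        # Compute once and keep the result in RAM for reuse.
        self._cache = self.collect()
        return self

    def collect(self):
        # Action: run the whole pipeline, or reuse the cached result.
        if self._cache is not None:
            return self._cache
        out = list(self._data)
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

even_squares = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())  # [0, 4, 16]
```

Keeping intermediate results in memory this way, instead of writing them to disk between every step as MapReduce does, is the main reason Spark is faster for iterative workloads such as machine learning.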


Advantages and Disadvantages of Spark

Advantages of Spark:

  1. Well suited to interactive processing, iterative processing, and event-stream processing
  2. Flexible and powerful
  3. Support for sophisticated analytics
  4. Executes batch processing jobs faster than MapReduce
  5. Runs on Hadoop alongside other tools in the Hadoop ecosystem


Disadvantages of Spark:

  1. Consumes a lot of memory
  2. Issues with small files
  3. Fewer built-in algorithms
  4. Higher latency compared to Apache Flink