Friday, October 7, 2016

Components of Big Data - Hadoop System

In this blog post I will explain the important components that make up the Hadoop system, with a very brief overview of each.

Below is a very high-level view of the components in the Hadoop system.


  • Master Node (MN)
    • Name Node (NN)
      • It is a daemon process that runs on the Master Node.
      • Takes care of reading the data file to be analyzed.
      • Splits the data file into blocks based on the configured block size; the default is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x (the first sketch after this list shows how a client can set this per file).
      • Distributes the resulting blocks across multiple Data Nodes.
      • Maintains the index file to keep track of where the data has been distributed. Think of this as the "Table of Contents" of a book.
      • It provides the Job Tracker with the locations of the data blocks on the Data Nodes.
      • This is one part of the HDFS layer of Hadoop.
    • Job Tracker (JT)
      • The Job Tracker is also a daemon process.
      • This is part of the processing engine of the Hadoop system.
      • It is responsible for running the program that analyzes the data and produces the results.
      • The Job Tracker communicates with the NN to identify the locations of the data blocks. Once the Data Node locations are identified, it moves the program to those Data Nodes for execution.
      • The JT tries its best to run the analysis local to the node that holds the data, so the data does not have to travel over the network and is processed faster.
      • If it cannot assign the task to a Task Tracker local to the data, it next looks for an available node in the same rack.
      • Once the Job Tracker receives output from the multiple Task Trackers, it schedules the reduce step to consolidate those outputs and generate the final result of the analysis (the driver sketch at the end of this post shows how such a job is submitted).
      • An important role of the Job Tracker is to monitor all the tasks running in the Task Trackers and to restart a task if something fails.
  • Slave Node (SN)
    • Data Node (DN)
      • This is a daemon process that runs on a Slave Node.
      • There can be many Data Nodes, since the number of Slave Nodes can be more than one.
      • The task of this process is to receive and store the data blocks assigned by the Name Node.
      • The DN is responsible for maintaining the data it receives and keeping track of these data files.
      • Together, the NN and DNs form and manage HDFS.
    • Task Tracker(TT)
      • This is a daemon process that runs on the Data Node.
      • Programs sent from the JT are received by this process and stored on the Slave Node.
      • After receiving the program, it initiates it to analyze the local data file.
      • Once the analysis is complete, it produces the result and shares it back with the JT.
      • Together, the JT and TTs make up MapReduce.
      • The Task Tracker keeps sending a heartbeat signal to the Job Tracker so that the Job Tracker knows the process is running fine.
      • If a TT fails to send a heartbeat to the JT, the JT re-initiates that task on another available TT.
  • HDFS
    • Hadoop Distributed File System
    • Two important concepts in HDFS:
      • Block Size
      • Fault Tolerance/Failure 
    • The Name Node and Data Nodes create and manage HDFS.
    • The Name Node is the master, which takes care of splitting the file and distributing its blocks across multiple Data Nodes.
    • HDFS is a fail-safe system which ensures that stored data is never lost; or rather, I should say, the chances of losing data are very small.
    • HDFS ensures fault tolerance by keeping copies of the same data blocks on multiple Data Nodes. By default it maintains 3 copies (the replication factor), so that if any one Data Node crashes, a backup copy can be used.
  • MapReduce
    • MapReduce is the processing engine of the Hadoop system.
    • It is a programming model that enables processes to run in parallel, distributed across multiple nodes.
    • The Job Tracker and Task Trackers make up this processing engine of the Hadoop system.
    • There are basically two phases in MapReduce.
    • The first phase is Map, which analyzes the data file and produces an output that goes as input to the second phase, called Reduce. The Map task's job is to filter and sort the data (the word-count sketch after this list shows a minimal Map and Reduce pair).
    • The Reducer takes this processed output and creates a consolidated report.
  • Secondary Name Node
    • This process runs on another system in the Hadoop cluster.
    • Despite the name, it is not an automatic failover for the Name Node; its checkpoints are what allow recovery when the main Name Node fails or goes down.
    • It keeps interacting with the NN at regular intervals and creates a backup of the index file on a separate system.
    • The backup system can be the one where the Secondary Name Node is running, or another system in some other location or rack.
    • Its task is to make it possible to recreate a new Name Node by merging the FSImage and edit logs.
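
To make the block size and replication points above concrete, here is a minimal sketch, assuming the standard org.apache.hadoop.fs.FileSystem client API, of writing a file to HDFS with an explicit block size and replication factor. The path, sizes, and buffer setting are hypothetical values for illustration, not from this post.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes fs.defaultFS points at the cluster's Name Node.
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/input/sample.txt"); // hypothetical path
        short replication = 3;                          // HDFS default replication factor
        long blockSize = 128L * 1024 * 1024;            // 128 MB blocks (Hadoop 2.x default)

        try (FSDataOutputStream out = fs.create(
                file,
                true,                                      // overwrite if it exists
                conf.getInt("io.file.buffer.size", 4096),  // write buffer size
                replication,
                blockSize)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Note that the client only requests the block size and replication factor; it is the Name Node that decides which Data Nodes physically hold each block and each of its replicas.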
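The Map and Reduce phases described above are easiest to see in the classic word-count program from the Hadoop tutorials; the sketch below is a minimal version of it, not code from this post. It assumes the default TextInputFormat, so each map call receives one line of text keyed by its byte offset in the file.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: runs on each Data Node, close to the block it was assigned.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit (word, 1) for each word on the line
            }
        }
    }

    // Reduce phase: consolidates the per-node map outputs into one total per word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total count)
        }
    }
}
```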
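Finally, a sketch of the driver that submits such a job. In a Hadoop 1.x cluster this submission is what the Job Tracker receives before scheduling map tasks on the Task Trackers; the input and output paths here are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));   // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer as a combiner is an optional design choice: it pre-aggregates counts on each Task Tracker before the shuffle, which cuts down the data that has to be sent to the reduce step.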
