Explain How a Big Data Hadoop System Works in Simple Words?
Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. Hadoop’s strength lies in its ability to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop delegates tasks across these servers (called “worker nodes” or “slave nodes”), essentially harnessing the power of each device and running them together simultaneously. This is what allows massive amounts of data to be analysed: splitting the tasks across different locations in this manner allows bigger jobs to be completed faster. Hadoop comprises many different components that all work together to create a single platform. There are two key functional components within this ecosystem: the storage of data (the Hadoop Distributed File System, or HDFS) and the framework for running parallel computations on this data (MapReduce).
Hadoop Distributed File System (HDFS)
HDFS enables Hadoop to store huge files. It’s a scalable file system that distributes and stores data across all machines in a Hadoop cluster. Each HDFS cluster contains the following:
- NameNode: Runs on the master node and keeps track of where data is stored, directing storage across the cluster.
- DataNode: Runs on the slave nodes, which make up the majority of the machines within a cluster. The NameNode instructs data files to be split into blocks, each of which is replicated three times and stored on machines across the cluster. These replicas ensure the entire system won’t go down if one server fails or is taken offline, a property known as fault tolerance.
- Client machine: Neither a NameNode nor a DataNode, client machines have Hadoop installed on them. They are responsible for loading data into the cluster, submitting MapReduce jobs and viewing the results once a job completes, as sketched just after this list.
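To make the client machine’s role concrete, here is a minimal sketch of loading a file into HDFS through Hadoop’s Java FileSystem API; the file and directory names are hypothetical and the client is assumed to have a standard Hadoop configuration pointing at the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: a client machine copying a local file into HDFS.
// The paths below are made-up examples.
public class LoadIntoHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the NameNode configured there

        // The NameNode decides which DataNodes receive the (replicated) blocks.
        fs.copyFromLocalFile(new Path("/local/emails/campaign.txt"),
                             new Path("/user/hadoop/emails/campaign.txt"));

        fs.close();
    }
}
```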
MapReduce
MapReduce is the system used to efficiently process the large amounts of data Hadoop stores in HDFS. In current versions of Hadoop, YARN handles resource management and job scheduling, while MapReduce performs the processing itself. Hadoop MapReduce executes a sequence of jobs, where each job is a Java application that runs on the data. Instead of writing MapReduce jobs directly, query tools such as Pig and Hive give data analysts greater power and flexibility.
Hadoop workflow
The typical workflow of Hadoop while executing a job includes:
- Loading data into the cluster (HDFS)
- Performing the computation using MapReduce jobs
- Storing the output results back in HDFS
- Retrieving the results from the cluster (HDFS)
Consider an instance where we have all the promotional emails sent to our customers and we want to find how many people were sent a discount coupon “DISCOUNT25” in a particular campaign. We can load this data into HDFS and then write a MapReduce job that reads all the emails, checks whether they contain the required word, and counts the number of customers who received such emails. Finally, it stores the result in HDFS, from where we can retrieve it.
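A minimal sketch of such a job is shown below, assuming each line of the input stored in HDFS is one email; the class names are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: emits ("DISCOUNT25", 1) for every email line that
// contains the coupon code.
public class CouponMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Text COUPON = new Text("DISCOUNT25");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains("DISCOUNT25")) {
            context.write(COUPON, ONE);
        }
    }
}

// Hypothetical reducer: sums the 1s to get the total number of matching emails.
class CouponReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }
        context.write(key, new IntWritable(count));
    }
}
```

The mapper emits a 1 for every matching email, and the reducer adds those 1s together to produce the final count that is written to HDFS.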
What is MapReduce and how does it work?
Hadoop MapReduce is the heart of the Hadoop system. It provides all the capabilities we need to break big data into manageable chunks, process the data in parallel on a distributed cluster, and then make the data available for user consumption or additional processing. And it does all this work in a highly resilient, fault-tolerant manner.
Hadoop MapReduce includes several stages, each with an important set of operations that helps us get the answers we need from big data. The process starts with a user request to run a MapReduce program and continues until the results are written back to HDFS. HDFS and MapReduce perform their work on nodes in a cluster hosted on racks of commodity servers.
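Continuing the coupon example, that “user request” is typically a small driver program that configures and submits the job; the sketch below assumes the hypothetical mapper and reducer classes from the earlier example and uses made-up HDFS paths.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: the "user request" that submits the MapReduce program.
// Input is read from HDFS and the results are written back to HDFS.
public class CouponCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "coupon count");
        job.setJarByClass(CouponCountDriver.class);

        job.setMapperClass(CouponMapper.class);
        job.setReducerClass(CouponReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths for the loaded emails and the job output.
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/emails"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/coupon-count"));

        // Blocks until the cluster finishes the map and reduce stages.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```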
...
...