Apache Hadoop for Big Data Management

By Ms. Palak Gupta

Big data technologies are important in providing accurate analysis, which may lead to more concrete decision-making, resulting in greater operational efficiencies, cost reductions, and reduced risk for the business. To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security. There are various technologies in the market from different vendors, including Amazon, IBM, Microsoft, etc., to handle big data.

The major challenges associated with big data are as follows:

  • Capturing data
  • Curation (Management)
  • Storage
  • Searching
  • Sharing
  • Transfer
  • Analysis
  • Presentation

To meet these challenges, organizations normally rely on enterprise servers.

Big Data Tools

  • NoSQL databases: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
  • MapReduce: Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
  • Storage: S3, Hadoop Distributed File System (HDFS)
  • Servers: EC2, Google App Engine, Elastic Beanstalk, Heroku
  • Processing: R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets

Hadoop – Big Data Solutions

Traditional Approach-

  • In this approach, an enterprise will have a computer to store and process big data. Here data will be stored in an RDBMS such as Oracle Database, MS SQL Server, or DB2, and sophisticated software can be written to interact with the database, process the required data, and present it to the users for analysis purposes.
  • This approach works well where the volume of data is small enough to be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of data, it is a tedious task to process such data through a traditional database server. A minimal sketch of this database-driven approach appears after this list.
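The following Java sketch illustrates the traditional approach: application code querying a single RDBMS through JDBC. The connection URL, credentials, and the orders table here are hypothetical placeholders, not part of any particular system.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TraditionalQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; any JDBC-compliant RDBMS works the same way.
        String url = "jdbc:mysql://localhost:3306/sales";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // The whole dataset lives in one database, so one server does all the work.
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(amount) FROM orders GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
        }
    }
}

This works only as long as a single database server can hold and scan the data, which is exactly the limitation described above.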

Modern Approach-

  • Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts and assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.
  • MapReduce runs on commodity hardware, which can range from single-CPU machines to servers with higher capacity; a minimal single-machine sketch of the idea follows this list.
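As a rough illustration of the idea (not Hadoop itself), the following Java sketch splits a word-count job into independent pieces, processes each piece on a separate thread, and merges the partial results. In Hadoop, the pieces would be distributed across machines in a cluster rather than across threads on one machine.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MiniMapReduce {
    public static void main(String[] args) throws Exception {
        List<String> lines = Arrays.asList("big data", "big hadoop", "data data");

        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<Map<String, Integer>>> partials = new ArrayList<>();

        // "Map" phase: each piece of the input is counted independently.
        for (String line : lines) {
            partials.add(pool.submit(() -> {
                Map<String, Integer> counts = new HashMap<>();
                for (String word : line.split("\\s+")) {
                    counts.merge(word, 1, Integer::sum);
                }
                return counts;
            }));
        }

        // "Reduce" phase: merge the partial counts into the final result.
        Map<String, Integer> result = new HashMap<>();
        for (Future<Map<String, Integer>> partial : partials) {
            partial.get().forEach((word, n) -> result.merge(word, n, Integer::sum));
        }
        pool.shutdown();

        System.out.println(result); // e.g. {big=2, data=3, hadoop=1}
    }
}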

Hadoop and its Framework

Doug Cutting, Mike Cafarella, and their team took the solution provided by Google and started an open-source project called Hadoop in 2005; Doug named it after his son’s toy elephant. Apache Hadoop is now a registered trademark of the Apache Software Foundation. Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes.

In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data. Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
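The classic first Hadoop application is word counting. The sketch below closely follows the WordCount example from the Apache Hadoop MapReduce tutorial: the mapper emits a (word, 1) pair for every word it sees, the reducer sums the counts for each word, and the HDFS input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in a line of input.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar, it would typically be launched with something like "hadoop jar wordcount.jar WordCount /input /output", where the two paths are placeholders for directories in HDFS.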


Hadoop Framework /Architecture

The Hadoop framework includes the following four modules:

  • Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
  • Hadoop YARN: This is a framework for job scheduling and cluster resource management.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data; a short usage sketch follows this list.
  • Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.
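As a small illustration of how application code talks to HDFS, the sketch below uses Hadoop's FileSystem Java API to upload a local file and list a directory. The NameNode address, file names, and paths are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS, then list the target directory.
        fs.copyFromLocalFile(new Path("sample.txt"), new Path("/data/sample.txt"));
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}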

Since 2012, the term “Hadoop” often refers not just to the base modules mentioned above but also to the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, and Apache Spark.

  • Apache HBase provides random, real-time read/write access to big data. It hosts very large tables on top of clusters of commodity hardware.
  • The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used to help Hadoop modules.
  • Sqoop: It is used to import and export data between HDFS and relational databases (RDBMS).
  • Pig: It is a procedural language platform used to develop scripts for MapReduce operations.
  • Hive: It is a platform used to develop SQL-type scripts to perform MapReduce operations.

Advantages of Hadoop

  • The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.
  • Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer.
  • Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
  • Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java-based.
  • Applications include recommendation systems, video and image analysis, log processing, and analytics.

Ms. Palak Gupta

Assistant Professor

JIMS Kalkaji
