Apache Hadoop introduction and overview

Open-source software (OSS) is software whose source code is publicly available, so anyone can inspect it, modify it, and share the changes under the terms of its license. Source code is what programmers edit to change how software works, whether by adding new features or improving existing ones. VLC media player, Mozilla Firefox, Linux, and the Apache web server are some of the most popular examples of OSS.

Apache Hadoop is open-source software for large-scale data storage and processing. It lets many computers work together as a cluster to solve problems that involve massive amounts of data and computation. It stores data in the Hadoop Distributed File System (HDFS) and processes it with the MapReduce programming model. Licensed under Apache License 2.0, Hadoop is one of Apache’s top-level projects and is used extensively around the world.

About Apache Hadoop:-

Hadoop is a top-level project of the Apache Software Foundation, licensed under Apache License 2.0. It was created by Doug Cutting, who worked at Yahoo during Hadoop’s early years and later became chief architect of Cloudera, and Mike Cafarella. Development began in 2005 as part of the Apache Nutch search engine project and was inspired by the Google File System paper published in 2003. In January 2006, Hadoop was split out as its own sub-project, named after the toy elephant of Doug Cutting’s son. Hadoop 0.1.0 was released in 2006, and the project has continued to evolve since.

The framework of Apache Hadoop:-

The Apache Hadoop framework provides the following modules:-

  1. Hadoop Common- The standard package of libraries and utilities required by the other Hadoop modules, including file system and operating system level abstractions.
  2. Hadoop Distributed File System- This file system stores data across the cluster and provides high aggregate bandwidth. Any file system compatible with Hadoop should provide location awareness, such as the name of the rack where each node sits. Hadoop uses this information when replicating data so that copies are kept on different racks; if a rack loses power or a switch fails, the data remains safely stored and available.
  3. Hadoop YARN- The resource management platform. It manages the cluster’s computing resources and uses them to schedule users’ applications.
  4. Hadoop MapReduce- Hadoop’s programming model for fast, large-scale data processing.
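
To make the MapReduce programming model more concrete, here is a minimal single-machine sketch of a word count in plain Python. It does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. In a real cluster, the framework would run many mappers and reducers in parallel and perform the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The mapper and reducer never share state, which is what allows a framework such as Hadoop to spread them over many machines.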

The Apache Hadoop software is designed to handle hardware problems automatically, without burdening the user. All of its modules are built on the fundamental assumption that hardware failures are common and should be handled by the framework rather than by the user. HDFS and MapReduce trace back to the Google File System paper, published in 2003, and to Google’s MapReduce paper.

HDFS and MapReduce:-

Hadoop consists of two central parts, HDFS and the MapReduce programming model: HDFS stores the data, and MapReduce processes the stored large-scale data. The Hadoop Distributed File System is a distributed, portable file system written in Java, so it requires a Java Runtime Environment. A small Hadoop cluster consists of a single master node and multiple worker nodes. The master node runs the job tracker and a task tracker, which serve MapReduce, together with the name node and a data node. In Hadoop 2 and later, YARN takes over the resource management and scheduling duties of the job tracker and task tracker.

The name node is the primary node and the data nodes are its workers; the two kinds of node communicate with each other.

  1. Name Node- The name node manages and tracks files. It keeps the metadata about the stored data, such as the number of blocks, the nodes where the data is stored, and the location of the replicas, and it is the point of contact for clients.
  2. Data Nodes- Since HDFS exists to store data, the data nodes do the work of keeping it in blocks. Each data node sends a heartbeat message to the name node every 3 seconds. If the heartbeats stop for long enough (about ten minutes with the default settings), the name node considers that data node dead and re-replicates its blocks on other data nodes.
  3. Secondary Name Node- It acts as a checkpoint helper, not as a standby. It periodically fetches the name node’s edit log, merges it into the file system image, and stores the compacted result.
  4. Job Tracker- It receives requests for MapReduce execution from clients, communicates with the name node, and uses the metadata to find where the required data is stored.
  5. Task Tracker- It is subordinate to the job tracker and takes its orders from it. It receives code from the job tracker and applies it to the required files, a process known as the Mapper.
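
The heartbeat mechanism described above can be sketched as follows. This is a simplified illustration, not Hadoop’s implementation: the HeartbeatMonitor class and its method names are hypothetical, timestamps are passed in explicitly so the example is deterministic, and the 630-second timeout is an assumption modelled on HDFS’s default of roughly ten and a half minutes.

```python
HEARTBEAT_INTERVAL_S = 3.0  # from the text: data nodes report every 3 seconds
DEAD_AFTER_S = 630.0        # assumption: approximates the default HDFS timeout


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each data node (hypothetical helper)."""

    def __init__(self, dead_after_s=DEAD_AFTER_S):
        self.dead_after_s = dead_after_s
        self.last_seen = {}  # node id -> time of the most recent heartbeat

    def record_heartbeat(self, node_id, now):
        """Called whenever a data node reports in."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.dead_after_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor()
    monitor.record_heartbeat("dn1", now=0.0)
    monitor.record_heartbeat("dn2", now=0.0)
    # dn1 keeps reporting; dn2 goes silent after its first heartbeat.
    monitor.record_heartbeat("dn1", now=699.0)
    print(monitor.dead_nodes(now=700.0))
```

In this run only dn2 has been silent for longer than the timeout, so it is the only node reported as dead; a real name node would then schedule new replicas of the blocks that dn2 held.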

Features of Apache Hadoop:-

Apache Hadoop has been evolving continuously and gaining popularity because of features such as:-

  1. Fault-tolerant and highly available- HDFS replicates each piece of data in two other locations, so even if a node fails, no data is lost, and the MapReduce engine can rerun the affected work on a node that holds another copy.
  2. Data reliability- The data is stored and replicated in two to three locations, so it stays available after a system failure. Hadoop also provides disk checker and block scanner facilities, which detect corrupted data so that it can be served from a healthy copy.
  3. Robust ecosystem- Hadoop comes with a vast ecosystem that covers clients’ analytical requirements. It includes tools for all types of large-scale data processing, such as MapReduce and YARN, and more are added over time.
  4. Open source- A striking feature of Hadoop is that it is open source, which means its source code can be modified to suit a project’s needs.
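
The replication behind the fault-tolerance and reliability features can be illustrated with a small placement sketch. It assumes a replication factor of three and follows the general idea of HDFS’s default policy (one copy on the writer’s node, the other two on a different rack), but it is deliberately deterministic and simplified, whereas HDFS chooses nodes randomly and weighs load. The function name place_replicas and the topology layout are hypothetical.

```python
def place_replicas(writer_node, topology):
    """Choose up to three nodes for one block.

    topology maps a rack name to the list of node names on that rack
    (a hypothetical layout used only for this illustration).
    """
    local_rack = None
    for rack, nodes in topology.items():
        if writer_node in nodes:
            local_rack = rack
            break
    if local_rack is None:
        raise ValueError(f"unknown node: {writer_node}")

    # First replica: the node that is writing the data.
    replicas = [writer_node]

    # Second and third replicas: two nodes on one other rack, so that the
    # loss of the writer's rack still leaves copies elsewhere.
    for rack in sorted(topology):
        if rack != local_rack and topology[rack]:
            replicas.extend(topology[rack][:2])
            break
    return replicas


if __name__ == "__main__":
    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas("n1", cluster))
```

With this layout, a block written from n1 also lands on n3 and n4, so a power outage or switch failure on rack1 does not make the block unavailable.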

Positive aspects of Hadoop:-

  • Scalable- Hadoop is scalable because it stores and distributes large-scale data across many servers operating in parallel.
  • Cost-effective- Older systems were too costly for storing and processing large volumes of data, so organizations down-sampled their data and deleted the raw data. With Hadoop, massive data can be stored and processed cost-effectively.
  • Failure resistant- Hadoop is highly fault resistant because it stores data in several places, which prevents data loss and keeps a backup copy ready in case of hardware failure.
  • Fast- HDFS keeps track of where data is located, and processing usually runs on the same servers that store the data, which keeps it fast even when the volume is large.

Drawbacks of Apache Hadoop:-

  1. Small files- HDFS is designed to work with large files, so it handles files smaller than the default HDFS block size of 128 MB inefficiently. A large number of small files is therefore harder to deal with than a small number of large files. In this case, the small files can be merged into larger ones.
  2. Real-time processing- Hadoop is designed for batch processing: it takes in a lot of data, processes it, and then produces the result. Depending on the size of the data, the output is delayed, because there is no real-time processing.
  3. Security- Hadoop can be a security concern for sensitive applications because encryption at the storage and network levels is not enabled by default. Its Kerberos authentication is also hard to manage.
  4. Language- Hadoop is written in Java, a widely used language that cybercriminals have frequently targeted.
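
The small-files drawback can be made concrete with some rough arithmetic. The name node keeps an in-memory record for every file and every block, so the record count grows with the number of files, not with the amount of data. The sketch below assumes about 150 bytes per record, which is a commonly quoted rule of thumb and not a measured value; the function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # default HDFS block size named in the text
BYTES_PER_OBJECT = 150   # assumption: rule-of-thumb size of one name node record


def namenode_objects(num_files, file_size_mb):
    """Count name node records: one per file plus one per block of that file."""
    blocks_per_file = max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))
    return num_files * (1 + blocks_per_file)


def estimated_memory_mb(num_files, file_size_mb):
    """Estimate name node memory in megabytes under the assumption above."""
    return namenode_objects(num_files, file_size_mb) * BYTES_PER_OBJECT / 1_000_000


if __name__ == "__main__":
    # Roughly 1 GB of data stored in two different ways.
    print(namenode_objects(10_000, 0.1), estimated_memory_mb(10_000, 0.1))
    print(namenode_objects(8, 128), estimated_memory_mb(8, 128))
```

Under these assumptions, 10,000 files of 0.1 MB need 20,000 records, while 8 files of 128 MB holding about the same amount of data need only 16. This is why merging small files into larger ones helps.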

Apache Hadoop- a larger vision:-

Hadoop has been put to prominent use many times, starting with Yahoo in February 2008. The Yahoo Search Webmap, which Yahoo claimed was the largest Hadoop production application at the time, ran on a Linux cluster with more than 10,000 cores and produced data used in every Yahoo search. In 2010, Facebook claimed to have the largest Hadoop cluster, with 21 PB of storage; this later grew to 100 PB and kept growing by roughly half a PB per day. By 2013, Apache Hadoop was in use at more than half of the Fortune 50 companies, thus serving its purpose.

