Installing Hadoop on a Single Node Cluster in Pseudo-Distributed Mode
In this post we’ll see how to install Hadoop on a single node cluster in pseudo-distributed mode. The steps shown here were done on Ubuntu 16.04 with Hadoop 2.9.0. Modes in Hadoop: Before...
What is Big Data
Big Data, as the name suggests, is data so huge, complex, and ever-growing that conventional technologies cannot store or process it. Examples of Big Data: Some of the examples of...
Introduction to Hadoop Framework
The post What is Big Data already discussed the two challenges that such huge data poses: how to store it and how to process it. This post gives an introduction to...
What is HDFS
When you store a file, it is divided into blocks of fixed size. In a local file system these blocks are stored on a single system. In a distributed file system these blocks of the file are stored...
NameNode, DataNode And Secondary NameNode in HDFS
HDFS has a master/slave architecture. Within an HDFS cluster there is a single NameNode and a number of DataNodes, usually one per node in the cluster. In this post we'll see in detail what NameNode...
Replica Placement Policy in Hadoop Framework
HDFS, as the name says, is a distributed file system designed to store large files. A large file is divided into blocks of a defined size, and these blocks are stored across machines in a cluster....
HDFS Federation in Hadoop Framework
In this post we’ll talk about the HDFS Federation feature introduced in Hadoop 2.x versions. With HDFS federation we can have more than one NameNode in the Hadoop cluster, each managing a part of the...
What is SafeMode in Hadoop
When the NameNode starts in a Hadoop cluster, it performs the following tasks. The NameNode reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory...
HDFS High Availability
This post gives an overview of HDFS High Availability (HA), why it is required, and how it can be managed. Problem with single NameNode: To guard against the vulnerability of having a...
File Read in HDFS - Hadoop Framework Internal Steps
In this post we’ll see what happens internally within the Hadoop framework when a file is read in HDFS. Reading file in HDFS: Within the Hadoop framework it is the DFSClient class which communicates...
File Write in HDFS - Hadoop Framework Internal Steps
In this post we’ll see what happens internally within the Hadoop framework when a file is written in HDFS. Writing file in HDFS: When a client application wants to create a file in HDFS, it calls...
Java Program to Read File in HDFS
In this post we’ll see a Java program to read a file in HDFS. You can read a file in HDFS in two ways: Create an object of FSDataInputStream and use that object to read data from the file. You can use...
HDFS Commands Reference List
In this post I have compiled a list of some frequently used HDFS commands along with examples. Note that you can use either hadoop fs -<command> or hdfs dfs -<command>. The difference...
Java Program to Write File in HDFS
In this post we’ll see a Java program to write a file in HDFS. You can write a file in HDFS in two ways: Create an object of FSDataOutputStream and use that object to write data to the file. You can use...
How MapReduce Works in Hadoop
The post Word Count MapReduce Program in Hadoop already shows a word count MapReduce program written in Java. In this post, using that program as a reference, we’ll see how MapReduce works in Hadoop...
Word Count MapReduce Program in Hadoop
The first MapReduce program most people write after installing Hadoop is invariably the word count MapReduce program. That’s what this post shows: writing a word count MapReduce program in Java...
YARN in Hadoop
YARN (Yet Another Resource Negotiator) is the cluster resource management and job scheduling layer of Hadoop. YARN was introduced in Hadoop 2.x to address the scalability issues in MRv1. It also...
Fair Scheduler in YARN
In the post YARN in Hadoop we have already seen that it is the scheduler component of the ResourceManager which is responsible for allocating resources to the running jobs. The scheduler component is...
Capacity Scheduler in YARN
In the post YARN in Hadoop we have already seen that it is the scheduler component of the ResourceManager which is responsible for allocating resources to the running jobs. The scheduler component is...
Uber Mode in Hadoop
When a MapReduce job is submitted, the ResourceManager launches the ApplicationMaster process (for MapReduce the ApplicationMaster is MRAppMaster) in a container. The ApplicationMaster then retrieves the...