Installing Hadoop on a Single Node Cluster in Pseudo-Distributed Mode

In this post we’ll see how to install Hadoop on a single node cluster in pseudo-distributed mode.

The steps shown here were performed on Ubuntu 16.04, and the Hadoop version used is 2.9.0.

Modes in Hadoop

Before starting the installation of Hadoop, let's have a look at the modes in which Hadoop can run.

  1. Local (Standalone) Mode– This is the default configuration mode, in which Hadoop runs in a non-distributed way as a single Java process. This mode is useful for debugging.
  2. Pseudo-Distributed Mode - Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process but on a single node.
  3. Fully-Distributed Mode - In fully-distributed mode Hadoop runs on clusters ranging from a few nodes to extremely large clusters with thousands of nodes.

Required Software

For Hadoop installation the following software is required-

  1. Java must be installed. To check the compatible Java version for the Hadoop version you are installing, refer to https://wiki.apache.org/hadoop/HadoopJavaVersions.
  2. ssh must be installed and sshd must be running, since SSH is used to manage the Hadoop daemons that run as separate Java processes.

Steps for Hadoop installation

Steps to install Hadoop on a single node cluster in pseudo-distributed mode are as follows-

1- Check for Java installation– As already stated, Java is required for running Hadoop, so ensure that Java is installed and that its version is compatible with the Hadoop version.
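
A quick way to verify this (assuming Java is already on your PATH) is to print the installed version -

java -version

The reported version should be one listed as compatible on the wiki page above, for example Java 8 for Hadoop 2.9.0.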

2- Downloading Hadoop tar ball and unpacking it– You can download the stable version of Hadoop from this location - http://hadoop.apache.org/releases.html
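
If you prefer the command line, one way to fetch the tar ball is with wget (a sketch; the archive URL below follows the usual Apache layout for release 2.9.0 and may differ for other versions or mirrors) -

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz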

The downloaded tar ball will be of the form hadoop-x.x.x.tar.gz (hadoop-2.9.0.tar.gz here), so you need to unpack it. For that, do the following -

2.1 Create a new directory– Create a new directory and move the hadoop tar ball there.

sudo mkdir /usr/local/hadoop

2.2 Copy the Hadoop tar ball and untar it– Copy the downloaded tar ball from the Downloads directory to /usr/local/hadoop and unpack it there.

Change directory to Downloads and run the following command from there.

sudo cp hadoop-2.9.0.tar.gz /usr/local/hadoop

Make sure the file name in the command matches the Hadoop version you downloaded.

Now you have the Hadoop tar ball in the directory you created, /usr/local/hadoop.

To unpack it run the following command after changing directory to /usr/local/hadoop.

 tar zxvf hadoop-2.9.0.tar.gz 
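
Note that /usr/local/hadoop was created with sudo and is owned by root, so the tar command may fail with a permission error. Either run it with sudo or hand the directory over to your user first (a sketch, assuming your login user should own the installation) -

sudo chown -R $USER:$USER /usr/local/hadoop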

3- Installing and setting up passphraseless ssh - Hadoop uses SSH to log in to nodes remotely. Even in the case of a single node cluster the daemons run as separate Java processes, so we still need to install and configure SSH. For a single node the host will be localhost.

To install ssh run the following command –

sudo apt-get install ssh 

Every time you try to connect to localhost you will be asked for a passphrase. To avoid that, generate a key with an empty passphrase so that you are not prompted every time.

Command to generate key with an empty passphrase-

 ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa 

You also need to add the generated key to the list of authorized keys; run the following command to do that.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 

Try connecting now using the command ssh localhost; you should not be asked for a passphrase if everything is configured correctly.
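
A quick test (the very first connection may ask you to confirm the host key, answer yes) -

ssh localhost
exit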

4- Setting environment variables and paths- In order to run Hadoop, it needs to know the location of the Java installation on your system. For that you need to set the JAVA_HOME variable. You can set it in the ~/.bashrc file or in the etc/hadoop/hadoop-env.sh file which resides in your Hadoop installation. If you are adding it in the hadoop-env.sh file, edit that file and add the following line at the end.

export JAVA_HOME=/usr/local/java/jdk1.8.0_151 

Here /usr/local/java/jdk1.8.0_151 should be replaced with the path to your own Java installation.
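
If you are not sure where Java is installed, one way to find out on Ubuntu (a sketch, assuming the java binary is managed through update-alternatives) is to resolve the symlink -

readlink -f /usr/bin/java

JAVA_HOME is then the printed path minus the trailing /bin/java (or /jre/bin/java) part.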

You can also add a HADOOP_HOME environment variable pointing to your Hadoop installation, and add its bin and sbin directories to the PATH too. That will let you run Hadoop commands from anywhere. To add them to /etc/environment run the following commands.

Edit /etc/environment file


sudo gedit /etc/environment

Add HADOOP_HOME variable at the end of the file.

HADOOP_HOME="/usr/local/hadoop/hadoop-2.9.0"

Add the following to the existing PATH variable -

:/usr/local/hadoop/hadoop-2.9.0/bin:/usr/local/hadoop/hadoop-2.9.0/sbin 

To reload the environment file run the following command-

source /etc/environment 
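
Alternatively, the same variables can go in your ~/.bashrc instead (a sketch, assuming a per-user setup is acceptable and the installation path used above) -

export HADOOP_HOME=/usr/local/hadoop/hadoop-2.9.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Reload it with source ~/.bashrc.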

Run the hadoop version command to ensure everything is configured properly. If there have been no problems so far, running the command should give you the Hadoop version information.


$ hadoop version

Hadoop 2.9.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
0D8ebc8394f483xy25feac05fu478f6d612e6c50
Compiled by jjohn on 2017-11-15T22:16Z
Compiled with protoc 2.5.0
From source with checksum 0b71a9c67a5227390741f8d5931y175
This command was run using
/usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar

5- Setting configuration files - You need to change the XML files placed inside the etc/hadoop directory within your Hadoop installation folder. The XML files that are to be changed and the required changes are listed here.

etc/hadoop/core-site.xml

You can override the default settings used to start Hadoop by changing this file.


<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

The directory you choose for the hadoop.tmp.dir parameter has to be created by you, and you need read and write permission on it, as shown below. If you don't set hadoop.tmp.dir, the Hadoop framework will use its default temporary directory (/tmp/hadoop-${user.name}).
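
For example, to create the /usr/tmp directory used above and make it writable (a sketch, assuming the daemons run as your login user) -

sudo mkdir -p /usr/tmp
sudo chown $USER:$USER /usr/tmp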

etc/hadoop/hdfs-site.xml


<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Since this is a single node cluster, dfs.replication is set to 1.

For the YARN settings needed to run MapReduce jobs-

etc/hadoop/mapred-site.xml


<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
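
Note that the Hadoop 2.x distribution ships this file only as a template, so you may need to create it first (a sketch, run from the Hadoop installation directory) -

cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml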

etc/hadoop/yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

6- Format HDFS file system– You also need to format the HDFS filesystem once, before starting the daemons for the first time. Run the following command to do that.

hdfs namenode -format 

7- Starting daemons– Start the HDFS and YARN daemons by executing the following shell scripts -

start-dfs.sh

start-yarn.sh

You will find these shell scripts in the sbin directory within your Hadoop installation.

Use the jps command to verify that all the daemons are running.


$ jps

6294 NodeManager
6168 ResourceManager
6648 Jps
5997 SecondaryNameNode
5758 DataNode
5631 NameNode
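
Once all the daemons show up in jps, a quick HDFS smoke test (a sketch, assuming the Hadoop bin directory is on your PATH) is to create a home directory in HDFS and list the root -

hdfs dfs -mkdir -p /user/$USER
hdfs dfs -ls /

If both commands succeed, HDFS is up and accepting requests.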

To stop the daemons use the following shell scripts.

stop-dfs.sh

stop-yarn.sh

8- Browse the web interface– You can also check the web interfaces for the NameNode and the YARN ResourceManager after the daemons are started.

NameNode– http://localhost:50070/

ResourceManager– http://localhost:8088/

Reference - https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

That's all for this topic Installing Hadoop on a Single Node Cluster in Pseudo-Distributed Mode. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Installing Ubuntu Along With Windows
  2. How to Install Java on Ubuntu
  3. What is Big Data
  4. Introduction to Hadoop Framework
  5. Word Count MapReduce Program in Hadoop

You may also like -

>>>Go to Hadoop Framework Page

