In this post we’ll see how to install Hadoop on a single node cluster in pseudo-distributed mode.
The steps shown here were performed on Ubuntu 16.04 and the Hadoop version used is 2.9.0.
Modes in Hadoop
Before starting the Hadoop installation, let's have a look at the modes supported for running Hadoop.
- Local (Standalone) Mode– This is the default mode for Hadoop, where Hadoop runs in a non-distributed fashion as a single Java process. This mode is useful for debugging.
- Pseudo-Distributed Mode– Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process but all on the same node.
- Fully-Distributed Mode - In fully-distributed mode Hadoop runs on clusters ranging from a few nodes to extremely large clusters with thousands of nodes.
Required Software
For Hadoop installation the following software is required-
- Java must be installed. To check the compatible version of Java for the Hadoop version you are installing, refer to https://wiki.apache.org/hadoop/HadoopJavaVersions.
- ssh must be installed and sshd must be running, since ssh is used to manage the Hadoop daemons that run as separate Java processes. A quick check for both requirements is shown after this list.
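A quick way to verify both requirements on Ubuntu is shown below; this is just a convenience check, and it assumes the OpenSSH service is named ssh as it is on Ubuntu 16.04 -
java -version
ssh -V
sudo service ssh status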
Steps for Hadoop installation
Steps to install Hadoop on a single node cluster in pseudo-distributed mode are as follows-
1- Check for Java installation– As already stated, Java is required for running Hadoop, so ensure that Java is installed and that its version is compatible with your Hadoop version.
- Refer to How to Install Java on Ubuntu if you still need to install it.
2- Downloading the Hadoop tarball and unpacking it– You can download a stable release of Hadoop from this location - http://hadoop.apache.org/releases.html
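For example, you can fetch the 2.9.0 tarball from a terminal with wget; the exact URL below is one common mirror location and may differ from the mirror the releases page suggests for you -
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz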
The downloaded tarball will be of the form hadoop-x.y.z.tar.gz, so you need to unpack it. For that do the following -
2.1 Create a new directory– Create a new directory and move the Hadoop tarball there.
sudo mkdir /usr/local/hadoop
2.2 Move the Hadoop tarball and untar it– Move the Hadoop tarball from the Downloads directory to /usr/local/hadoop and unpack it.
Change directory to Downloads and run the following command from there.
sudo cp hadoop-2.9.0.tar.gz /usr/local/hadoop
Make sure to use the correct Hadoop version in your command.
Now you have the Hadoop tarball in the newly created directory /usr/local/hadoop.
To unpack it run the following command after changing directory to /usr/local/hadoop.
tar zxvf hadoop-2.9.0.tar.gz
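As a quick sanity check after unpacking, the extracted directory should contain the bin, sbin, etc and share folders -
ls /usr/local/hadoop/hadoop-2.9.0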
3- Installing and setting up passphraseless ssh - Hadoop uses SSH to remotely log in to nodes. Even in the case of a single node cluster the daemons run as separate Java processes, so we still need to install and configure SSH. For a single node the host will be localhost.
To install ssh run the following command –
sudo apt-get install ssh
Every time you try to connect to localhost you will be asked for a passphrase; to avoid that, generate a key with an empty passphrase so that you are not prompted each time.
Command to generate a key with an empty passphrase-
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
You also need to add the generated key to the list of authorized keys; run the following command to do that.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
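Depending on how the ~/.ssh directory was created, you may also need to tighten the permissions on the authorized_keys file, otherwise sshd can refuse to use it -
chmod 0600 ~/.ssh/authorized_keys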
Try connecting now using the command ssh localhost; you should not be asked for a passphrase if everything is configured correctly.
4- Setting environment variables and paths- In order to run Hadoop, it needs to know the location of the Java installation on your system. For that you have to set the JAVA_HOME variable. You can set it in the ~/.bashrc file or in the etc/hadoop/hadoop-env.sh file which resides in your Hadoop installation. If you are adding it to the hadoop-env.sh file, edit that file and add the following line at the end.
export JAVA_HOME=/usr/local/java/jdk1.8.0_151
Here /usr/local/java/jdk1.8.0_151 should be replaced with the path to your own Java installation.
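If you are not sure where Java is installed, one way to find out on Ubuntu (assuming java is already on your PATH) is -
readlink -f $(which java)
The JAVA_HOME value is that result with the trailing /bin/java (or /jre/bin/java) part removed.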
You can also add a HADOOP_HOME environment variable pointing to your Hadoop installation, and add its bin and sbin directories to the PATH too. That will let you run Hadoop commands from anywhere. To add them to /etc/environment run the following commands.
Edit /etc/environment file
sudo gedit /etc/environment
Add HADOOP_HOME variable at the end of the file.
HADOOP_HOME="/usr/local/hadoop/hadoop-2.9.0"
Add the following to the existing PATH variable -
:/usr/local/hadoop/hadoop-2.9.0/bin:/usr/local/hadoop/hadoop-2.9.0/sbin
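As an illustration, after both edits the relevant lines of /etc/environment might look like this; the existing PATH entries shown here are just Ubuntu defaults and yours may differ -
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/hadoop/hadoop-2.9.0/bin:/usr/local/hadoop/hadoop-2.9.0/sbin"
HADOOP_HOME="/usr/local/hadoop/hadoop-2.9.0"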
To reload the environment file run the following command-
source /etc/environment
Run the hadoop version command to ensure everything is configured properly. If there has been no problem so far, running the command should give you the Hadoop version information.
$ hadoop version
Hadoop 2.9.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0D8ebc8394f483xy25feac05fu478f6d612e6c50
Compiled by jjohn on 2017-11-15T22:16Z
Compiled with protoc 2.5.0
From source with checksum 0b71a9c67a5227390741f8d5931y175
This command was run using /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar
5- Setting configuration files - You need to change XML files placed inside the etc/hadoop directory within your Hadoop installation folder. The XML files that need to be changed, and the required changes, are listed here.
etc/hadoop/core-site.xml
You can override the default settings used to start Hadoop by changing this file.
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
The directory you choose for the hadoop.tmp.dir parameter has to be created by you, and your user needs read and write permission on it. If you don't set the hadoop.tmp.dir property, the Hadoop framework will create a tmp directory at its default location.
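For the /usr/tmp value used above, the directory could be created and handed over to your login user like this (adjust the path and the owner to your own setup) -
sudo mkdir -p /usr/tmp
sudo chown $USER /usr/tmp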
etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
For YARN settings to run MapReduce jobs-
etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
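Note that in Hadoop 2.x tarballs this file is usually shipped only as mapred-site.xml.template. If that is the case in your installation, copy it first (from the Hadoop installation directory) and then add the property shown above -
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml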
etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
6- Format HDFS file system– You also need to format the HDFS filesystem once. Run the following command to do that.
hdfs namenode -format
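If formatting succeeds, the output should include a line similar to the following; the exact path depends on your hadoop.tmp.dir setting -
Storage directory /usr/tmp/dfs/name has been successfully formatted.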
7- Starting daemons– Start the HDFS and YARN daemons by executing the following shell scripts -
start-dfs.sh
start-yarn.sh
You will find these shell scripts in the sbin directory within your Hadoop installation.
Use the jps command to verify that all the daemons are running.
$ jps
6294 NodeManager
6168 ResourceManager
6648 Jps
5997 SecondaryNameNode
5758 DataNode
5631 NameNode
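If any of these daemons is missing, check its log file under the logs directory of your Hadoop installation, for example -
ls /usr/local/hadoop/hadoop-2.9.0/logs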
To stop the daemons use the following shell scripts.
stop-dfs.sh
stop-yarn.sh
8- Browse the web interfaces– You can also check the web interfaces for the NameNode and the YARN ResourceManager after the daemons are started.
NameNode– http://localhost:50070/
ResourceManager– http://localhost:8088/
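As a final smoke test, you can create a directory in HDFS and list it (a minimal check, assuming the daemons started in the previous step are still running) -
hdfs dfs -mkdir -p /user
hdfs dfs -ls /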
That's all for this topic Installing Hadoop on a Single Node Cluster in Pseudo-Distributed Mode. If you have any doubt or any suggestions to make please drop a comment. Thanks!