Quantcast
Channel: Tech Tutorials
Viewing all articles
Browse latest Browse all 862

Word Count MapReduce Program in Hadoop

$
0
0

The first MapReduce program most of the people write after installing Hadoop is invariably the word count MapReduce program.

That’s what this post shows, writing word count MapReduce program in Java programming language, IDE used is Eclipse.

Creating and copying file

If you already have a file in HDFS which you want to use as input then you can skip this step.

First thing is to create a file which will be used as input and copy it to HDFS.

Let’s say you have a file wordcount.txt with the following content.


Hello wordcount MapReduce Hadoop program.
This is my first MapReduce program.

You want to copy this file to /user/process directory with in HDFS. If that path doesn’t exist then you need to create those directories first.


hdfs dfs -mkdir -p /user/process

Then copy the file wordcount.txt to this directory.


hdfs dfs -put /netjs/MapReduce/wordcount.txt /user/process

Word count example MapReduce code

Now you can write your wordcount MapReduce code. WordCount example reads text files and counts the frequency of the words. Each mapper takes a line of the input file as input and breaks it into words. It then emits a key/value pair of the word (In the form of (word, 1)) and each reducer sums the counts for each word and emits a single key/value with the word and sum.

In the code there is a Mapper class (MyMapper) with map function and a Reducer class (MyReducer) with a reduce function.


import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
// Map function
public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
private Text word = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// Splitting the line on spaces
String[] stringArr = value.toString().split("\\s+");
for (String str : stringArr) {
word.set(str);
context.write(word, new IntWritable(1));
}
}
}

// Reduce function
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception{
Configuration conf = new Configuration();

Job job = Job.getInstance(conf, "WC");
job.setJarByClass(WordCount.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

You will also need to add at least the following Hadoop jars so that your code can compile. You will find these jars inside the /share/hadoop directory of your Hadoop installation. With in /share/hadoop path look in hdfs, mapreduce and common directories for required jars.


hadoop-common-2.9.0.jar
hadoop-hdfs-2.9.0.jar
hadoop-hdfs-client-2.9.0.jar
hadoop-mapreduce-client-core-2.9.0.jar
hadoop-mapreduce-client-common-2.9.0.jar
hadoop-mapreduce-client-jobclient-2.9.0.jar
hadoop-mapreduce-client-hs-2.9.0.jar
hadoop-mapreduce-client-app-2.9.0.jar
commons-io-2.4.jar

Creating jar of your MapReduce code

Once you are able to compile your code you need to create jar file. In the eclipse IDE righ click on your Java program and select Export – Java – jar file.

Running the code

You can use the following command to run the program. Assuming you are in your hadoop installation directory.


bin/hadoop jar /netjs/MapReduce/wordcount.jar org.netjs.WordCount /user/process /user/out

/netjs/MapReduce/wordcount.jar is the path to your jar file.

org.netjs.WordCount is the fully qualified path to your Java program class.

/user/process– path to input directory.

/user/out– path to output directory.

One your word count MapReduce program is succesfully executed you can verify the output file.


hdfs dfs -ls /user/out

Found 2 items
-rw-r--r-- 1 netjs supergroup 0 2018-02-27 13:37 /user/out/_SUCCESS
-rw-r--r-- 1 netjs supergroup 77 2018-02-27 13:37 /user/out/part-r-00000

As you can see Hadoop framework creates output files using part-r-xxxx format. Since only one reducer is used here so there is only one output file part-r-00000. You can see the content of the file using the following command.


hdfs dfs -cat /user/out/part-r-00000

Hadoop 1
Hello 1
MapReduce 2
This 1
first 1
is 1
my 1
program. 2
wordcount 1

That's all for this topic Word Count MapReduce Program in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Introduction to Hadoop Framework
  2. MapReduce Flow in YARN
  3. Speculative Execution in Hadoop
  4. What is HDFS
  5. How to Compress Intermediate Map Output in Hadoop

You may also like -

>>>Go to Hadoop Framework Page


Viewing all articles
Browse latest Browse all 862

Trending Articles