The first MapReduce program most people write after installing Hadoop is invariably the word count program. That’s what this post shows: how to write a word count MapReduce program in Java, using Eclipse as the IDE.
Creating and copying the input file
If you already have a file in HDFS that you want to use as input, you can skip this step.
Otherwise, the first step is to create a file which will be used as input and copy it to HDFS.
Let’s say you have a file wordcount.txt with the following content.
Hello wordcount MapReduce Hadoop program.
This is my first MapReduce program.
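If you need to create this file locally first, one quick way (assuming the local path /netjs/MapReduce used in the copy command below) is a shell heredoc:

cat > /netjs/MapReduce/wordcount.txt << 'EOF'
Hello wordcount MapReduce Hadoop program.
This is my first MapReduce program.
EOF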
You want to copy this file to the /user/process directory within HDFS. If that path doesn’t exist, you need to create the directory first.
hdfs dfs -mkdir -p /user/process
- Refer HDFS Commands Reference List for HDFS commands.
Then copy the file wordcount.txt to this directory.
hdfs dfs -put /netjs/MapReduce/wordcount.txt /user/process
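You can verify the copy by listing the target directory:

hdfs dfs -ls /user/process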
Word count example MapReduce code
Now you can write your word count MapReduce code. The WordCount example reads text files and counts the frequency of the words. Each mapper takes a line of the input file as input and breaks it into words, then emits a key/value pair of the form (word, 1) for each word. Each reducer sums the counts for each word and emits a single key/value pair with the word and its total.
- Refer How MapReduce Works in Hadoop to see in detail how data is processed as (key, value) pairs in map and reduce tasks.
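For the two-line sample file above, the (key, value) flow looks like this (the framework sorts and groups the map output by key before it reaches the reducer):

Map output:     (Hello, 1), (wordcount, 1), (MapReduce, 1), (Hadoop, 1), (program., 1),
                (This, 1), (is, 1), (my, 1), (first, 1), (MapReduce, 1), (program., 1)
Reduce input:   (Hadoop, [1]), (Hello, [1]), (MapReduce, [1, 1]), (This, [1]), (first, [1]),
                (is, [1]), (my, [1]), (program., [1, 1]), (wordcount, [1])
Reduce output:  (Hadoop, 1), (Hello, 1), (MapReduce, 2), (This, 1), (first, 1),
                (is, 1), (my, 1), (program., 2), (wordcount, 1)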
In the code there is a Mapper class (MyMapper) with a map() method and a Reducer class (MyReducer) with a reduce() method. Note that the class lives in the org.netjs package, which matters when you run the jar later.
package org.netjs;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map function - called once per line of input; the key is the byte
    // offset of the line in the file, the value is the line itself
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Splitting the line on whitespace
            String[] stringArr = value.toString().split("\\s+");
            for (String str : stringArr) {
                word.set(str);
                context.write(word, one);
            }
        }
    }

    // Reduce function - called once per word, with all the 1s emitted for it
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WC");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // First argument is the input path, second is the output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
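Since the reduce logic here (summing counts) is associative and commutative, the same reducer class can also be set as a combiner to cut down the data shuffled from the map tasks. This is not in the listing above, but a single extra line in the driver enables it:

// Optional: pre-aggregate counts on the map side using the same reducer
job.setCombinerClass(MyReducer.class);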
You will also need to add at least the following Hadoop jars so that your code compiles. You will find these jars inside the /share/hadoop directory of your Hadoop installation. Within the /share/hadoop path, look in the hdfs, mapreduce and common directories for the required jars. (A Maven alternative is shown after the list.)
hadoop-common-2.9.0.jar
hadoop-hdfs-2.9.0.jar
hadoop-hdfs-client-2.9.0.jar
hadoop-mapreduce-client-core-2.9.0.jar
hadoop-mapreduce-client-common-2.9.0.jar
hadoop-mapreduce-client-jobclient-2.9.0.jar
hadoop-mapreduce-client-hs-2.9.0.jar
hadoop-mapreduce-client-app-2.9.0.jar
commons-io-2.4.jar
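If you build with Maven rather than adding jars by hand, a single hadoop-client dependency for the same release should pull in an equivalent set of jars transitively:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.9.0</version>
</dependency>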
Creating a jar of your MapReduce code
Once you are able to compile your code you need to create a jar file. In the Eclipse IDE, right click on your Java project and select Export – Java – JAR file.
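If you prefer the command line to Eclipse, a rough equivalent (assuming the hadoop command is on your PATH, and a hypothetical source layout with WordCount.java under src/org/netjs) is:

# Compile against the jars reported by 'hadoop classpath'
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes src/org/netjs/WordCount.java
# Package the compiled classes into a jar
jar cf wordcount.jar -C classes .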
Running the code
You can use the following command to run the program, assuming you are in your Hadoop installation directory.
bin/hadoop jar /netjs/MapReduce/wordcount.jar org.netjs.WordCount /user/process /user/out
/netjs/MapReduce/wordcount.jar – path to your jar file.
org.netjs.WordCount – fully qualified name of your Java class.
/user/process – path to the input directory.
/user/out – path to the output directory.
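Note that the output directory must not already exist when the job starts; if it does, Hadoop fails the job with a FileAlreadyExistsException. To rerun the job, remove the old output first:

hdfs dfs -rm -r /user/out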
Once your word count MapReduce program is successfully executed you can verify the output files.
hdfs dfs -ls /user/out
Found 2 items
-rw-r--r-- 1 netjs supergroup 0 2018-02-27 13:37 /user/out/_SUCCESS
-rw-r--r-- 1 netjs supergroup 77 2018-02-27 13:37 /user/out/part-r-00000
As you can see, the Hadoop framework creates output files using the part-r-xxxxx naming format. Since only one reducer is used here, there is only one output file, part-r-00000. You can see the content of the file using the following command.
hdfs dfs -cat /user/out/part-r-00000
Hadoop 1
Hello 1
MapReduce 2
This 1
first 1
is 1
my 1
program. 2
wordcount 1
That's all for this topic Word Count MapReduce Program in Hadoop. If you have any doubts or suggestions, please drop a comment. Thanks!