Your First MapReduce App with IntelliJ IDEA and Docker Integration

 Apr 24, 2017 · BigData, Cloudera, DevOps, Docker, Hadoop, MapReduce


Please note that this article is a continuation of Hadoop Inside Docker, the easiest way in 5 minutes.

 

First, open IntelliJ IDEA and create a new Maven project via File | New | Maven, then specify the GroupId, ArtifactId, and project name in the following steps (e.g. GroupId=com.example, ArtifactId=my-analysis-project, ProjectName=AnalysisProject).

HINT: Choose where to save your project carefully: it should live inside the volume you mounted when you spun up the container earlier. For example, if you ran docker run .... -v /Users/msoliman/IdeaProjects:/home/cloudera/projects ... then save your project under /Users/msoliman/IdeaProjects, so that the jar files IntelliJ generates are accessible inside the Docker container, where the MapReduce framework will execute them. Note also that "/home/cloudera/projects" is created for you inside the container if it does not already exist.
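For reference, a full invocation might look like the following sketch, assuming the cloudera/quickstart image and its start script from the earlier article (adjust the host path to your own machine):

```shell
docker run --hostname=quickstart.cloudera --privileged=true -t -i \
  -v /Users/msoliman/IdeaProjects:/home/cloudera/projects \
  cloudera/quickstart /usr/bin/docker-quickstart
```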

 

Then edit the pom.xml file in the project root and add the Hadoop dependencies the project needs, making sure to click "Import Changes" in the bottom-right notification.
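The dependency list itself did not survive in this text; a minimal sketch that pulls in everything the code below imports is the hadoop-client umbrella artifact. The version here is an assumption: match it to the Hadoop version inside your container (run hadoop version there), e.g. 2.6.0 for the Cloudera quickstart image.

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>
```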

Now let's make sure the JDK versions are consistent, since mixing incompatible JDK versions is a common source of problems. Open File | Preferences and search for Maven; confirm that the same JDK version is selected in both the "Importing" and "Runner" tabs. Also, under File | Project Structure, make sure the project SDK matches the version you selected for Maven.


Let's define an artifact for this project; this is what generates the jar file that will later be executed on the Hadoop node as a MapReduce job. Open File | Project Structure | Artifacts, choose Add | JAR | From modules with dependencies..., and accept the defaults; the artifact is named after your project by default (e.g. "Analysis:jar").

Now let's create the word count example, which is considered the "hello world" of Hadoop. Create a package under "src/main/java" named "WordCount", then add a new Java class named "Main.java". It should look like the following:

package WordCount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * Created by msoliman on 4/18/17.
 */
public class Main {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(Main.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
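Before packaging, it can help to sanity-check what Map and Reduce compute together. The following standalone class (a hypothetical helper, not part of the project above) reproduces the same tokenize-and-sum logic in plain Java, with no Hadoop dependency:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountLocal {

    // Tokenize on whitespace and count occurrences: the mapper emits
    // (word, 1) pairs and the reducer sums them per word, which is
    // equivalent to this single in-memory pass.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String doc = "this is a simple document having simple text to count how many words it has";
        // Prints each word and its count, tab-separated, like a TextOutputFormat part file.
        countWords(doc).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Running it on the sample sentence used later in this article shows, for example, that "simple" appears twice and every other word once.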

Now let’s build our artifact from Build | Build Artifacts...

Since we mounted our project folder inside /home/cloudera/projects/, run the following inside the container, keeping in mind that you must replace "Analysis" with your own project name if it differs from mine.

# cd /home/cloudera/projects/Analysis/out/artifacts/Analysis_jar
# echo "this is a simple document having simple text to count how many words it has" > test
# su hdfs
# hdfs dfs -mkdir -p /user/cloudera/input/wc
# hdfs dfs -mkdir -p /user/cloudera/output/wc
# hdfs dfs -copyFromLocal test /user/cloudera/input/wc/test
# hadoop jar Analysis.jar WordCount.Main /user/cloudera/input/wc /user/cloudera/output/wc/1
# hdfs dfs -cat /user/cloudera/output/wc/1/part-r-00000

 

