MapReduce program to display the intersection of two files

Question

I need a MapReduce program that takes two files as input and outputs the set of words that appear in both files (the intersection of the two files).

I tried it like this:

Map function: takes a file as input and emits (word, 1) as output. I got this output in a file named part-r-00000. I ran this step for both files, so I now have two part-r-00000 files.

How can I give these files to the reduce function as input?

Please also give me some suggestions on how to write the reduce function for the intersection of two files.

This is the word count example program:

    package org.apache.hadoop.examples;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    //import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCountMap {

      public static class TokenizerMapper 
           extends Mapper<Object, Text, Text, IntWritable>{

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

     /* public static class IntSumReducer 
           extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, 
                           Context context
                           ) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      } */

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
       // job.setCombinerClass(IntSumReducer.class);
       // job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The Reducer class is commented out, along with every line that refers to it, but I still got a part-r-00000 file. The output is:

    Hai 1
    This 1
    an 1
    are 1
    are 1
    check 1
    example 1
    example 1
    example 1
    fair 1
    file 1
    ganesh 1
    hadoop 1
    how 1
    hpw 1
    is 1
    is 1
    is 1
    map 1
    not 1
    only 1
    program. 1
    reduce 1
    so 1
    this 1
    this 1
    to 1
    you 1
    you 1

The `part-r-00000` files come from the reducer itself. Maybe your map function should attach a flag to each word to distinguish data from the first file from data from the second. The reducer can then use the flag to compare the values (this logic seems to have a lot of redundancy) and `write` only the words that are present in both. - Suvarna Pattayil, Nov 18 '13 at 07:17
If you don't specify a Reducer, the job uses the default identity reducer, which takes the data from the mapper and writes it out as is. Since you are using the new API, see [this](http://stackoverflow.com/questions/9746932/identityreducer-in-the-new-hadoop-api). - Suvarna Pattayil, Nov 19 '13 at 09:32
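
The flag idea from the first comment can be sketched in plain Java to show the logic on its own. This is only a simulation under stated assumptions, not a Hadoop job: the class `IntersectionSketch`, its methods `map`, `shuffle`, `reduce` and `intersect`, and the tags `"A"` and `"B"` are all hypothetical names made up for illustration, and no Hadoop classes are used. The map step emits (word, tag) instead of (word, 1), the shuffle step groups tags by word the way the framework would, and the reduce step keeps a word only when both tags are present.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java simulation of the tag-and-compare idea (hypothetical names, no Hadoop).
class IntersectionSketch {

    // "Map" step: emit (word, tag) for every token; the tag records the source file.
    static List<String[]> map(String fileTag, String contents) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(contents);
        while (itr.hasMoreTokens()) {
            pairs.add(new String[] { itr.nextToken(), fileTag });
        }
        return pairs;
    }

    // "Shuffle" step: collect the distinct tags seen for each word, sorted by word.
    static Map<String, Set<String>> shuffle(List<String[]> pairs) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new HashSet<>()).add(pair[1]);
        }
        return grouped;
    }

    // "Reduce" step: keep a word only if it carries both tags.
    // Assumption: exactly two input files, so two distinct tags means "in both".
    static List<String> reduce(Map<String, Set<String>> grouped) {
        List<String> common = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : grouped.entrySet()) {
            if (entry.getValue().size() == 2) {
                common.add(entry.getKey());
            }
        }
        return common;
    }

    // Runs the three steps over the contents of two files.
    static List<String> intersect(String fileA, String fileB) {
        List<String[]> pairs = new ArrayList<>(map("A", fileA));
        pairs.addAll(map("B", fileB));
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // Prints [is, this]
        System.out.println(intersect("this is an example", "is this fair"));
    }
}
```

In a real job the same shape would probably mean one job that reads both original files (for example by calling `FileInputFormat.addInputPath` once per file) rather than feeding the two part-r-00000 outputs to a reducer. The mapper could take its tag from the input file name, and the reducer would receive the tags for each word as its values. Treat that mapping as an assumption to verify against your Hadoop version.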

score 0 · Answer 1 · edited Mar 30 '15 at 15:10

You should call job.setNumReduceTasks(0); in the driver code so that part-r-00000 is not created.

I have tested this. With job.setNumReduceTasks(0); and no reducer logic, part-m-00000 was generated. Without job.setNumReduceTasks(0); and no reducer logic, part-r-00000 was generated.

Add the line above, try it, and confirm.

edited Mar 30 '15 at 15:10

coderz

answered Jan 25 '14 at 14:52

Srinivas
