
I am trying to work on a reducer, and the format of the input (key, value) pairs is as follows:

  • key: word
  • value: file=frequency, where "file" is the file which has that word, and "frequency" is the number of times the word appears in the file

The output from the reducer is a (key, value) pair of

  • key: word=file
  • value: tf-idf of that word in that file

The formula requires me to know two things before I can calculate the tf-idf:

  • the number of files that contain the word (i.e. the key)
  • the individual frequency of that word in the file
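Outside Hadoop, the computation those two quantities feed into can be sketched in plain Java (the class name and the counts in `main` are made-up illustration values, not part of the question):

```java
// Toy sketch of the tf-idf formula, independent of Hadoop:
// tf-idf = tf * log10(N / df), where N is the total number of files
// and df is the number of files containing the word.
public class TfIdfSketch {

    public static double tfIdf(int frequency, int numDocs, int docFrequency) {
        double idf = Math.log10((double) numDocs / docFrequency);
        return frequency * idf;
    }

    public static void main(String[] args) {
        // word appears 3 times in one file, and in 2 of 10 files overall
        System.out.println(tfIdf(3, 10, 2)); // 3 * log10(5)
    }
}
```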

It seems that I will have to loop through the values twice: once to count how many files contain the word, and again to compute the tf-idf.

Pseudo-code below:

// calculate tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Note: key is a word, values are in the form of
        // (filename=frequency)

        // sum up the number of files containing a particular word

        // for every filename=frequency in the value, compute tf-idf of this
        // word in filename and output (word@filename, tfidf)
    }
}

I read that it is not possible to loop through the values twice. One alternative might be to use a "cache", which I tried but the results came out wonky.

vefthym
Jon

1 Answer

// calculate tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Note: key is a word, values are in the form of
        // (filename=frequency)
        Map<String, Integer> tfs = new HashMap<>();
        for (Text value: values) {
            String[] valueParts = value.toString().split("="); //Text has no split(); convert to String first
            tfs.put(valueParts[0], Integer.parseInt(valueParts[1])); //do the necessary checks here
        }
        int numDocs = context.getConfiguration().getInt("noOfDocuments", -1); //set this in the Driver's Configuration, if you know it already, or set a counter in the mapper to get it here using getCounter()
        double IDF = Math.log10((double)numDocs/tfs.keySet().size());

        // for every filename=frequency in the value, compute tf-idf of this
        // word in filename and output (word@filename, tfidf)
        for (String file : tfs.keySet()) {
            outputKey.set(key.toString()+"@"+file);
            outputValue.set(String.valueOf(tfs.get(file)*IDF)); //you could also set the outputValue to be a DoubleWritable
            context.write(outputKey, outputValue);
        }
    }
}

In case you define tf as frequency / maxFrequency, you can find the maxFrequency in the first loop and change the outputValue accordingly.
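A sketch of that variant in plain Java, following the answer's framing (the map of file-to-frequency pairs stands in for what the first loop collects; the class name and the values in `main` are illustrative, not from the original):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the tf = frequency / maxFrequency variant: take the maximum
// frequency over the values seen in the first pass, then normalize each
// tf by it before multiplying with the shared IDF.
public class NormalizedTf {

    public static Map<String, Double> tfIdfPerFile(Map<String, Integer> tfs, int numDocs) {
        int maxFrequency = 1;
        for (int f : tfs.values()) {
            maxFrequency = Math.max(maxFrequency, f);
        }
        double idf = Math.log10((double) numDocs / tfs.size());
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : tfs.entrySet()) {
            double tf = (double) e.getValue() / maxFrequency;
            result.put(e.getKey(), tf * idf);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> tfs = new HashMap<>();
        tfs.put("a.txt", 2);
        tfs.put("b.txt", 4);
        System.out.println(tfIdfPerFile(tfs, 10));
    }
}
```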

If you want to give a single-loop solution a try, you need the IDF up front, which means knowing the number of input values. In Java 8, you may do the trick using:

long DF = values.spliterator().getExactSizeIfKnown(); // returns -1 if the size is not known
double IDF = Math.log10((double)numDocs/DF);

as suggested in this post, or follow the other suggestions in the same post that do not use a loop (otherwise, you can follow the previous answer).
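One caveat worth knowing before relying on this: `getExactSizeIfKnown()` only returns a real count when the `Spliterator` reports the `SIZED` characteristic; for a generic `Iterable` (like the one Hadoop hands to `reduce()`) the default spliterator reports an unknown size and the call returns -1. A plain-Java sketch of both cases (the class name is illustrative):

```java
import java.util.Arrays;
import java.util.List;

// Demonstrates when Spliterator.getExactSizeIfKnown() actually knows the
// size: a List's spliterator is SIZED, while a generic Iterable's default
// spliterator is not and reports -1.
public class SpliteratorSize {

    public static long sizeIfKnown(Iterable<?> values) {
        return values.spliterator().getExactSizeIfKnown();
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("a=1", "b=2");
        System.out.println(sizeIfKnown(list));          // 2: lists are SIZED

        Iterable<String> lazy = () -> list.iterator();  // generic Iterable
        System.out.println(sizeIfKnown(lazy));          // -1: size unknown
    }
}
```

So it is worth checking the returned value before dividing by it; if it comes back -1, fall back to the two-pass (Map-based) solution above.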

In this case, your code would be (I have not tried that):

// calculate tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int numDocs = context.getConfiguration().getInt("noOfDocuments", -1); //set this in the Driver's Configuration, if you know it already, or set a counter in the mapper to get it here using getCounter()
        long DF = values.spliterator().getExactSizeIfKnown();
        double IDF = Math.log10((double)numDocs/DF);            

        // Note: key is a word, values are in the form of
        // (filename=frequency)
        for (Text value: values) {
            String[] valueParts = value.toString().split("="); //Text has no split(); convert to String first
            outputKey.set(key.toString()+"@"+valueParts[0]);
            outputValue.set(String.valueOf(Integer.parseInt(valueParts[1]) * IDF)); //you could also use a DoubleWritable
            context.write(outputKey, outputValue);
        }           

    }
}

This will also save some memory, as you won't need an additional Map (if it works, that is).

EDIT: The code above assumes that you already have the total frequency of each word per filename, i.e., that the same filename does not appear more than once in the values, but you may want to check that this holds. Otherwise, the second solution does not work, as you would have to sum the frequencies per file in a first loop.
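If duplicates can occur, summing them while building the map is a one-line change using `Map.merge` instead of `put`; a plain-Java sketch (the helper class and sample values are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// If the same filename may appear more than once among the values,
// accumulate its frequencies with Map.merge instead of overwriting
// the previous count with put().
public class FrequencyMerge {

    public static Map<String, Integer> totals(String[] values) {
        Map<String, Integer> tfs = new HashMap<>();
        for (String value : values) {
            String[] parts = value.split("=");
            // merge() inserts the value if absent, otherwise sums it
            tfs.merge(parts[0], Integer.parseInt(parts[1]), Integer::sum);
        }
        return tfs;
    }

    public static void main(String[] args) {
        System.out.println(totals(new String[]{"a.txt=2", "a.txt=3", "b.txt=1"}));
    }
}
```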
