import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// calculate the tf-idf of every word in every document
public static class CalReducer extends Reducer&lt;Text, Text, Text, Text&gt; {

    private final Text outputKey = new Text();
    private final Text outputValue = new Text();

    public void reduce(Text key, Iterable&lt;Text&gt; values, Context context)
            throws IOException, InterruptedException {
        // Note: key is a word, values are in the form (filename=frequency)
        Map&lt;String, Integer&gt; tfs = new HashMap&lt;&gt;();
        for (Text value : values) {
            String[] valueParts = value.toString().split("="); // Text has no split(); convert to String first
            tfs.put(valueParts[0], Integer.parseInt(valueParts[1])); // do the necessary checks here
        }
        // set "noOfDocuments" in the Driver, if you know it already, or set a
        // counter in the mapper and read it here using getCounter()
        int numDocs = context.getConfiguration().getInt("noOfDocuments", 1);
        double IDF = Math.log10((double) numDocs / tfs.keySet().size());
        // for every filename=frequency in the values, compute the tf-idf of
        // this word in filename and output (word@filename, tfidf)
        for (String file : tfs.keySet()) {
            outputKey.set(key.toString() + "@" + file);
            // you could also make outputValue a DoubleWritable
            outputValue.set(String.valueOf(tfs.get(file) * IDF));
            context.write(outputKey, outputValue);
        }
    }
}
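By the way, here is a minimal sketch of a driver that sets noOfDocuments in the job configuration so the reducer above can read it (TfIdfDriver and CalMapper are hypothetical names, and the hard-coded count is just a placeholder for however you obtain it):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TfIdfDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("noOfDocuments", 20); // assumption: the count is known before the job starts
        Job job = Job.getInstance(conf, "tf-idf");
        job.setJarByClass(TfIdfDriver.class);
        job.setMapperClass(CalMapper.class); // hypothetical mapper class
        job.setReducerClass(CalReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}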
In case you define tf as frequency / maxFrequency, you can find the maxFrequency in the first loop and change the outputValue accordingly.
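If you go that way, here is an untested sketch of the changes inside reduce(): the first loop also tracks the maximum frequency, and the output loop divides by it before multiplying by the IDF (everything else stays as above):

// first loop: collect the counts and track the maximum frequency
Map&lt;String, Integer&gt; tfs = new HashMap&lt;&gt;();
int maxFrequency = 1; // start at 1 to avoid a division by zero
for (Text value : values) {
    String[] valueParts = value.toString().split("=");
    int frequency = Integer.parseInt(valueParts[1]);
    tfs.put(valueParts[0], frequency);
    maxFrequency = Math.max(maxFrequency, frequency);
}
// second loop: compute IDF as before, then output the normalized tf-idf
for (String file : tfs.keySet()) {
    double tf = (double) tfs.get(file) / maxFrequency;
    outputKey.set(key.toString() + "@" + file);
    outputValue.set(String.valueOf(tf * IDF));
    context.write(outputKey, outputValue);
}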
If you want to give a single-loop solution a try, you need the IDF before iterating, which means you first need the number of input values (the document frequency).
You may do the trick in Java 8 using:
long DF = values.spliterator().getExactSizeIfKnown();
double IDF = Math.log10((double) numDocs / DF);
as suggested in this post, or follow the other suggestions in the same post that do not use a loop (otherwise, you can follow the previous answer).
In this case, your code would be (I have not tried that):
// calculate the tf-idf of every word in every document
public static class CalReducer extends Reducer&lt;Text, Text, Text, Text&gt; {

    private final Text outputKey = new Text();
    private final Text outputValue = new Text();

    public void reduce(Text key, Iterable&lt;Text&gt; values, Context context)
            throws IOException, InterruptedException {
        // set "noOfDocuments" in the Driver, if you know it already, or set a
        // counter in the mapper and read it here using getCounter()
        int numDocs = context.getConfiguration().getInt("noOfDocuments", 1);
        long DF = values.spliterator().getExactSizeIfKnown();
        double IDF = Math.log10((double) numDocs / DF);
        // Note: key is a word, values are in the form (filename=frequency)
        for (Text value : values) {
            String[] valueParts = value.toString().split("=");
            outputKey.set(key.toString() + "@" + valueParts[0]);
            outputValue.set(String.valueOf(Integer.parseInt(valueParts[1]) * IDF));
            context.write(outputKey, outputValue);
        }
    }
}
This will also save some memory, as you won't need an additional Map (if it works, that is).
EDIT: The code above assumes that you already have the total frequency of each word per filename, i.e., that the same filename does not appear in the values more than once, but you may want to check whether that holds. Otherwise, the second solution does not work, since you would have to sum the frequencies per file in the first loop.
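If duplicates can occur, a small change to the first loop of the two-loop version handles them; this sketch sums the partial counts per file with Map.merge (Java 8):

// sum the partial counts in case the same filename appears more than once
Map&lt;String, Integer&gt; tfs = new HashMap&lt;&gt;();
for (Text value : values) {
    String[] valueParts = value.toString().split("=");
    tfs.merge(valueParts[0], Integer.parseInt(valueParts[1]), Integer::sum);
}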