I need to make a Term Frequency Map/Reduce with a twist:
- lowercase terms
- remove stop words
- stem words
- split into phrases
- count each phrase
- order by count desc
By "split into phrases" I mean the following: given the title "david cleans rooms", I would like each of these phrases counted in the results:
- david
- david cleans
- david cleans rooms
- cleans
- cleans rooms
- rooms
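To make the expected output concrete, here is a rough sketch of the phrase splitting in plain JavaScript. The name phrasesOf is just something I made up for illustration, and I am assuming a phrase means any contiguous run of words in the title:

```javascript
// Hypothetical helper: list every contiguous phrase in a title.
function phrasesOf(title) {
    // Lowercase, split on whitespace, and drop empty tokens.
    var words = title.toLowerCase().split(/\s+/).filter(Boolean);
    var phrases = [];
    // For each start word, extend the phrase one word at a time.
    for (var start = 0; start < words.length; start++) {
        for (var end = start + 1; end <= words.length; end++) {
            phrases.push(words.slice(start, end).join(" "));
        }
    }
    return phrases;
}
```

Calling phrasesOf("david cleans rooms") gives the six phrases listed above. Inside map, I imagine I would emit(phrase, 1) for each one instead of pushing it onto an array.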
I currently have a simple solution that does not handle phrases, stop words, or stemming:
var map = function() {
    var summary = this.summary;
    if (summary) {
        // quick lowercase to normalize per your requirements
        summary = summary.toLowerCase().split(" ");
        for (var i = summary.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (summary[i]) { // make sure there's something
                emit(summary[i], 1); // store a 1 for each word
            }
        }
    }
};
var reduce = function(key, values) {
    var count = 0;
    values.forEach(function(v) {
        count += v;
    });
    return count;
};
I am not sure whether MongoDB map/reduce supports stemming and stop words out of the box, or how to put all of this together.
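If I have to supply the stop word list and the stemmer myself, I picture something like the sketch below. STOP_WORDS, naiveStem, and normalize are all placeholder names of mine, the word list is far from complete, and naiveStem is only a crude suffix stripper standing in for a real stemmer such as Porter:

```javascript
// Placeholder stop word list; a real one would be much longer.
var STOP_WORDS = { "a": true, "an": true, "and": true, "the": true, "of": true, "is": true };

// Own-property check so words like "constructor" are not mistaken for stop words.
function isStopWord(word) {
    return Object.prototype.hasOwnProperty.call(STOP_WORDS, word);
}

// Crude stand-in for a real stemmer: strips a few common English suffixes.
function naiveStem(word) {
    return word.replace(/(ing|ed|es|s)$/, "");
}

// Lowercase, split on whitespace, drop stop words, then stem what is left.
function normalize(text) {
    return text.toLowerCase()
        .split(/\s+/)
        .filter(function (w) { return w && !isStopWord(w); })
        .map(naiveStem);
}

// normalize("David cleans the rooms") returns ["david", "clean", "room"]
```

For use inside map, I assume these helpers would have to be defined within the map function itself or handed over through the scope option of mapReduce.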
Clarification: the result of the map/reduce will be a collection holding the terms and their frequencies. I need to keep this collection up to date (daily) so that we can see the most common terms in use. I am hoping to run an initial map/reduce and then a daily update over the latest records.
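For the daily update, my rough idea is to restrict the input with a query and merge the new counts into the existing results using the reduce output mode of mapReduce. A sketch of the options I would pass, where the items collection, the created_at field, and the term_counts collection are all names I made up:

```javascript
// Builds the options for the daily run. All names are hypothetical:
// "created_at" is an assumed timestamp field on each record,
// "term_counts" is an assumed name for the output collection.
function dailyOptions(lastRun) {
    return {
        query: { created_at: { $gt: lastRun } }, // only records added since the last run
        out: { reduce: "term_counts" }           // re-reduce new counts into the existing totals
    };
}

// In the mongo shell (not run here):
//   db.items.mapReduce(map, reduce, dailyOptions(lastRun));
//   db.term_counts.find().sort({ value: -1 }); // most common terms first
```

Since reduce only sums its values, re-reducing the new counts against the stored totals should give the same result as counting everything from scratch, as long as no record is processed twice.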