MongoDB Map Reduce Term Frequency with Stemming and phrases

Question

I need to make a Term Frequency Map/Reduce with a twist:

lowercase terms
remove stop words
stem words
split into phrases
count each phrase
order by count desc

What I mean by split into phrases is as follows: say I have a title "david cleans rooms", I would like to have the following phrases counted in the results:

david
david cleans
david cleans rooms
cleans
cleans rooms
rooms
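The splitting described here amounts to listing every contiguous run of words in the title. A minimal sketch in plain JavaScript, where the helper name `phrases` is only an illustration and not part of MongoDB:

```javascript
// Return every contiguous run of tokens, joined by single spaces.
function phrases(tokens) {
    var result = [];
    for (var start = 0; start < tokens.length; start++) {
        for (var end = start + 1; end <= tokens.length; end++) {
            result.push(tokens.slice(start, end).join(" "));
        }
    }
    return result;
}

var example = phrases("david cleans rooms".split(" "));
// example holds: david, david cleans, david cleans rooms, cleans, cleans rooms, rooms
```

A title of n words yields n(n+1)/2 phrases, so for long summaries it may be worth capping the phrase length.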

I currently have a simple solution that does not handle phrases, stop words or stemming:

var map = function() {  
    var summary = this.summary;
    if (summary) { 
        // quick lowercase to normalize per your requirements
        summary = summary.toLowerCase().split(" "); 
        for (var i = summary.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (summary[i])  {      // make sure there's something
               emit(summary[i], 1); // store a 1 for each word
            }
        }
    }
};

var reduce = function( key, values ) {    
    var count = 0;    
    values.forEach(function(v) {            
        count +=v;    
    });
    return count;
}

I am not sure whether MongoDB map/reduce supports stemming and stop words out of the box, or how to put it all together.
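One way the listed steps could fit together, sketched in plain JavaScript outside MongoDB. Everything here is an assumption for illustration: the stop-word list is tiny, `toyStem` only strips a trailing "s" and is not a real Porter or Snowball stemmer, and stop words are removed before phrases are built, so a phrase can bridge a removed word. Inside a real map function the counting line would become `emit(phrase, 1)`, and the helpers would presumably have to be defined inside the map function so the server-side JavaScript can see them.

```javascript
// Assumption: a tiny illustrative stop-word list, not a complete one.
var STOP_WORDS = { "a": true, "an": true, "and": true, "the": true, "of": true, "in": true };

// Assumption: a toy stemmer that only strips a trailing "s".
// A real Porter or Snowball implementation would replace this.
function toyStem(word) {
    if (word.length > 3 && /[^s]s$/.test(word)) {
        return word.slice(0, -1);
    }
    return word;
}

// Lowercase, split on non-alphanumerics, drop stop words, then stem.
function normalize(summary) {
    return summary.toLowerCase().split(/[^a-z0-9]+/).filter(function (w) {
        return w && !STOP_WORDS.hasOwnProperty(w);
    }).map(toyStem);
}

// Count every contiguous phrase across all summaries, ordered by count desc.
function phraseCounts(summaries) {
    var counts = Object.create(null); // no inherited keys such as "constructor"
    summaries.forEach(function (summary) {
        var tokens = normalize(summary);
        for (var start = 0; start < tokens.length; start++) {
            for (var end = start + 1; end <= tokens.length; end++) {
                var phrase = tokens.slice(start, end).join(" ");
                counts[phrase] = (counts[phrase] || 0) + 1; // emit(phrase, 1) in a map function
            }
        }
    });
    return Object.keys(counts).map(function (p) {
        return { phrase: p, count: counts[p] };
    }).sort(function (a, b) {
        // Highest count first; ties broken alphabetically for a stable order.
        return b.count - a.count || (a.phrase < b.phrase ? -1 : a.phrase > b.phrase ? 1 : 0);
    });
}

var result = phraseCounts(["David cleans the rooms", "the room"]);
```

The existing reduce function would work unchanged with this, since it only sums the emitted values per key.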

Clarification: the result of the Map/Reduce will be a collection with the terms and their frequency. I need to keep this collection up to date (daily) so that we can see the most common terms used. I am hoping to run an initial M/R and then a daily update on the latest records.
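For the initial-run-plus-daily-update pattern, MongoDB's map/reduce accepts a `query` filter and an `out: { reduce: ... }` mode that merges new results into an existing output collection by re-applying the reduce function per key. A hedged sketch of the options object follows; the `createdAt` field and the `term_counts` and `articles` names are hypothetical and not taken from the question.

```javascript
// Build options for an incremental run that only reads records newer than `since`.
function incrementalOptions(since) {
    return {
        // Assumption: each record stores its creation time in `createdAt`.
        query: { createdAt: { $gt: since } },
        // Merge into the existing output collection by re-applying reduce per key.
        out: { reduce: "term_counts" }
    };
}

var options = incrementalOptions(new Date("2014-01-08T00:00:00Z"));
// In the mongo shell this might be used as:
//   db.articles.mapReduce(map, reduce, options);
```

This relies on the reduce function being a plain sum, so that combining a stored total with new counts gives the same result as counting everything at once.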

Map Reduce would be a very bad way of doing this; Map Reduce is really not designed to run inline with your application. You should either look into MongoDB's new FTS abilities or get an FTS tech — Sammaye, Jan 08 '14 at 11:24
Full Text Search lets you index and search a collection. But I am not sure it can create a collection with term/count that I can work with. — checklist, Jan 08 '14 at 11:31
Why do you need a physical collection for that? The full text index will do the sorting and everything else required; you won't need to hold any data yourself — Sammaye, Jan 08 '14 at 11:47
I need the data (tens of thousands of result documents) to display, sort etc. Also FTS is a beta feature and not ready for production. — checklist, Jan 08 '14 at 15:49
You want facets, and the sort etc. comes built in. Honestly, this is a really bad route; go for an actual FTS tech — Sammaye, Jan 08 '14 at 15:49
@checklist I was wondering what you chose as a solution, as I am on the same road right now. MongoDB 2.6 FTS has all the features, but only if you want to search, and the only meta it will let you use is the score. So it can't be used for TF-IDF. Thanks — Maziyar, Aug 12 '14 at 02:56
@Maziyar - we decided to go with Solr using Spring-Data-Solr. MongoDB just didn't do that for us. Of course Solr has its own issues. — checklist, Aug 12 '14 at 06:23

score 0 · Answer 1 · edited May 23 '17 at 11:58

The feature list you have described is exactly what MongoDB's text index provides:

language-based stemming
case-insensitive indexing
option to set relative field weights
search for single terms, multiple terms, or phrases
results returned scored by relevance

If you want to build your own full text search, Map/Reduce is not the best approach. For a basic solution you would be better to iterate your documents using a normal find() and build a multi-key index based on your keyword search requirements.
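A rough sketch of that find()-based approach: compute a `keywords` array per document and index it. The sample data, the `keywordsFor` helper, and the `articles` collection name are assumptions for illustration; the mongo shell calls appear only in comments so the block runs as plain JavaScript.

```javascript
// Return the distinct lowercase keywords of one document's summary.
function keywordsFor(doc) {
    var seen = Object.create(null);
    var words = (doc.summary || "").toLowerCase().split(/[^a-z0-9]+/);
    return words.filter(function (w) {
        if (!w || seen[w]) { return false; }
        seen[w] = true;
        return true;
    });
}

// Stand-in for documents returned by find().
var docs = [{ _id: 1, summary: "David cleans rooms, David rests" }];
docs.forEach(function (doc) {
    doc.keywords = keywordsFor(doc);
});
// In the mongo shell (collection name is hypothetical), roughly:
//   db.articles.find().forEach(function (doc) {
//       db.articles.update({ _id: doc._id }, { $set: { keywords: keywordsFor(doc) } });
//   });
//   db.articles.ensureIndex({ keywords: 1 }); // an index on an array field is multikey
```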

Outside of MongoDB there are other text search options that can be integrated to provide more advanced search options such as facets, clustering, and keyword proximity.

Thanks. I am aware of the FTS features. But that is an index. What I need is a resulting collection that I can work with and, most importantly, a count per term (its frequency). — checklist, Jan 08 '14 at 15:51

score 0 · Answer 2 · answered May 14 '14 at 23:10

The latest version, MongoDB 2.6, includes FTS as a release feature (no longer beta). This version includes stemming with Snowball and stop words for each language.

jvea

But the only meta it will let you use at the moment is the score, nothing else. You cannot take advantage of stemming, stop words, etc. in your aggregation. It can only be used in a text match. — Maziyar, Aug 12 '14 at 02:53
