I need to make a Term Frequency Map/Reduce with a twist:

  1. lowercase terms
  2. remove stop words
  3. stem words
  4. split into phrases
  5. count each phrase
  6. order by count desc

What I mean by split into phrases is as follows: say I have a title "david cleans rooms", I would like to have the following phrases counted in the results:

    david
    david cleans
    david cleans rooms
    cleans
    cleans rooms
    rooms
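To make the phrase step concrete, here is a minimal sketch in plain JavaScript that produces every contiguous run of words from a title. The function name `phrases` is made up for illustration, and the input is assumed to be already split into words.

```javascript
// Return every contiguous phrase (word n-gram) that can be built from `words`.
// `phrases` is a hypothetical helper name, not part of MongoDB.
function phrases(words) {
    var result = [];
    for (var start = 0; start < words.length; start++) {
        for (var end = start + 1; end <= words.length; end++) {
            // slice(start, end) is the run of words from `start` up to, but not including, `end`
            result.push(words.slice(start, end).join(" "));
        }
    }
    return result;
}

console.log(phrases("david cleans rooms".split(" ")));
```

A title of n words yields n(n+1)/2 phrases, so long summaries will emit many keys; capping the phrase length may be worth considering.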

I currently have a simple solution that does not handle phrases, stop words or stemming:

var map = function() {  
    var summary = this.summary;
    if (summary) { 
        // quick lowercase to normalize per your requirements
        summary = summary.toLowerCase().split(" "); 
        for (var i = summary.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (summary[i])  {      // make sure there's something
               emit(summary[i], 1); // store a 1 for each word
            }
        }
    }
};

var reduce = function(key, values) {
    // sum the 1s emitted for this word
    var count = 0;
    values.forEach(function(v) {
        count += v;
    });
    return count;
};

I am not sure whether MongoDB map/reduce supports stemming and stop words out of the box, or how to put it all together.
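For illustration only, the six steps could be chained like this in plain JavaScript, outside MongoDB. Everything here is an assumption: `STOP_WORDS` is a tiny made-up list, `naiveStem` is a crude suffix stripper standing in for a real stemmer (it is not Porter or Snowball), and `termFrequencies` is a hypothetical name. Inside a real map function the helpers would have to be defined inline or made available to the server some other way.

```javascript
// Hypothetical, deliberately tiny stop word list; a real one would be much longer.
var STOP_WORDS = { "the": true, "a": true, "an": true, "and": true, "of": true, "in": true };

// Placeholder stemmer: strips a few common English suffixes.
// NOT the Porter or Snowball algorithm; swap in a real stemmer for real use.
function naiveStem(word) {
    var suffixes = ["ing", "ed", "es", "s"];
    for (var i = 0; i < suffixes.length; i++) {
        var s = suffixes[i];
        if (word.length > s.length + 2 && word.slice(-s.length) === s) {
            return word.slice(0, -s.length);
        }
    }
    return word;
}

// Steps 1-3: lowercase, drop punctuation and stop words, stem.
function normalize(text) {
    return text.toLowerCase()
        .replace(/[^a-z0-9\s]/g, " ")
        .split(/\s+/)
        .filter(function (w) { return w && !STOP_WORDS.hasOwnProperty(w); })
        .map(naiveStem);
}

// Steps 4-6: split into phrases, count each phrase, order by count descending.
function termFrequencies(summaries) {
    var counts = Object.create(null); // no prototype, so any phrase is a safe key
    summaries.forEach(function (summary) {
        var words = normalize(summary);
        for (var start = 0; start < words.length; start++) {
            for (var end = start + 1; end <= words.length; end++) {
                var phrase = words.slice(start, end).join(" ");
                counts[phrase] = (counts[phrase] || 0) + 1;
            }
        }
    });
    return Object.keys(counts)
        .map(function (p) { return { phrase: p, count: counts[p] }; })
        .sort(function (a, b) { return b.count - a.count; });
}

console.log(termFrequencies(["David cleans the rooms", "David cleans"]));
```

Note that because stop words are removed before phrases are built, words on either side of a stop word become adjacent ("cleans the rooms" contributes the phrase "clean room"). Whether that is desirable depends on the use case.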

Clarification: the result of the Map/Reduce will be a collection with the terms and their frequency. I need to keep this collection up to date (daily) so that we can see the most common terms used. I am hoping to run an initial M/R and then a daily update on the latest records.
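For the daily update, MongoDB's mapReduce accepts an `out` option with a `reduce` action, which merges new results into an existing output collection by re-running the reduce function on keys that already exist, and a `query` option to restrict the input to the latest records. The sketch below only simulates that merge in plain JavaScript so the idea can be checked without a server; `mergeDaily`, the collection names and the `created` date field in the comment are hypothetical.

```javascript
// Same summing reduce as in the question.
function reduce(key, values) {
    var count = 0;
    values.forEach(function (v) { count += v; });
    return count;
}

// Simulates what the "reduce" output action does: for a term already present
// in the totals, the old and new counts are passed through reduce();
// a new term is simply inserted.
function mergeDaily(totals, daily) {
    Object.keys(daily).forEach(function (term) {
        if (Object.prototype.hasOwnProperty.call(totals, term)) {
            totals[term] = reduce(term, [totals[term], daily[term]]);
        } else {
            totals[term] = daily[term];
        }
    });
    return totals;
}

// In the mongo shell the daily run might look like this (names are assumptions):
//   db.articles.mapReduce(map, reduce, {
//       query: { created: { $gte: startOfDay } },
//       out: { reduce: "term_counts" }
//   });

console.log(mergeDaily({ david: 10, rooms: 3 }, { david: 2, cleans: 1 }));
```

This only works cleanly if each record is counted exactly once, so the query has to select records that have not been processed before.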

checklist
  • A Map Reduce would be a very bad way of doing this; Map Reduce is really not designed to run inline with your application. You should either look into MongoDB's new FTS abilities or get an FTS tech – Sammaye Jan 08 '14 at 11:24
  • Full Text Search allows to index and search on a collection. But I am not sure it can create a collection with term/count that I can work with. – checklist Jan 08 '14 at 11:31
  • Why do you need a physical collection for that, the full text index will do the sorting and stuff required, you won't need to hold any data yourself – Sammaye Jan 08 '14 at 11:47
  • I need the data (tens of thousands of result documents) to display, sort etc. Also FTS is a beta feature and not ready for production. – checklist Jan 08 '14 at 15:49
  • You want facets and the sort etc comes inbuilt, honestly this is a really bad route, go for an actual FTS tech – Sammaye Jan 08 '14 at 15:49
  • @checklist I was wondering what you chose as a solution, as I am on the same road right now. The MongoDB 2.6 FTS has all the features, but only if you just want to search; the only meta it will let you use is the score, so it can't be used for TF-IDF. Thanks – Maziyar Aug 12 '14 at 02:56
  • @Maziyar - we decided to go with Solr using Spring-Data-Solr. MongoDB just didn't make it for us. Of course Solr has its own issues. – checklist Aug 12 '14 at 06:23

2 Answers

The feature list you have described is exactly what MongoDB's text index provides:

If you want to build your own full text search, Map/Reduce is not the best approach. For a basic solution you would be better off iterating over your documents using a normal find() and building a multikey index based on your keyword search requirements.
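As a rough sketch of that approach, a helper could turn each document's text into an array of unique keywords, which is the kind of field a multikey index is built on. `extractKeywords`, the `articles` collection and the `keywords` field are assumptions for illustration; the shell calls appear only in comments because they need a running server.

```javascript
// Turn free text into a list of unique lowercase keywords.
// Hypothetical helper; stop word removal or stemming could be added here.
function extractKeywords(summary) {
    var seen = Object.create(null);
    var keywords = [];
    (summary || "").toLowerCase().split(/[^a-z0-9]+/).forEach(function (word) {
        if (word && !seen[word]) {
            seen[word] = true;
            keywords.push(word);
        }
    });
    return keywords;
}

// In the mongo shell (2.x era), usage might look like this (names are assumptions):
//   db.articles.find().forEach(function (doc) {
//       db.articles.update({ _id: doc._id },
//                          { $set: { keywords: extractKeywords(doc.summary) } });
//   });
//   db.articles.ensureIndex({ keywords: 1 }); // index on an array field is multikey

console.log(extractKeywords("David cleans rooms, David rests"));
```

Later MongoDB versions name the index call `createIndex` rather than `ensureIndex`.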

Outside of MongoDB there are other text search options that can be integrated to provide more advanced search options such as facets, clustering, and keyword proximity.

Stennie
  • Thanks. I am aware of FTS features. But that is an index. And what I need is a resulting collection that I can work with and, most importantly, a count of terms and their frequency. – checklist Jan 08 '14 at 15:51

The latest version, MongoDB 2.6, includes FTS as a released feature (no longer beta). This version includes stemming with Snowball and stop words for each supported language.

jvea
  • But the only meta it will let you use at this moment is the score, nothing else. You cannot take advantage of stemming, stop words, etc. in your aggregation. They can only be used in a text match. – Maziyar Aug 12 '14 at 02:53