
I have a MongoDB collection of documents of the form

{
    "id": 42,
    "title": "candy can",
    "description": "canada candy canteen",
    "brand": "cannister candid",
    "manufacturer": "candle canvas"
}

I need to implement an auto-complete feature based on an input search term, matching in all fields except id. For example, if the input term is can, then I should return all matching words in the document, like

{ hints: ["candy", "can", "canada", "canteen", ...] }

I looked at this question but it didn't help. I also tried searching for how to do a regex search in multiple fields and extract matching tokens, or how to extract matching tokens in a MongoDB text search, but couldn't find any help.

ajay
  • What the top answer to that question suggests (a regular expression with a begin-of-string anchor) would be exactly what I would recommend you do. Why exactly doesn't this solve your problem? – Philipp Apr 27 '15 at 11:20
  • @Philipp But it works when searching in only one field. Also, I don't have an array to search in, it's a string. Do you suggest I tokenize all fields which I want to match, and store those tokens in an array? – ajay Apr 27 '15 at 11:29
  • That would definitely be the most query-friendly solution (not so friendly for updates, though) – Philipp Apr 27 '15 at 11:40
  • This really does sound like a use case for text search. You said you did not find anything helpful regarding this. A good beginning reference for text search is the MongoDB DBA course video on this topic on YouTube at https://www.youtube.com/watch?v=Hjx3eUqU0XE&list=PLGOsbT2r-igm8L5m85zc6BkZjAdzIuC5i&index=10 – SDillon Apr 27 '15 at 12:53
  • @SDillon With text search, I need to extract the (partially) matching tokens. I couldn't find any help regarding this. – ajay Apr 27 '15 at 12:57
  • You can search on all fields by using a wildcard text index and a prefix search. – karthik manchala Apr 27 '15 at 13:40

2 Answers


tl;dr

There is no easy solution for what you want, since normal queries can't modify the fields they return. There is a workaround (running the mapReduce below with inline output instead of writing to a collection), but except for very small databases it is not feasible to do this in real time.

The problem

As written, a normal query can't really modify the fields it returns. But there are other problems. If you want to do a regex search in halfway decent time, you would have to index all fields, which would need a disproportionate amount of RAM for that feature. If you don't index all fields, a regex search causes a collection scan, which means that every document has to be loaded from disk, which takes far too long for autocompletion to be convenient. Furthermore, multiple simultaneous users requesting autocompletion would create considerable load on the backend.
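To illustrate why, this is roughly what the naive version would look like (a hypothetical sketch; naiveAutocompleteFilter is my name, the field list is taken from the question's document):

```javascript
// Naive approach, for illustration only: OR a regex over every searchable
// field. A prefix-anchored regex (/^can/) only matches at the start of the
// whole field value, and an unanchored one (/can/) cannot use an index at
// all -- so without an index on each field this forces a collection scan
// on every keystroke.
const searchableFields = ['title', 'description', 'brand', 'manufacturer'];

function naiveAutocompleteFilter(term) {
  // one regex per field, combined with $or
  return { $or: searchableFields.map(f => ({ [f]: new RegExp(term) })) };
}

// usage (mongosh): db.yourCollection.find(naiveAutocompleteFilter('can'))
```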

The solution

The problem is quite similar to one I have already answered: we need to extract every word out of multiple fields, remove the stop words and save the remaining words, together with a link to the respective document(s) each word was found in, in a separate collection. To get an autocompletion list, we then simply query the indexed word list.

Step 1: Use a map/reduce job to extract the words

db.yourCollection.mapReduce(
  // Map function
  function() {

    // We need to save this in a local var as per scoping problems
    var document = this;

    // You need to expand this according to your needs
    var stopwords = ["the","this","and","or"];

    for(var prop in document) {

      // We are only interested in strings and explicitly not in _id
      if(prop === "_id" || typeof document[prop] !== 'string') {
        continue
      }

      (document[prop]).split(" ").forEach(
        function(word){

          // You might want to adjust this to your needs
          var cleaned = word.replace(/[;,.]/g,"")

          if(
            // We neither want stopwords...
            stopwords.indexOf(cleaned) > -1 ||
            // ...nor string which would evaluate to numbers
            !(isNaN(parseInt(cleaned))) ||
            !(isNaN(parseFloat(cleaned)))
          ) {
            return
          }
          emit(cleaned,document._id)
        }
      ) 
    }
  },
  // Reduce function
  function(k,v){

    // Kind of ugly, but works.
    // Improvements more than welcome!
    var values = { 'documents': []};
    v.forEach(
      function(vs){
        if(values.documents.indexOf(vs)>-1){
          return
        }
        values.documents.push(vs)
      }
    )
    return values
  },

  {
    // We need this for two reasons...
    finalize:

      function(key,reducedValue){

        // First, we ensure that each resulting document
        // has the documents field in order to unify access
        var finalValue = {documents:[]}

        // Second, we ensure that each document is unique in said field
        if(reducedValue.documents) {

          // We filter the existing documents array
          finalValue.documents = reducedValue.documents.filter(

            function(item,pos,self){

              // The default return value
              var loc = -1;

              for(var i=0;i<self.length;i++){
                // We have to do it this way since indexOf only works with primitives

                if(self[i].valueOf() === item.valueOf()){
                  // We have found the value of the current item...
                  loc = i;
                  //... so we are done for now
                  break
                }
              }

              // If the location we found equals the position of item, they are equal
              // If it isn't equal, we have a duplicate
              return loc === pos;
            }
          );
        } else {
          finalValue.documents.push(reducedValue)
        }
        // We have sanitized our data, now we can return it        
        return finalValue

      },
    // Our result are written to a collection called "words"
    out: "words"
  }
)

Running this mapReduce against your example results in a db.words collection that looks like this:

    { "_id" : "can", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "canada", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "candid", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "candle", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "candy", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "cannister", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "canvas", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }

Note that the individual words are the _id of the documents. The _id field is indexed automatically by MongoDB. Since MongoDB tries to keep indices in RAM, we can use a few tricks to both speed up autocompletion and reduce the load put on the server.

Step 2: Query for autocompletion

For autocompletion, we only need the words, without the links to the documents. Since the words are indexed, we use a covered query – a query answered only from the index, which usually resides in RAM.

To stick with your example, we would use the following query to get the candidates for autocompletion:

db.words.find({_id:/^can/},{_id:1})

which gives us the result

    { "_id" : "can" }
    { "_id" : "canada" }
    { "_id" : "candid" }
    { "_id" : "candle" }
    { "_id" : "candy" }
    { "_id" : "cannister" }
    { "_id" : "canteen" }
    { "_id" : "canvas" }

Using the .explain() method, we can verify that this query uses only the index.

        {
        "cursor" : "BtreeCursor _id_",
        "isMultiKey" : false,
        "n" : 8,
        "nscannedObjects" : 0,
        "nscanned" : 8,
        "nscannedObjectsAllPlans" : 0,
        "nscannedAllPlans" : 8,
        "scanAndOrder" : false,
        "indexOnly" : true,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "indexBounds" : {
            "_id" : [
                [
                    "can",
                    "cao"
                ],
                [
                    /^can/,
                    /^can/
                ]
            ]
        },
        "server" : "32a63f87666f:27017",
        "filterSet" : false
    }

Note the indexOnly:true field.

Step 3: Query the actual document

Although we have to do two queries to get the actual document, since we have sped up the overall process, the user experience should be good enough.

Step 3.1: Get the document of the words collection

When the user selects a choice from the autocompletion, we have to query the complete document from words in order to find the documents where the chosen word originated from.

db.words.find({_id:"canteen"})

which would result in a document like this:

{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }

Step 3.2: Get the actual document

With that document, we can now either show a page with search results or, as in this case, redirect to the actual document, which you can get by:

db.yourCollection.find({_id:ObjectId("553e435f20e6afc4b8aa0efb")})
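Steps 3.1 and 3.2 can be wrapped into a small helper; a sketch assuming the Node.js driver and the collection names used above (resolveCompletion is a hypothetical name):

```javascript
// Hypothetical helper: resolve a completion the user picked back to the
// source documents, combining Steps 3.1 and 3.2 (assumes a connected `db`
// handle from the Node.js driver and the collections used above).
async function resolveCompletion(db, word) {
  // Step 3.1: look up the word entry to get the originating document ids
  const entry = await db.collection('words').findOne({ _id: word });
  if (!entry) return [];

  // Step 3.2: fetch the actual documents in a single query via $in
  return db.collection('yourCollection')
    .find({ _id: { $in: entry.value.documents } })
    .toArray();
}
```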

Notes

While this approach may seem complicated at first (well, the mapReduce part is a bit), it is actually pretty easy conceptually. Basically, you are trading real-time results (which you won't have anyway unless you spend a lot of RAM) for speed. Imho, that's a good deal. To make the rather costly mapReduce phase more efficient, implementing incremental mapReduce could be one approach – improving my admittedly hacked mapReduce might well be another.
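An incremental run could look roughly like this (a sketch only: it assumes a hypothetical updatedAt field on your documents and reuses the map, reduce and finalize functions from Step 1, merging new results into the existing words collection):

```javascript
// Sketch of an incremental mapReduce run. Only documents changed since the
// last run are processed (via the query option), and the results are merged
// into the existing "words" collection with out: { reduce: ... } instead of
// overwriting it. The updatedAt field is an assumption about your schema;
// mapFn, reduceFn and finalizeFn are the functions from Step 1.
function refreshWords(db, mapFn, reduceFn, finalizeFn, lastRun) {
  return db.yourCollection.mapReduce(mapFn, reduceFn, {
    query: { updatedAt: { $gt: lastRun } },
    finalize: finalizeFn,
    out: { reduce: 'words' },
  });
}
```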

Last but not least, this approach is a rather ugly hack altogether. You might want to dig into Elasticsearch or Lucene. Those products, imho, are much, much better suited for what you want.

Markus W Mahlberg
  • Thanks a lot for such a detailed answer :) Just what I needed. I was just looking into elasticsearch and found that it is better suited for my purposes, however for the time being, this will make do :) – ajay Apr 27 '15 at 19:13
  • @ajay Glad I could help. To be honest with you, it was a nice apprentice piece. Please note that Elasticsearch does not give real-time results either, albeit you won't get any closer to them, imho. – Markus W Mahlberg Apr 27 '15 at 19:20
  • @MarkusWMahlberg: for which data sizes is this solution good? I have a dictionary of 1 million strings which consist of mostly one or two words (on average 12 chars), all running on the smallest Google Cloud machine – Silver May 03 '16 at 08:41
  • @Silver With an incremental map-reduce, we are restricted by RAM and disk size only. 1M * 12 bytes = 12MB. Let's even double that, and we are still talking about negligible RAM consumption. But as always: you have to test. Index compression, available as of 3.0 when you use WiredTiger, might help here. But to be honest, I have not run any benchmarks or tested consumption. I have to admit you are pretty much on your own, though I will be glad to help. – Markus W Mahlberg May 03 '16 at 20:18
  • Very nice, it worked for me as expected and with very high performance. I made an aggregation pipeline that does exactly this [here](https://gitlab.com/bacloud14/classified-ads-58/-/snippets/2251558) as map-reduce is flagged deprecated now. – Curcuma_ Feb 10 '22 at 17:15
  • @Curcuma_ If you added your efforts as an answer, this would be more helpful -- and could be upvoted ;) – Markus W Mahlberg May 15 '22 at 08:52

Thanks to @Markus's solution, I came up with something similar using aggregation instead, since mapReduce is flagged as deprecated in later versions.

const { Db, Collection } = require('mongodb')
//.replace(/(\b(\w{1,3})\b(\W|$))/g,'').split(/\s+/).join(' ')
const routine = `function (text) {
    const stopwords = ['the', 'this', 'and', 'or', 'id']
    text = text.replace(new RegExp('\\b(' + stopwords.join('|') + ')\\b', 'g'), '')
    text = text.replace(/[;,.]/g, ' ').trim()
    return text.toLowerCase()
}`
// If the pipeline includes the $out operator, aggregate() returns an empty cursor.
const agg = [
    {
        $match: {
            a: true,
            d: false,
        },
    },
    {
        $project: {
            title: 1,
            desc: 1,
        },
    },
    {
        $replaceWith: {
            _id: '$_id',
            text: {
                $concat: ['$title', ' ', '$desc'],
            },
        },
    },
    {
        $addFields: {
            cleaned: {
                $function: {
                    body: routine,
                    args: ['$text'],
                    lang: 'js',
                },
            },
        },
    },
    {
        $replaceWith: {
            _id: '$_id',
            text: {
                $trim: {
                    input: '$cleaned',
                },
            },
        },
    },
    {
        $project: {
            words: {
                $split: ['$text', ' '],
            },
            qt: {
                $literal: 1,
            },
        },
    },
    {
        $unwind: {
            path: '$words',
            includeArrayIndex: 'id',
            preserveNullAndEmptyArrays: true,
        },
    },
    {
        $group: {
            _id: '$words',
            docs: {
                $addToSet: '$_id',
            },
            weight: {
                $sum: '$qt',
            },
        },
    },
    {
        $sort: {
            weight: -1,
        },
    },
    {
        $limit: 100,
    },
    {
        $out: {
            db: 'listings_db',
            coll: 'words',
        },
    },
]
// Closure for db instance only
/**
 *
 * @param { Db } db
 */
module.exports = function (db) {
    /** @type { Collection } */
    let collection
    /**
     * Runs the aggregation pipeline
     * @return {Promise}
     */
    this.refreshKeywords = async function () {
        collection = db.collection('listing')
        // .toArray() to trigger the aggregation
        // it returns an empty cursor, so that's fine
        return await collection.aggregate(agg).toArray()
    }
}

Please check it; it should require only very minimal changes for your own schema.
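For the autocompletion query itself, the weight field produced by the $group stage can be used to rank candidates by frequency; a sketch (autocompleteQuery is a hypothetical helper that also regex-escapes the user input):

```javascript
// Hypothetical helper: build a prefix query against the "words" collection,
// ranking candidates by the weight computed in the aggregation above.
function autocompleteQuery(prefix) {
  // escape regex metacharacters so raw user input can't break the pattern
  const safe = prefix.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return {
    filter: { _id: new RegExp('^' + safe) }, // prefix-anchored, uses the _id index
    projection: { _id: 1, weight: 1 },
    sort: { weight: -1 }, // most frequent words first
    limit: 10,
  };
}

// usage (mongosh):
//   const q = autocompleteQuery('can')
//   db.words.find(q.filter, q.projection).sort(q.sort).limit(q.limit)
```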

Curcuma_