Access/search raw Mongo text index content (tokenized terms) for term auto-completion

Question

My users are asking me for a "Google-like" query term suggestion (auto-complete) useful for misspelled terms and general insight. Mongo text indices only search on complete and correctly spelled terms.

I need access to the text index itself i.e. its "words". I did read this crude solution and am looking for something less fragile than double indexing and managing term (word) reference counts.

All I want to do is get up to N index tokens that start with a specific text. Don't tell me to use the regex search, because it defeats the faster text index. I do not want to use Elastic Search, Lucene, or another external indexer: the maintenance nightmare. Text search belongs to the database, and with a few limitations Mongo excels at it.

I HIGHLY suggest you read this fully and then re-write your questions https://stackoverflow.com/help/how-to-ask — Dan Green-Leipciger, Apr 13 '17 at 18:38
there is a similar question with the answer you might be looking for [here](http://stackoverflow.com/a/13753101/4207875) (disclaimer: you might not going to like it) — saljuama, Apr 19 '17 at 22:21
Thanks, @saljuama, but I am indeed looking for an official API, not a hack. Nor coding the database myself. Why would I use a text index in the first place? — Alex Rogachevsky, Apr 20 '17 at 05:51

phpkode · Answer 1 · 2017-04-16T13:12:37.747

Since you have ruled out regex and would prefer MongoDB's built-in text search, I will suggest a method I implemented some time back. It can handle partial-word searches, multi-word searches and, to a limited extent, spelling errors and singular/plural, present/past tense, and verb/noun variations. Be aware that it won't be efficient (and may not return correct results) if each of your fields contains thousands of words.

MongoDB text search matches only full words, so the string has to be formatted accordingly. The key point is to create an alternate text field, on which you apply the text index, and to search that field instead of the current one.

You also have to build an array of words to match from the client-side input.

I will give an overview of what I did. Suppose a string in collection is

"Implementing auto-complete feature using MongoDB"

You will be creating the following text string from it and storing it as another field (text indexed field)

"im imp impl implement implementi implementin implementing au aut auto co com comp compl comple complet complete fe fea feat featu featur feature mo mon mong mongo mongod mongodb"

The process before document insertion is explained below

1. Clean the string: convert it to lowercase and remove special characters like -, ( and ).
2. Remove insignificant words like is, was, the, using, among, having, etc.
3. Push the remaining words to an array (input_array).
4. For each word in input_array, take the leading substrings of length 2, 3 and 4 and push them to an output_array. These are matched for auto-completion and provide some cover against spelling errors. For example, "implementing" generates "im", "imp", "impl".
5. For each word of length n in input_array, take the leading substrings of length n-3, n-2, n-1 and n and push them to output_array. This covers some grammatical differences: if the user types "implement", a text containing "implementing" returns a positive match, because "implementing" generates "implement", "implementi", "implementin", "implementing".
6. Merge the array into a single text string of space-separated words and insert it into the collection.
7. The user's search input also has to be formatted into an array. Steps 1, 2, 3, 4 and 5 are applied here as well to create a search_input_array.
8. Applying step 4 to the client search string gives some protection against spelling errors. For example, if the user types "impdement", the formatted array will be ('im', 'imp', 'impd', 'impdem', 'impdeme', 'impdemen', 'impdement'). Two of these ('im' and 'imp') are valid matches for "implement". The rest are not real words and will match very few entries.
9. Applying step 5 to the client search terms gives some protection against grammar variations such as present/past tense, singular/plural, noun/verb, etc. For example, whether the user types "implement", "implementing", "implemented" or "implements", the formatted search array will always contain the term "implement", thereby giving a valid match for our entry in the collection.
10. Matching is then done with a query like

query["$text"] = { $search: formatted_search_input_array.join(" ") };  // $search expects a string, so join the terms with spaces
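The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the stop-word list, the function names (`cleanWords`, `expandWord`, `buildTokenString`) and the ASCII-only cleaning regex are made up for the example, and the prefix lengths 2, 3 and 4 are chosen because they reproduce the sample string shown earlier.

```javascript
// Hypothetical stop-word list; extend it to suit your data.
const STOP_WORDS = new Set(["is", "was", "the", "using", "among", "having"]);

// Steps 1-3: lowercase, turn special characters into spaces, drop stop words.
// Assumption: only ASCII letters and digits are kept.
function cleanWords(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w));
}

// Steps 4-5: leading substrings of length 2, 3, 4 and of length n-3 .. n.
// Duplicates and lengths outside 2..n are skipped.
function expandWord(word) {
  const n = word.length;
  const lengths = [2, 3, 4, n - 3, n - 2, n - 1, n];
  const tokens = [];
  for (const len of lengths) {
    if (len < 2 || len > n) continue;
    const token = word.slice(0, len);
    if (!tokens.includes(token)) tokens.push(token);
  }
  return tokens;
}

// Step 6 (and step 7 for user input): merge all tokens into one string.
function buildTokenString(text) {
  return cleanWords(text).flatMap(expandWord).join(" ");
}

// Index side: the value to store in the alternate, text-indexed field.
const indexedTokens = buildTokenString(
  "Implementing auto-complete feature using MongoDB"
);
console.log(indexedTokens);

// Query side: the same formatting applied to what the user typed.
const query = { $text: { $search: buildTokenString("impdement") } };
console.log(query);
```

The same function is used for both the stored field and the search input, so the two sides cannot drift apart. Note that one-letter words produce no tokens in this sketch.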
If you want to display suggestion tokens, you need to post-process the result set. Take the original text from the top n matches, then clean it and split it into words. Do a direct substring match against the terms in the search array, and return the matching words as tokens. If your texts are short sentences of fewer than 10 words, you can also return the complete text, as Google does (this looks better when the user types multi-word queries).
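A minimal sketch of that post-processing step, assuming the original texts of the top matches have already been fetched into an array. The function name `suggestTokens`, its `limit` parameter and the ASCII-only cleaning regex are hypothetical, not part of any MongoDB API.

```javascript
// Hypothetical helper: pick suggestion words out of the matched documents.
// originalTexts: original (unformatted) text of the top n matches.
// searchTerms:   the formatted search array built from the user's input.
function suggestTokens(originalTexts, searchTerms, limit = 10) {
  const suggestions = [];
  for (const text of originalTexts) {
    // Clean and split the original text into words.
    const words = text
      .toLowerCase()
      .replace(/[^a-z0-9\s]/g, " ")
      .split(/\s+/);
    for (const word of words) {
      if (word.length === 0 || suggestions.includes(word)) continue;
      // Direct substring match against any of the search terms.
      if (searchTerms.some((term) => word.includes(term))) {
        suggestions.push(word);
        if (suggestions.length === limit) return suggestions;
      }
    }
  }
  return suggestions;
}

const matchedTexts = ["Implementing auto-complete feature using MongoDB"];
console.log(suggestTokens(matchedTexts, ["im", "imp", "impd"]));
```

Very short terms such as "im" match any word containing them, so in practice you may prefer to test only the longer terms or require the match to be at the start of the word.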

You will get better results if your strings are short. And of course the criteria used to generate the text string should be adapted to your needs. If the formatted alternate text is large, you should also consider storing it in another collection and linking it by ObjectId reference.

Your assertion about MongoDB only searching for full words is false, take a look at the reference documentation (v3.4), when you see how [text indexes](https://docs.mongodb.com/manual/core/index-text/#index-entries) are created, with stems of words, unless the language `none` is specified. — saljuama, Apr 19 '17 at 22:08
I am using MongoDB 2.6. As per the documentation -> The $text operator can search for words and phrases. The query matches on the complete stemmed words. For example, if a document field contains the word blueberry, a search on the term blue "will not match" the document. However, a search on either blueberry or blueberries will match. — phpkode, Apr 20 '17 at 08:46
You can see that the algo i suggested will match blue. I have used it in phrases containing names of people and similar and in such cases pure stemmed word matching is not correct — phpkode, Apr 20 '17 at 08:53

score 0 · Answer 2 · answered Apr 19 '17 at 21:54

Fast search response time depends more or less on how many items must be traversed in the storage/file/database, how frequently the source is written to, the amount of throttling, and network or hardware overhead. Let's break those down and make a strategy to improve in all of those areas.

Full article is here
