
This is related to my question from earlier today, and there is also a related post on the topic. We've built a live search that fetches the top 20 matches from a collection of 250K names, and we want to do the data fetching correctly.

Currently, if I use:

db.collection.find({ "drug": { "$regex": "cols", "$options": "i" } })

then I get bombarded with email warnings from MongoDB Atlas saying Scanned Objects / Returned has gone above 1000. This is because I am not using $search, so the query cannot use the text index. Each query seems to scan all 250K documents to get the best 20 matches. Unfortunately, if I use this:

db.collection.find({ $text: { $search: "dog cat" } })

then, while I am no longer bombarded by emails, the search results are not good because $text does not capture partial strings. For example, when searching for the basketball player Zion Williamson, I get no results when the partial string Zion Williams is typed in; with regex, it correctly returns Zion Williamson.
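As far as I understand, this is because the text index matches whole (stemmed) words, not prefixes. A minimal illustration, assuming a text index on a hypothetical name field:

db.collection.createIndex({ name: "text" })

// Matches: "Williamson" is a whole token in the indexed text
db.collection.find({ $text: { $search: "Williamson" } })

// No match on a document containing "Williamson": "Williams" is only a
// prefix of the token "Williamson", and $text does not do prefix matching
db.collection.find({ $text: { $search: "Williams" } })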

Is it problematic to stick with the regex approach and ignore these email warnings? Until Mongo's $search is better at capturing partial strings, I don't want to use it in my live search. Perhaps it is possible to turn off the email alerts for this particular warning, for this particular collection only?

Thanks in advance for any thoughts on this!

Edit: The collection in question is fairly small (16MB), with ~250K documents and 5 fields in each document. Also, the performance of both $regex and $search is sufficient (~0.1 seconds); the full collection scan with $regex is not hurting the data-fetching performance too much.
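For reference, the full scan shows up in the query plan. A quick diagnostic, using the same query as above:

db.collection.find({ "drug": { "$regex": "cols", "$options": "i" } }).explain("executionStats")
// winningPlan is a COLLSCAN and executionStats.totalDocsExamined is ~250K,
// since an unanchored case-insensitive regex cannot use an index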

Canovice

1 Answer


As you added in your edit, the question mostly comes down to performance. However, consider not just the performance of a single query, but the performance under production load. How many requests per second do you expect to perform this query?

If it's run once per second or even less often, it will be OK for you for now. But what if the data set grows, or you start seeing more requests? That's when you'll start running into issues.

From your example it's not clear to me whether a case-insensitive "starts_with" search is all that you need. If it is, then you can actually use an index for it:

  • additionally store the search string in a to_lower or to_upper form
  • change your regex to an anchored form: ^lower-cased search string

Now your search is case insensitive and can use an index (it's equivalent to LIKE 'string%' in SQL).
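A minimal sketch of this in the mongo shell (the drug_lower field name is just an assumption based on your first example):

// Backfill a lower-cased copy of the searchable field
// (pipeline-style updates require MongoDB 4.2+)
db.collection.updateMany({}, [ { $set: { drug_lower: { $toLower: "$drug" } } } ])

// Index the normalized field
db.collection.createIndex({ drug_lower: 1 })

// An anchored, case-sensitive prefix regex on the normalized field can walk the index
db.collection.find({ drug_lower: { $regex: "^cols" } }).limit(20)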

However, keep in mind that using to_lower or to_upper can mess with your text in some locales (see ToUpperInvariant() – is MSDN wrong in its recommendation?). And of course, if you still need full text search, this won't work. In that case you can:

  • live with the performance restrictions mentioned above
  • live with the feature restrictions of the normal full text search query
  • build a custom trigram index (or research further, e.g. how to improve text search in Postgres); see the sketch after this list
  • look for alternatives: if you need full text search, a purpose-built search engine like Elasticsearch may be a better fit for your use case
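For illustration, a rough sketch of the trigram idea in the mongo shell (the helper and field names are assumptions; this is not a built-in MongoDB feature):

// Split a string into its distinct 3-character substrings
function trigrams(s) {
  s = s.toLowerCase();
  const out = new Set();
  for (let i = 0; i <= s.length - 3; i++) out.add(s.substring(i, i + 3));
  return [...out];
}

// Store the trigrams alongside each document, e.g.:
// { drug: "Zion Williamson", drug_trigrams: trigrams("Zion Williamson") }
// and index the array (a multikey index)
db.collection.createIndex({ drug_trigrams: 1 })

// A substring search then becomes an indexed $all query
// on the trigrams of the typed input
db.collection.find({ drug_trigrams: { $all: trigrams("williams") } }).limit(20)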
peter
  • thanks for the input. The data has some labels with prefixes in them (e.g. `M: Duke Blue Devils` and `F: Duke Blue Devils` for men's vs. women's teams), and users will also often search by last name only. For these reasons, I think `starts_with` would struggle to provide the best search results. – Canovice Oct 19 '20 at 23:29
  • While I only suggested to_upper / to_lower to unify the data, you can perform any sort of modification, like removing the prefix or adding another property that flips first name and last name. This might sound tedious, but I've seen similar things done. However, whether it's feasible for your concrete dataset is something you can judge best. – peter Oct 20 '20 at 03:52