0

I have a query in mongo python which looks like:

collection.find({"name" : {$in : ["foo_foo", "bar_bar"]}})

And I'd like now to match names like "foo_bar", "foo_foe" etcetera. So I used a regular expresion:

collection.find({ $or : [{"name" : {"$regex" : "foo_.*" }}, 
   {"name" : {"$regex" : "bar_.*" }}] })

But when looking for lots of entries, the second query is thousands of times slower. I don't see how to improve this. Any hint would be much appreciated. What really surprises me is that matching a "foo_foo" string should be slower than matching a string "foo_.*".

Cœur
  • 37,241
  • 25
  • 195
  • 267
Juan Chô
  • 542
  • 6
  • 23
  • [Have a look here](https://stackoverflow.com/questions/3305561/how-to-query-mongodb-with-like) There is a hint that you can imporve performance by using an index for a "starts with" search. – Igl3 Jan 19 '18 at 11:31
  • thanks Ingle but my collection does not have an index. Neither I can build one: I dont own the db. The query with full string is fast enought to be acceptable. But the regex is about 100 times slower. – Juan Chô Jan 19 '18 at 11:40

1 Answers1

0

You might want to try this instead:

collection.find({ $or : [{"name" : {"$regex" : "^foo_" }}, 
   {"name" : {"$regex" : "^bar_" }}] })

or even

collection.find({"name" : {"$regex" : "^(foo|bar)_" }})

You should see a drastic improvement of the execution time. The reason being that your regular expression is effectively searching for "foo_" inside a string, not necessarily from the start of the string. So it would match "foo_" as well as "someotherfoo_" but in the latter case it would need to parse all the way to the end of the string.

Be sure you also read and understand that part of the docs:

Additionally, while /^a/, /^a./, and /^a.$/ match equivalent strings, they have different performance characteristics. All of these expressions use an index if an appropriate index exists; however, /^a./, and /^a.$/ are slower. /^a/ can stop scanning after matching the prefix.

Beyond that there's not a lot you can do if you cannot add an index, I suppose.

dnickless
  • 10,733
  • 1
  • 19
  • 34
  • I understand the docs. And I agree with adding ^. But for me matching /^a/ and /^a.*/ is exactly the same /^a/ will stop after matching. So it should do /^a.*/ . Of course /^a.*$/ will be slower. But once again matching "foo_bar" should be slower than matching "foo_.*": we only have to compare the first four characters in the former case. In the first case we have to compare the 7 chars. The regex should be faster. – Juan Chô Jan 19 '18 at 13:49
  • The `.` character can have different meanings depending on engine settings like multi line vs. single line which, I guess, is part of the reason why the engine does not just ignore .* at the end of a regex. – dnickless Jan 19 '18 at 13:52
  • Yes perhaps /^a.*/ scans until end or new line. Being perl compatible dot . matches any char but the new line. OK fair enought. Still this does not explains why matching "foo_bar" is significantly faster than matching "^foo_"? – Juan Chô Jan 19 '18 at 14:06
  • Is it still a lot faster? **With** the caret sign? – dnickless Jan 19 '18 at 14:08
  • It could be that it's because of the `$or`... Let me see if I can combine the two regexs into one. – dnickless Jan 19 '18 at 14:09
  • Yes it is faster. But the query with full text "foo_bar" executes in 0.5s. The query (It bring more elements of course) "^foo" takes about 3 minutes. "foo_.*" takes about 12 minutes – Juan Chô Jan 19 '18 at 15:44
  • Try using .explain(): https://docs.mongodb.com/manual/reference/method/cursor.explain/ – dnickless Jan 19 '18 at 15:46
  • Sorry "^foo" takes 10s. – Juan Chô Jan 19 '18 at 16:12