I want to make a word boundary search. For example, suppose you have the following entries:

  1. "the cooks."
  2. "cooks"
  3. " cook."
  4. "the cook is"
  5. "cook."

I want to search for entries that contain "cook" as a whole word. That is, only the 3rd, 4th and 5th entries should be returned.

When I use the \b word boundary in this case, the pattern somehow gets distorted by automatic escaping.

import re, pymongo
# prepare pymongo
collection.find({"entry": re.compile('\bcook\b').pattern})

When I print the query dictionary, the \b becomes \\b.

How can I make a word boundary search using PyMongo? I am able to do this in the MongoDB shell but not with PyMongo.

Muatik
  • I think it needs to be `\\bcook\\b` – David says Reinstate Monica Nov 23 '15 at 22:16
  • yes, `\bcook\b` becomes `\\bcook\\b` – Muatik Nov 23 '15 at 22:20
  • Try [`r'\bcook\b'`](http://stackoverflow.com/questions/2241600/python-regex-r-prefix). – Sam Nov 23 '15 at 22:21
  • I have tried it and the result is the same. I think the reason is that dict creating makes it escaped. `{"field": re.compile(r'\bsome\b').pattern}` – Muatik Nov 23 '15 at 22:26
  • I found this comment: [*It appears that the output in the shell is misleading. The slashes are correctly stored, escaped as ``\\`` (double slash) and the client driver handles the escaping back to single slashes.*](http://stackoverflow.com/questions/11318850/why-doesnt-mongodb-store-my-slashes-in-this-string#comment14899723_11318879). Hope it helps. Did you just try your query? Does it work? – Wiktor Stribiżew Nov 23 '15 at 22:31

3 Answers

Instead of using the pattern attribute, which yields a plain str object (so the query becomes an exact string match), pass the compiled regex pattern object itself.

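import re

# The raw string keeps \b as a regex word boundary; pass the compiled pattern itself.
cursor = db.your_collection.find({"field": re.compile(r'\bcook\b')})

for doc in cursor:
    print(doc)  # replace with your own handling of each matching document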
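To see why the raw string matters and why the printed `\\b` is harmless, here is a minimal sketch that runs without a MongoDB connection. The `entries` list and the field name `entry` are taken from the question; the `$regex` form at the end is an alternative I am adding, not something from the original answer.

```python
import re

# The five entries from the question, numbered 1..5.
entries = ["the cooks.", "cooks", " cook.", "the cook is", "cook."]

plain = "\bcook\b"   # in a normal string literal, "\b" is the backspace character
raw = r"\bcook\b"    # the raw string keeps backslash + "b", a regex word boundary

print(len(plain), len(raw))  # the raw string is two characters longer
print(repr(raw))             # repr() doubles each backslash for display only

pattern = re.compile(raw)
matched = [i for i, text in enumerate(entries, start=1) if pattern.search(text)]
print(matched)               # the entries that contain "cook" as a whole word

# Two ways to express the query document; neither is sent anywhere in this sketch.
query_compiled = {"entry": pattern}          # PyMongo encodes this as a BSON regex
query_operator = {"entry": {"$regex": raw}}  # explicit MongoDB $regex operator
```

So the doubled backslash in the printed dictionary is just how Python displays a single backslash; the string that reaches the server still contains `\b`.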
salmanwahed

Matching all of your cases requires a full-text search index; a simple regex is not sufficient.

For example, you need English stemming to find both "cook" and "cooks". Your regex matches the whole word "cook" between spaces or other word boundaries, not "cooks" or "cooking".

There are many full-text search engines. Research them to decide which one to use:

  • Elasticsearch
  • Lucene
  • Sphinx

PyMongo, I assume, connects to MongoDB, which has built-in full-text indexing in its recent versions.

MongoDB 3.0 has these indexes: https://docs.mongodb.org/manual/core/index-text/
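A rough sketch of what that could look like from PyMongo, assuming `collection` is a `pymongo` collection and `entry` is the field from the question. The helper name `text_search` is hypothetical. Note that, because of stemming, a text search for "cook" is expected to return the "cooks" entries as well, which differs from the exact whole-word match the question asks for.

```python
def text_search(collection, field, term):
    """Hypothetical helper: ensure a text index on `field`, then run a $text search."""
    # create_index leaves an identical existing index in place.
    collection.create_index([(field, "text")])
    # $text searches the collection's text index; $search holds the search terms.
    return collection.find({"$text": {"$search": term}})


# Usage, assuming an existing connection:
#   for doc in text_search(db.your_collection, "entry", "cook"):
#       print(doc)
```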

Mogsdad
Andrew

All of these test cases are handled by a simple re expression in Python. Example:

>>> import re
>>> a = "the cooks."
>>> b = "cooks"
>>> c = " cook."
>>> d = "the cook is"
>>> e = "cook."
>>> tests = [a, b, c, d, e]
>>> for test in tests:
        rc = re.match("[^c]*(cook)[^s]", test)
        if rc:
                print('   Found: "%s" in "%s"' % (rc.group(1), test))
        else:
                print('   Search word NOT found in "%s"' % test)


   Search word NOT found in "the cooks."
   Search word NOT found in "cooks"
   Found: "cook" in " cook."
   Found: "cook" in "the cook is"
   Found: "cook" in "cook."
>>>
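For comparison, here is a small sketch that runs the pattern above next to the `\b` pattern from the question. The two extra inputs, "cook" and "cookie", are my own additions to show where the two patterns disagree; they are not part of the original test cases.

```python
import re

tests = ["the cooks.", "cooks", " cook.", "the cook is", "cook.", "cook", "cookie"]

original = re.compile(r"[^c]*(cook)[^s]")  # pattern used in the session above
boundary = re.compile(r"\bcook\b")         # word boundary pattern from the question

# Map each input to (matched by original, matched by boundary).
results = {}
for text in tests:
    results[text] = (bool(original.match(text)), bool(boundary.search(text)))
    print("%-14r original=%-5s boundary=%s" % ((text,) + results[text]))
```

Both patterns agree on the five original entries. They differ on the additions: the first pattern needs one more character after "cook", so a bare "cook" fails, and it only excludes a following "s", so "cookie" passes.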