How to correctly design a regular expression in pymongo?

Question

I use python 3.7.1 (default, Dec 14 2018, 19:28:38), and pymongo 3.7.2.

In mongodb this works:

db.collection.find(
    {$and:[
    {"field":{$regex:"bon?"}},
    {"field":{$not:{$regex:"bon souple"}}},
    {"field":{$not:{$regex:"bon léger"}}}
    ]}
    )

So in pymongo I did the same as:

db.collection.find(
    {"$and":[
    {"field":{"$regex":"bon?"}},
    {"field":{"$not":{"$regex":"bon souple"}}},
    {"field":{"$not":{"$regex":"bon léger"}}}
    ]}
    )

but it raises pymongo.errors.OperationFailure: $regex has to be a string.

So I tried this as proposed here:

liste_reg=[
{'field': {'$regex': {'$not': re.compile('bon souple')}}}, 
{'field': {'$regex': {'$not': re.compile('bon léger')}}}, 
{'field': {'$regex': re.compile('bon?')}}
]
rslt=list(
    db.collection.find({"$and":liste_reg})
)

I noticed that it raises the same error even when the pattern contains no special character:

liste_reg=[
{'field': {'$regex': {'$not': re.compile('bon souple')}}} #where no special char is present
]
rslt=list(
    db.collection.find({"$and":liste_reg})
)

So I tried to use "/" as:

liste_reg=[
{'field': {'$regex': {'$not':'/bon souple/'}}} #where no special char is present
#even tried re.compile('/bon souple/')
]
rslt=list(
    db.collection.find({"$and":liste_reg})
)

The same error, pymongo.errors.OperationFailure: $regex has to be a string, still occurs.

What can I do?

UPDATE ON MY SEARCH FOR A SOLUTION

The core of the issue seems to be $not, because when I do:

liste_reg=[{'field': {'$regex': 'bon?'}}]
rslt=list(
    db.collection.find({"$and":liste_reg})
)
len(rslt)  # gives 23013, which is correct

There is no error.

SOME SAMPLES

As asked by Emma, here is a sample, which should make my request in mongo clearer. Normally, the field should only contain these values:

sec
très léger
léger
bon léger
bon
bon souple
souple
très souple
collant
lourd
très Lourd
profond

The main problem is that my spider did not parse the pages correctly, because the script I wrote was not robust enough. Instead of obtaining just "bon", I obtain this kind of result:

{"_id":"ID1",
"field":"bon\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\tnon",
...}

and that's one example among many other parsing errors. That is why I want the results that match "bon?" but not "bon souple" or "bon léger": those two have correct values, with no \n or \t.

So as samples:

[{"_id":"ID1",
"field":"bon\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\tnon"},
{"_id":"ID2",
"field":"bon\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\tpremière"},
{"_id":"ID3",
"field":"bon\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t2ème"},
{"_id":"ID4",
"field":"bon souple"},
{"_id":"ID5",
"field":"bon léger"}]

@Emma I did an update with a sample of what you asked. Or at least what I think you asked. — AvyWam, May 29 '19 at 19:42
@Emma as you said, it works in your DEMO. But I cannot explain why, when I run `db.collection.find({"field":{$regex:"bon[^\s].+"}})` in the mongo shell in robo3t, the first document returned is `{ "_id" : "364714",..., "field" : "bon léger"}`. I checked with View document that it is not a special case like `"bon\t\t\t\t\nléger"`, and it really is `"bon léger"`. So in my mongo shell the pattern does match the space. Besides, in pymongo I get an empty list with `len(list(db.geny_rapp.find({'etat_terrain': {'$regex': "bon[^\s].+"}})))`. — AvyWam, May 29 '19 at 20:45
@Emma honestly, I have another way to solve my problem without regex, using set operations: setA - setB gives the set I want. But as I said, it is more complicated, and that's not the goal. — AvyWam, May 29 '19 at 20:49

score 4 · Accepted Answer · answered May 30 '19 at 19:22

I just ran into this same issue.

Try doing this:

import re

# Pass the compiled pattern straight to $not; do not wrap it in $regex.
liste_reg=[
{'field': {'$not': re.compile('bon souple')}}, 
{'field': {'$not': re.compile('bon léger')}}, 
{'field': {'$regex': re.compile('bon?')}}
]
rslt=list(
    db.collection.find({"$and":liste_reg})
)

I just removed the $regex part of the query.

Background

I tried doing {item["type"]: {"$not": item['name']}} and pymongo returned a $not needs a regex or a document error.

So, I tried: {item["type"]: {"$not": {"$regex": item['name']}}} and pymongo returned a $not cannot have a regex error.

I found this SO https://stackoverflow.com/a/20175230/9069964 and here's what finally worked for me:

item_name = item["name"]
{item["type"]: {"$not": re.compile(item_name)}}

I had to ditch the "$regex" part and give "$not" my regex stuff.
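
To make the shape of the working filter easier to check without a database, here is a minimal sketch. The helper names `bon_only_filter` and `matches_locally` are hypothetical, and the local matcher is only a rough client-side emulation of how the server would evaluate `$not` with a compiled pattern; the sample values are shortened versions of the ones in the question.

```python
import re

def bon_only_filter(field="field"):
    """Build the $and filter from the answer above (hypothetical helper)."""
    return {"$and": [
        {field: {"$regex": re.compile(r"bon?")}},
        # $not receives the compiled pattern directly, with no $regex wrapper.
        {field: {"$not": re.compile(r"bon souple")}},
        {field: {"$not": re.compile(r"bon léger")}},
    ]}

def matches_locally(doc, query, field="field"):
    """Rough client-side emulation of the filter, for sanity checks only."""
    value = doc[field]
    for clause in query["$and"]:
        cond = clause[field]
        if "$not" in cond:
            if cond["$not"].search(value):
                return False
        elif not cond["$regex"].search(value):
            return False
    return True

samples = [
    {"_id": "ID1", "field": "bon\r\n\t\t\tnon"},
    {"_id": "ID4", "field": "bon souple"},
    {"_id": "ID5", "field": "bon léger"},
]
query = bon_only_filter()
kept = [d["_id"] for d in samples if matches_locally(d, query)]
print(kept)  # ['ID1']
```

With a live connection, the same `query` dict would be passed as `db.collection.find(query)`.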

That's great! It works, and that's totally in the spirit of my code. Besides, it shows how to use '$not' instead of avoiding it. — AvyWam, May 31 '19 at 17:59

score 1 · Answer 2 · answered May 29 '19 at 23:05

Try using a raw string literal with a positive lookahead. The example below should work as long as a carriage return (\r) follows 'bon'.

import re
bon = re.compile(r'bon(?=\r)')
db.collection.find({'field': bon})
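
For comparison, a negative lookahead could express the exclusions in a single pattern. This is only a sketch under assumptions: the `^` anchor and the exact suffixes " souple" and " léger" are my guesses at what should be excluded, and the check below runs with Python's `re` locally rather than on the server.

```python
import re

# Assumption: values starting with "bon souple" or "bon léger" are the clean
# ones to exclude; any other value starting with "bon" is kept.
pattern = re.compile(r"^bon(?! souple| léger)")

samples = [
    "bon\r\n\t\tnon",  # badly parsed value
    "bon souple",
    "bon léger",
    "bon",             # a clean "bon" is kept as well
]
hits = [s for s in samples if pattern.search(s)]
print(hits)  # ['bon\r\n\t\tnon', 'bon']
```

Since the snippet above passes a compiled pattern as the field value, the same should be possible here with `db.collection.find({'field': pattern})`. Note that a clean "bon" value is matched too.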

chuck_sum


`len(list(db.collection.find({'field': {'$regex': re.compile(r'bon(?=\r)')}})))` gives me 19 documents, while I expect 22242. I think I will solve my problem another way than with regex alone, using the properties of set objects. – AvyWam May 30 '19 at 12:38
Might be easier just to clean up your data. `bon_dirty = 'bon\r\n\t' bon_clean = bon_dirty.strip()` – chuck_sum May 30 '19 at 13:03
Well, I did a dump of my collection, and now it's clear: that's what I expect. It returns the same number of documents as mongo with $not. But it's still a mystery why `re.compile()` does not work for me while it does for [others](https://groups.google.com/forum/#!topic/mongodb-user/FdFJWzmKfds). – AvyWam May 30 '19 at 13:27
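
On the clean-up suggestion in the comments above: in the question's samples the stray whitespace is followed by more text ("non", "première", "2ème"), so `.strip()` alone would leave it in place. A minimal sketch, assuming the wanted value is everything before the first carriage return, newline, or tab (`clean_field` is a hypothetical helper):

```python
import re

def clean_field(value):
    """Keep the text before the first CR, LF, or tab, then trim spaces."""
    return re.split(r"[\r\n\t]", value, maxsplit=1)[0].strip()

dirty = "bon\r\n\t\t\t\t\r\n\t\t\tnon"
print(repr(clean_field(dirty)))         # 'bon'
print(repr(clean_field("bon souple")))  # 'bon souple'
```

Values that are already clean pass through unchanged, because an ordinary space is not treated as a separator.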

Emma · Answer 3 · 2019-05-29T21:15:37.277

Here, we might be able to solve this problem without using the $not feature. For instance, if we wish to exclude bon souple and bon léger, which are bon followed by a space, we could maybe use an expression similar to:

"bon[^\s].+"

I'm not sure what we wish to extract here; I was guessing that we want to match bon values that are not followed by a space and that sit between the " characters.
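
A quick local check of this expression against shortened sample values from the question may help. It uses Python's `re`, not the server's regex engine, so treat it as an approximation. The second pattern shows what remains if the backslash is dropped, which is what a JavaScript double-quoted string such as `"bon[^\s].+"` does to `\s`; that could be one explanation for the different results in the mongo shell and in pymongo.

```python
import re

samples = {
    "ID1": "bon\r\n\t\t\tnon",
    "ID4": "bon souple",
    "ID5": "bon léger",
}

# \s kept as the whitespace class (what a Python string passes along).
with_class = re.compile(r"bon[^\s].+")
# Backslash lost: the class becomes "any character except the letter s".
without_backslash = re.compile(r"bon[^s].+")

kept_with_class = [k for k, v in samples.items() if with_class.search(v)]
kept_without = [k for k, v in samples.items() if without_backslash.search(v)]

print(kept_with_class)  # []
print(kept_without)     # ['ID4', 'ID5']
```

In the first case "bon" is always followed by whitespace (a space or `\r`), so nothing matches, including the badly parsed values.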

Also, we would likely want to check the regex query requirements and adjust our expressions accordingly, if necessary, for example by escaping or by using a capturing group:

(bon[^\s].+)

or:

"(bon[^\s].+)"

or:

\"(bon[^\s].+)\"

or:

([\s\S]*?)\"(bon[^\s].+)\"

I'm not quite sure whether this is what we want or whether it is relevant, yet according to this documentation, we can try using:

{ name: { $regex: /([\s\S]*?)\"(bon[^\s].+)\"/, $options: "mi" } }

or:

{ name: { $regex: '([\s\S]*?)\"(bon[^\s].+)\"', $options: "mi" } }

db.collection.find

db.collection.find({"field":{ $regex: /(bon[^\s].+)/, $options: "mi" }})

or:

db.collection.find({"field":{ $regex: /(bon[^\s].+)/, $options: "si" }})

Reference:

PyMongo $in + $regex

Performing regex Queries with pymongo

Doing `db.collection.find({"field":{$regex:"\"(bon[^\s].+)\"" }})` or `db.collection.find({"field":{$regex:"([\s\S]*?)\"(bon[^\s].+)\""}})` gives: `Fetched 0 record(s) in 55ms`. Note I entered `"\"(bon[^\s].+)\""` and not `\"(bon[^\s].+)\"`, same for `([\s\S]*?)\"(bon[^\s].+)\"`, because it raises error in mongo shell. — AvyWam, May 29 '19 at 21:04
`db.collection.find({"field":{ $regex: /([\s\S]*?)\"(bon[^\s].+)\"/, $options: "mi" }})` no error, but result is 0. — AvyWam, May 29 '19 at 21:08
