I am using the following regular expression for a filter of an application that connects to a MongoDB database:

{"$regex": re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE | re.UNICODE)}

The regular expression meets my search criteria, but it does not ignore accents. For example:

The database entry is: "Escobar, el patrón del mal Colombia historia".

And I search for "El patron".

I do not get any results because the accent on the letter "o" prevents the record from matching. How can I fix this? I thought the re.UNICODE flag would make the regex ignore accents.

Kache
Diego L
  • The [`UNICODE`](https://docs.python.org/2.7/library/re.html#re.UNICODE) flag is redundant in Python 3, where Unicode matching is already the default for `str` patterns; it is kept only for backward compatibility. In Python 2.7, all it does is change the definition of `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s` and `\S`. If your regex explicitly matches `o`, then it won't automatically match `ó`, which is a different character (as you know). – JDB May 31 '23 at 16:52
  • (`IGNORECASE` will cause the regex to see `ü` and `Ü` as the same character, though [there are flaws](https://haacked.com/archive/2016/02/29/regex-turkish-i/) even to this feature) – JDB May 31 '23 at 16:56
  • You also need to ensure that the [collation](https://www.mongodb.com/docs/manual/reference/collation/) rules used by MongoDB are set appropriately, i.e. set `strength` to `1`, and if you need case distinctions, set `caseLevel` to `true`. – Andj Jun 01 '23 at 08:53

1 Answer

This happens because o and ó are different characters. re.UNICODE does not do what you think it does; you can read about it here: https://docs.python.org/3/library/re.html#re.ASCII
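To make the mismatch concrete, here is a minimal sketch that reuses the pattern and the sample entry from the question. It assumes the stored text uses the precomposed form of ó (U+00F3), which is why the accented letter is written as an escape.

```python
import re

# Sample entry from the question; "\u00f3" is the precomposed "ó" (assumed NFC).
entry = "Escobar, el patr\u00f3n del mal Colombia historia"
value = "El patron"

# Same construction as in the question.
pattern = re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE | re.UNICODE)

# The two letters are distinct code points, so the literal "o" cannot match "ó".
print(hex(ord("o")), hex(ord("\u00f3")))

# IGNORECASE folds "El" and "el", but nothing here folds accents.
no_match = pattern.search(entry)
print(no_match)
```

The flags only affect case folding and the meaning of classes such as `\w` and `\b`; neither of them treats an accented letter as its base letter.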

You can solve this issue by first preprocessing the strings to convert all such characters to their ASCII counterparts before applying the regex. See: What is the best way to remove accents (normalize) in a Python unicode string?
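A minimal sketch of that preprocessing step, using only the standard library. The helper names `strip_accents` and `matches` are hypothetical, and the approach assumes that dropping combining marks after NFD decomposition is acceptable for your data (it turns ñ into n as well, and leaves letters without a decomposition, such as ø, untouched).

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def matches(value, entry):
    """Whole-word, case-insensitive match after stripping accents on both sides."""
    pattern = re.compile(
        r"\b" + re.escape(strip_accents(value)) + r"\b",
        re.IGNORECASE,
    )
    return pattern.search(strip_accents(entry)) is not None

# Sample entry from the question; "\u00f3" is "ó".
entry = "Escobar, el patr\u00f3n del mal Colombia historia"
print(matches("El patron", entry))
```

Note that this runs the regex in Python. A `$regex` filter is evaluated by MongoDB against the stored value, so for the query itself you would likely need to store an accent-stripped copy of the text in a separate field when writing the document and point the filter at that field, with the search value passed through the same helper.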

Kache
  • It depends on what the locale is and whether the letter with a diacritic is considered a base character for the collation rules for that locale. – Andj Jun 01 '23 at 08:50