
I have a really strange problem where two seemingly identical strings are returning different matches in MongoDB.

I have copied and pasted the two (apparently different) strings below, directly from Robo3T, with quotation marks on either side to highlight any whitespace that may be present. None is visible.

"KUWAIT: Premier League"
"KUWAIT: Premier League"

Searching with one of these strings returns one set of documents and searching with the other string returns another set of documents. Both those sets of documents should be returned as one.

This is starting to cause me a headache, as I have the string stored in another collection that I look up dynamically, and half the time it doesn't match.

Is there any way I can validate what the issue is here? I've looked at the BSON types and can see only one String $type, which Robo3T confirms.

The problem occurs both through Mongoose and when querying with Robo3T.

Thanks.

Reece Daniels
  • Can you provide a reproducible example? Either on https://mongoplayground.net/ or as a Docker image with some data. – Alex Blex May 17 '19 at 14:07
  • Have you ruled out that there are invisible or similar-looking Unicode characters in the strings? In Python you could check with `repr`. – Daniel F May 18 '19 at 08:25
  • I haven't used Python before, but the .length of both strings is identical, which I have read is a simple way to check for lurking zero-width Unicode characters: https://stackoverflow.com/questions/11305797/remove-zero-width-space-characters-from-a-javascript-string I'll try to sort out a reproducible sample. – Reece Daniels May 19 '19 at 17:32
  • I hadn't come across mongoplayground.net before. It is picking up something in the formatting of the space character that str.length isn't picking up. There is no further guidance on the site, though: https://mongoplayground.net/p/q4eQD4DZoAj – Reece Daniels May 19 '19 at 21:18

1 Answer


Using MongoPlayground I was able to proceed with a logical next step on this and figured out the issue.

I went over to https://www.online-toolz.com/tools/text-unicode-entities-convertor.php and noticed that the space in one of the strings was stored as %A0 (a non-breaking space, U+00A0) rather than %20 (a regular space). I had no idea this was possible.
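For anyone who wants to check this without an online converter, here is a minimal JavaScript sketch that prints the code points of a string and finds the first position where two strings differ. The helper names `codePoints` and `firstDifference` are made up for illustration, and the position of the non-breaking space in the sample string is an assumption. It also shows why comparing .length did not help: U+00A0 and U+0020 each take one UTF-16 code unit.

```javascript
// Hypothetical helper: list each character's code point so look-alike
// characters (such as U+00A0 vs U+0020) become visible.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Hypothetical helper: index of the first position where two strings
// differ, or -1 if they are identical.
function firstDifference(a, b) {
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) return i;
  }
  return -1;
}

const plain = "KUWAIT: Premier League";      // regular spaces (U+0020)
const nbsp = "KUWAIT:\u00a0Premier League";  // NBSP after the colon (assumed position)

console.log(plain.length === nbsp.length);   // true
console.log(firstDifference(plain, nbsp));   // 7
console.log(codePoints(nbsp)[7]);            // U+00A0
```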

So I just need to replace all the %A0 (\u00a0) with %20 across my collections and everything will be good to go.
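A rough sketch of what that replacement could look like, not necessarily the approach I ended up running. `normalizeSpaces` cleans a string in application code before it is saved or looked up. `buildNbspFix` only builds the filter and update documents for an updateMany() call. Both names and the `league` field are placeholders, and the update assumes MongoDB 4.4+ (for `$replaceAll` in an update pipeline) and a plain string field.

```javascript
// Replace every non-breaking space (U+00A0) with a regular space (U+0020).
function normalizeSpaces(str) {
  return str.replace(/\u00a0/g, " ");
}

// Hypothetical builder for the arguments of an updateMany() call.
// Assumes MongoDB 4.4+ and that `field` holds a plain string.
function buildNbspFix(field) {
  return {
    // "\u00a0" in a JavaScript string is the literal character, so the
    // server receives the character itself rather than a regex escape.
    filter: { [field]: { $regex: "\u00a0" } },
    update: [
      {
        $set: {
          [field]: {
            $replaceAll: { input: "$" + field, find: "\u00a0", replacement: " " },
          },
        },
      },
    ],
  };
}

const fix = buildNbspFix("league"); // "league" is a placeholder field name
console.log(normalizeSpaces("KUWAIT:\u00a0Premier League") === "KUWAIT: Premier League"); // true
console.log(JSON.stringify(fix.update));
```

In the shell the result would be passed along as db.<collection>.updateMany(fix.filter, fix.update), ideally against a copy of the data first.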

As an aside, MongoDB doesn't let you query for those Unicode characters using the \u escape, i.e. {$regex: /.*\uxxxx.*/}. You must use {$regex: /.*\x{xxxx}.*/} instead, which I discovered here: MongoDB \uXXXX issue.
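To illustrate, here is a small sketch that builds such a filter with the pattern as a string, so the \x{...} escape reaches the server unchanged. The helpers `pcreEscape` and `containsCodePoint` and the `league` field are hypothetical, and it assumes the server's regex engine accepts the \x{xxxx} form as described above.

```javascript
// Hypothetical helper: format a code point in the \x{...} escape form.
function pcreEscape(codePoint) {
  return "\\x{" + codePoint.toString(16).toUpperCase().padStart(4, "0") + "}";
}

// Build a $regex filter with a string pattern for documents whose
// `field` contains the given code point.
function containsCodePoint(field, codePoint) {
  return { [field]: { $regex: pcreEscape(codePoint) } };
}

const filter = containsCodePoint("league", 0x00a0); // placeholder field name
console.log(JSON.stringify(filter)); // {"league":{"$regex":"\\x{00A0}"}}
```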

I'm surprised this isn't flagged anywhere when saving the document as a potential issue - it would at least be a helpful warning - but I can at least fix my problem now.

Thanks for your comments pointing me in the right direction.

Reece Daniels