Use of collation in mongodb $regex

Question

Since v3.4 collations are available for find operations, especially as matches for diacritic characters are concerned. While a find query with a definite value ($eq opeartor or corresponding construct) will match letters and correspondent diacritics, the same is not true if a $regex is used in order to achieve a match on a partial search string (a 'LIKE').

Is there a to make the $regex query use the collation the same way than the $eq query?

consider example collection testcoll:

{ "_id" : ObjectId("586b7a0163aff45945462bea"), "city" : "Antwerpen" }, 
{ "_id" : ObjectId("586b7a0663aff45945462beb"), "city" : "Antwërpen" }

this query will find both records

db.testcoll.find({city: 'antwerpen'}).collation({"locale" : "en_US", "strength" : 1});

the same query using a regex will not (finds the record with 'Antwerpen' only)

db.testcoll.find({city: /antwe/i}).collation({"locale" : "en_US", "strength" : 1});

This not supported at the moment https://stackoverflow.com/questions/39160072/mongodb-case-insensitive-index-starts-with-performance-problems — zella, Aug 30 '17 at 04:59

Raphael Soares · Answer 1 · 2022-05-16T12:59:20.100

I face this same issue today and I searched the Internet like crazy trying to find a solution. Didn't find any. So I came up with my on solution, a little frankenstein that worked for me.

I created a function which removes all the special characters from a string and then replaces all the characters that could be special to the equivalent regexp that could be special. In the end I just add a "i" option to cover the capitalize strings in my DB.

export const convertStringToRegexp = (text: string) => {
  let regexp = '';
  const textNormalized = text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '') // remove all accents
    .replace(/[|\\{}()[\]^$+*?.]/g, '\\$&') // remove all regexp reserved char
    .toLowerCase();

  regexp = textNormalized
    .replace(/a/g, '[a,á,à,ä,â,ã]')
    .replace(/e/g, '[e,é,ë,è,ê]')
    .replace(/i/g, '[i,í,ï,ì,î]')
    .replace(/o/g, '[o,ó,ö,ò,õ,ô]')
    .replace(/u/g, '[u,ü,ú,ù,û]')
    .replace(/c/g, '[c,ç]')
    .replace(/n/g, '[n,ñ]')
    .replace(/[ªº°]/g, '[ªº°]');
  return new RegExp(regexp, 'i'); // "i" -> ignore case
};

And in my find() method, I just use this function with $regex option, like this:

db.testcoll.find({city: {$regex: convertStringToRegexp('twerp')} })

/*
Output:
[
  { "_id" : ObjectId("586b7a0163aff45945462bea"), "city" : "Antwerpen" }, 
  { "_id" : ObjectId("586b7a0663aff45945462beb"), "city" : "Antwërpen" }
]
*/

I also create a .spec.ts file (using Chai) to test this function. Of course you could adapt to Jest.


describe('ConvertStringToRegexp', () => {
  it('should convert all "a" to regexp', () => {
    expect(convertStringToRegexp('TAÁdaáh!')).to.deep.equal(
      /t[a,á,à,ä,â,ã][a,á,à,ä,â,ã]d[a,á,à,ä,â,ã][a,á,à,ä,â,ã]h!/i
    );
  });
  it('should convert all "e" to regexp', () => {
    expect(convertStringToRegexp('MEÉeéh!')).to.deep.equal(
      /m[e,é,ë,è,ê][e,é,ë,è,ê][e,é,ë,è,ê][e,é,ë,è,ê]h!/i
    );
  });
  it('should convert all "i" to regexp', () => {
    expect(convertStringToRegexp('VÍIiishí!')).to.deep.equal(
      /v[i,í,ï,ì,î][i,í,ï,ì,î][i,í,ï,ì,î][i,í,ï,ì,î]sh[i,í,ï,ì,î]!/i
    );
  });
  it('should convert all "o" to regexp', () => {
    expect(convertStringToRegexp('ÓOoóhhhh!!!!')).to.deep.equal(
      /[o,ó,ö,ò,õ,ô][o,ó,ö,ò,õ,ô][o,ó,ö,ò,õ,ô][o,ó,ö,ò,õ,ô]hhhh!!!!/i
    );
  });
  it('should convert all "u" to regexp', () => {
    expect(convertStringToRegexp('ÚUhuuúll!')).to.deep.equal(
      /[u,ü,ú,ù,û][u,ü,ú,ù,û]h[u,ü,ú,ù,û][u,ü,ú,ù,û][u,ü,ú,ù,û]ll!/i
    );
  });
  it('should convert all "c" to regexp', () => {
    expect(convertStringToRegexp('Cacacacaca')).to.deep.equal(
      /[c,ç][a,á,à,ä,â,ã][c,ç][a,á,à,ä,â,ã][c,ç][a,á,à,ä,â,ã][c,ç][a,á,à,ä,â,ã][c,ç][a,á,à,ä,â,ã]/i
    );
  });
  it('should remove all special characters', () => {
    expect(
      convertStringToRegexp('hello 123 °º¶§∞¢£™·ª•*!@#$%^WORLD?.')
    ).to.deep.equal(
      /h[e,é,ë,è,ê]ll[o,ó,ö,ò,õ,ô] 123 [ªº°][ªº°]¶§∞¢£™·[ªº°]•\*!@#\$%\^w[o,ó,ö,ò,õ,ô]rld\?\./i
    );
  });
  it('should accept all regexp reserved characters', () => {
    expect(
      convertStringToRegexp('Olá [-[]{}()*+?.,\\o/^$|#s] Mundo! ')
    ).to.deep.equal(
      /* eslint-disable @typescript-eslint/no-explicit-any */
      /[o,ó,ö,ò,õ,ô]l[a,á,à,ä,â,ã] \[-\[\]\{\}\(\)\*\+\?\.,\\[o,ó,ö,ò,õ,ô]\/\^\$\|#s\] m[u,ü,ú,ù,û][n,ñ]d[o,ó,ö,ò,õ,ô]! /i
    );
  });
});

You might want to add some tests with commas in them ;) The RegExp character sets will match any character from that set, including the comma. You don't separate the characters in the character set – those are just lists of characters (and character ranges). So yes, your first test RegExp matches `'ÓOoóhhhh!!!!'`, but it would also match `',,,,hhhh!!!!'`. — pepkin88, Feb 20 '23 at 14:53

score 4 · Accepted Answer · edited Apr 13 '20 at 12:15

4

Documentation

Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.

edited Apr 13 '20 at 12:15

tomerpacific

4,704
13
34
52

answered Apr 13 '20 at 11:36

joseph

76
4

Thx for pointing that out. The problem is therefore yet UNSOLVED :( – Freud Chicken Apr 20 '20 at 18:17

gazdagergo · Answer 3 · 2021-01-20T04:49:22.903

There is no need to use collation on top of regex. You can functionally implement this behaviour using the correct regex.

Considering to the Antwerpen example the following regex gives you all the matches in the database:

/antw[eë]rpen/i

To generate the above regex you have to regex-replace your search string first using the following replace formula:

str.replace(/e/ig, '[eë]')

And of course you have to do it with all diactric character. Also you can simply use the following library: diacritic-regex.

Use of collation in mongodb $regex

3 Answers3

Linked