Is there a way to use preg_match (e.g. perhaps via a flag) to do diacritic-insensitive matches?

For example, say I'd like it to match:

  • cafe
  • café

I know I can do a regex like this: caf[eé]. This regex will work as long as I don't come across any other diacritic variations of e, like: ê è ë ē ĕ ě ẽ ė ẹ ę ẻ.

Of course, I could just list all of those diacritic variations in my regex, such as caf[eêéèëēĕěẽėẹęẻ]. As long as I don't miss anything, I'll be good. But I would need to do this for every letter of the alphabet, which is tedious and error-prone.

It is not an option for me to find and replace the diacritic letters in the subject with their non-diacritic counterparts. I need to preserve the subject as-is.

The ideal solution for me is for the regex to be diacritic-insensitive. With the example above, I want my regex to simply be: cafe. Is this possible?

StackOverflowNewbie

1 Answer

If you're open to matching a letter from any language (which includes characters with diacritics), then you could use \p{L} or \p{Letter}, as shown here: https://regex101.com/r/UBGQI6/3

According to regular-expressions.info,

\p{L} or \p{Letter}: any kind of letter from any language.

  • \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
  • \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
  • \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
  • \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
  • \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
  • \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.

The only catch is that \p{L} can't target a particular base letter and its diacritic variants (such as È for E), so you can't limit the match to the English letter you have in mind.
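To make the behaviour of \p{L} concrete, here is a minimal sketch using preg_match. The helper name `matchesCafPlusLetter` is made up for illustration and is not part of the original answer. It assumes the strings are UTF-8 and that accented letters are precomposed single code points (é as U+00E9, not e followed by a combining accent).

```php
<?php
// Hypothetical helper (not from the answer): returns true when $word is
// "caf" followed by exactly one Unicode letter.
function matchesCafPlusLetter(string $word): bool
{
    // The u modifier makes PCRE treat the pattern and subject as UTF-8,
    // so \p{L} sees a precomposed "é" as a single letter.
    return preg_match('/^caf\p{L}$/u', $word) === 1;
}

// Assumes this file is saved as UTF-8 with precomposed accented letters.
foreach (['cafe', 'café', 'cafê', 'cafó', 'caf1'] as $word) {
    echo $word, ': ', matchesCafPlusLetter($word) ? 'match' : 'no match', PHP_EOL;
}
```

With those assumptions, every word except caf1 should match, including cafó. That is the trade-off: the subject is left untouched, but the pattern accepts any letter in that position, not only variants of e.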

Community
Robo Mop
  • So, it would match `cafó`? I need to think really hard if this is acceptable in my case. – StackOverflowNewbie Jan 18 '19 at 03:41
  • @StackOverflowNewbie My bad, I assumed you'd want to match multiple words later on. I have another possible solution that I'll try out till then. – Robo Mop Jan 18 '19 at 03:43
  • I'm trying to match a number of words with diacritics. I don't want to accept just any Unicode character. I need to match the "counterparts" of the English letter that I put (e.g. `e` would also match `é`, but not `ó`). – StackOverflowNewbie Jan 18 '19 at 03:53