RegexOptions.CultureInvariant not finding matches for accents

Question

I would like to create a regex that ignores accents.

For instance:

string s = "I am an old élephant";
string pattern = "elephant";
bool result = new Regex(pattern, RegexOptions.CultureInvariant).IsMatch(s);

My culture when I test is:

System.Globalization.CultureInfo.CurrentCulture = fr-FR

So I would have expected this code to find a match but it does not.

Is there an easy way to get a match for this?

I am trying to make a StringReplace overload method that would replace élèphânt with elephant and so on.

"My culture when I test is" irrelevant, since you specified `RegexOptions.CultureInvariant`. — , Nov 25 '16 at 08:48
@A.D. Look at http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net — αNerd, Nov 25 '16 at 09:17

user1519979 · Accepted Answer · 2016-11-25T12:00:55.317

Use the following method:

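    // Requires: using System.Globalization; using System.Text;
    public string removeDiacritics(string str)
    {
        var sb = new StringBuilder();

        // Form D splits each accented character into a base character plus combining marks.
        foreach (char c in str.Normalize(NormalizationForm.FormD))
        {
            // Keep everything except the combining (nonspacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        // Recompose whatever is still decomposed.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }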

Then it works:

        string s = "I am an old élephant";
        string pattern = "elephant";
        bool result = new Regex(pattern, RegexOptions.IgnoreCase).IsMatch(removeDiacritics(s)); //true

If you have to replace something, you can, for example, iterate (backward) through the MatchCollection and edit your original string based on the index of each match.
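
The backward iteration described above could be sketched roughly as follows. `DiacriticReplace`, `RemoveDiacritics` and `ReplaceIgnoringDiacritics` are hypothetical names, not part of the original answer. The sketch assumes that stripping diacritics leaves every character at the same index as in the original string (true for precomposed input such as "élèphânt", but not guaranteed in general), so it gives up when the lengths differ. The replacement is inserted literally, not as a regex substitution pattern.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class DiacriticReplace // hypothetical helper, for illustration only
{
    // Same technique as removeDiacritics: decompose, drop nonspacing marks, recompose.
    public static string RemoveDiacritics(string str)
    {
        var sb = new StringBuilder();
        foreach (char c in str.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static string ReplaceIgnoringDiacritics(string input, string pattern, string replacement)
    {
        string stripped = RemoveDiacritics(input);

        // Assumption: indexes in the stripped string line up with the original.
        // Comparing lengths is only a crude guard against misaligned indexes.
        if (stripped.Length != input.Length)
            throw new ArgumentException("Input is not index-aligned after removing diacritics.", nameof(input));

        var result = new StringBuilder(input);
        MatchCollection matches = Regex.Matches(stripped, pattern, RegexOptions.IgnoreCase);

        // Walk backward so earlier match indexes stay valid while the string changes length.
        for (int i = matches.Count - 1; i >= 0; i--)
        {
            Match m = matches[i];
            result.Remove(m.Index, m.Length);
            result.Insert(m.Index, replacement);
        }
        return result.ToString();
    }
}
```

With this sketch, `DiacriticReplace.ReplaceIgnoringDiacritics("I am an old élèphânt", "elephant", "mouse")` returns "I am an old mouse", provided the accented letters in the input are precomposed characters.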

Explanation (I'm using the "I am an old élephant" string):

Let's write all chars of the original string into a list:

List<char> chars1 = new List<char>();
foreach (char c in str)
{
    chars1.Add(c);
}

As you can see, the é is stored as the single Unicode character 233, or U+00E9 (see http://unicode-table.com/de/#00E9).

The normalisation is explained here https://msdn.microsoft.com/en-us/library/system.text.normalizationform(v=vs.110).aspx

As the documentation says about Form D:

Indicates that a Unicode string is normalized using full canonical decomposition.

That means that the char é is "split up" into an e and an accent char.

To check that, let's output the chars of the normalised string:

List<char> chars2 = new List<char>();
foreach(char c in str.Normalize(NormalizationForm.FormD))
{
    chars2.Add(c);
}

As seen in the watch, the é is now normalised into 2 characters (101 (\u0065) + 769 (\u0301))
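
Without a debugger watch, the same check could be made by printing the code points. This is only a sketch; `CodePointDump`, `Describe` and `DescribeDecomposed` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CodePointDump // hypothetical helper, for illustration only
{
    // Formats every UTF-16 code unit of the string as U+XXXX, separated by spaces.
    public static string Describe(string str)
    {
        return string.Join(" ", str.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    // Same output, but for the Form D (fully decomposed) version of the string.
    public static string DescribeDecomposed(string str)
    {
        return Describe(str.Normalize(NormalizationForm.FormD));
    }
}
```

For the precomposed é, `CodePointDump.Describe("\u00E9")` gives "U+00E9", while `CodePointDump.DescribeDecomposed("\u00E9")` gives "U+0065 U+0301", matching the two values above.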

Now we have to eliminate these accents: iterate through all chars of the normalised string and add each one to the StringBuilder only if it is not a "NonSpacingMark".

MSDN: https://msdn.microsoft.com/en-us/library/system.globalization.unicodecategory(v=vs.110).aspx

NonSpacingMark

Nonspacing character that indicates modifications of a base character. Signified by the Unicode designation "Mn" (mark, nonspacing). The value is 5.

Finally, to ensure that any remaining characters that are now represented by 2 or 3 chars in our string are "converted" back into single Unicode characters, we have to normalise our new string back to Form C.

MSDN: FormC:

Indicates that a Unicode string is normalized using full canonical decomposition, followed by the replacement of sequences with their primary composites, if possible.

@user1519979: Maybe you should elaborate a bit how it works. I understand what you're doing, but I'm not sure everyone does... — Sefe, Nov 25 '16 at 09:20

score 1 · Answer 2 · answered Nov 25 '16 at 08:51

You are specifying a CultureInvariant regex. That means your culture is ignored. So you either have to remove the option...

bool result = new Regex(pattern).IsMatch(s);

...or if you want to be culture independent, expand your pattern:

string pattern = "[eé]lephant";

Sefe


By default, without RegexOptions.CultureInvariant, it does not work either. My understanding was that the RegexOptions.CultureInvariant flag would make it match, but I misunderstood. – A.D. Nov 25 '16 at 08:55
string pattern = "[eé]lephant"; is not what I am looking for as I am looking for a generic method to find matches when comparing a string with accent to a string without accents. I am actually trying to make a StringReplace overload method that would replace élèphânt with elephant and so on. – A.D. Nov 25 '16 at 08:58
If you want to do that, use `String.Equals`. You can specify your culture there. Regex will help you with exact matches; for culture-sensitive searches it's not very useful. You should also update your question to provide that kind of information, otherwise you'll not get what you want. – Sefe Nov 25 '16 at 09:03

αNerd · Answer 3 · 2016-11-25T09:32:42.007

0

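If you want to use a regular expression, note that \p{L} matches any Unicode letter (\P{L} is its negation), so on its own it cannot strip accents. Normalise the string to Form D first and remove the nonspacing marks with \p{Mn}:

        string s = "I am an old ùûüÿàâçéèêëïîô";
        string pattern = @"\p{Mn}"; // nonspacing marks, i.e. the decomposed accents
        var regex = new Regex(pattern);
        // Form D splits each accented letter into a base letter plus a mark
        var result = regex.Replace(s.Normalize(NormalizationForm.FormD), "");
        Console.WriteLine(result);//I am an old uuuyaaceeeeiio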

edited Nov 25 '16 at 09:32

answered Nov 25 '16 at 09:27
