
I'm trying to count occurrences of a word in a message.

I've got this line of code:

 var nbOccurences = Regex.Matches(haystack, needle, RegexOptions.CultureInvariant | RegexOptions.IgnoreCase).Count;

This works fine for, e.g., "bob" in the message "my name is bob".

But since the message can be in French, I'd like to find "chene", "chène", "chêne"... when looking for "chene". Right now, words with accents don't come up as results.

I thought that adding RegexOptions.CultureInvariant would help, but it doesn't seem to.

Any help would be appreciated.

Arsen Mkrtchyan

2 Answers

You can use this method to convert accented letters to their base letters:

// Requires: using System.Globalization; using System.Text;
string RemoveDiacritics(string stIn)
{
    // Decompose each accented letter into a base letter plus combining mark(s).
    var stFormD = stIn.Normalize(NormalizationForm.FormD);
    var sb = new StringBuilder();
    foreach (var ch in stFormD)
    {
        // Keep every character except the combining (non-spacing) marks.
        if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
            sb.Append(ch);
    }

    // Recompose what is left into the canonical composed form.
    return sb.ToString().Normalize(NormalizationForm.FormC);
}

And then:

var haystack = "chêne name is chène";
var needle = "chène";
var nbOccurences = Regex.Matches(RemoveDiacritics(haystack), RemoveDiacritics(needle), RegexOptions.CultureInvariant | RegexOptions.IgnoreCase).Count;

nbOccurences will be equal to 2.
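
One caveat with the call above: needle is passed to Regex.Matches as a pattern, so a search term containing characters such as "." or "(" would be read as regex syntax. A minimal sketch, assuming the term should be matched literally, runs it through Regex.Escape first. The names DiacriticSearch, StripMarks and CountLiteral are hypothetical; StripMarks is only a LINQ rewrite of RemoveDiacritics so that the block compiles on its own.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class DiacriticSearch
{
    // Hypothetical helper: same idea as RemoveDiacritics, written with LINQ.
    static string StripMarks(string s) =>
        new string(s.Normalize(NormalizationForm.FormD)
                    .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                    .ToArray())
            .Normalize(NormalizationForm.FormC);

    // Counts literal, case-insensitive, accent-insensitive matches of needle.
    public static int CountLiteral(string haystack, string needle)
    {
        // Escape the needle so regex metacharacters are matched as plain text.
        var pattern = Regex.Escape(StripMarks(needle));
        return Regex.Matches(StripMarks(haystack), pattern,
            RegexOptions.CultureInvariant | RegexOptions.IgnoreCase).Count;
    }
}
```

Usage would look like DiacriticSearch.CountLiteral(haystack, needle) with the same haystack and needle as above.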

Wiktor Stribiżew

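That option (RegexOptions.CultureInvariant) only changes how RegexOptions.IgnoreCase behaves: case-insensitive comparisons use the invariant culture's casing rules instead of the current culture's. It does not make the match ignore accents. From https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions(v=vs.90).aspx: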

CultureInvariant

Specifies that cultural differences in language is ignored. See Performing Culture-Insensitive Operations in the RegularExpressions Namespace for more information.

I don't think there is a way to do what you want with Regex other than searching a copy of the text with the diacritics removed (see for example "How do I remove diacritics (accents) from a string in .NET?").

Note that if you simply want to check whether a word is present, you can use CompareInfo.IndexOf:

var compareinfo = CultureInfo.InvariantCulture.CompareInfo;
var index = compareinfo.IndexOf("My name is chêne", "chene", CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase);
bool found = index > -1;

(taken from "allow accented characters to be searchable?")
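
Since the question asks for a count rather than a yes/no answer, the same IndexOf call can be repeated from the end of each match. This is only a sketch: AccentInsensitiveCounter and CountOccurrences are hypothetical names, and it assumes each match spans exactly needle.Length characters, which holds for precomposed text such as "chêne" but not necessarily for text stored in decomposed form.

```csharp
using System;
using System.Globalization;

static class AccentInsensitiveCounter
{
    // Hypothetical helper: counts non-overlapping matches of needle in haystack,
    // ignoring case and non-spacing marks (accents).
    public static int CountOccurrences(string haystack, string needle)
    {
        if (string.IsNullOrEmpty(haystack) || string.IsNullOrEmpty(needle))
            return 0;

        var compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        const CompareOptions options =
            CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;

        var count = 0;
        var start = 0;
        while (start < haystack.Length)
        {
            // Search the remainder of the string, starting at 'start'.
            var index = compareInfo.IndexOf(haystack, needle, start, options);
            if (index < 0)
                break;

            count++;
            // Assumption: the matched text is as long as the needle.
            start = index + needle.Length;
        }
        return count;
    }
}
```

For example, AccentInsensitiveCounter.CountOccurrences("chêne name is chène", "chene") scans the message once and counts both accented spellings.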

xanatos