6

I'm studying the string.Normalize() method, and I thought it was used to compare strings for equality when they use different Unicode representations.

Here's what I've done so far. Is string.Equals() not what I'm supposed to use here?

        string stra = "á";
        string straNorm = stra.Normalize();
        string strFormC = stra.Normalize(NormalizationForm.FormC);
        string strFormD = stra.Normalize(NormalizationForm.FormD);
        string strFormKC = stra.Normalize(NormalizationForm.FormKC);
        string strFormKD = stra.Normalize(NormalizationForm.FormKD);
        Console.WriteLine("norm {0}",straNorm);
        Console.WriteLine("C {0}", strFormC);
        Console.WriteLine("D {0}", strFormD);
        Console.WriteLine("KC {0}", strFormKC);
        Console.WriteLine("KD {0}", strFormKD);

        Console.WriteLine("a".Equals(stra)); //false
        Console.WriteLine("a".Equals(straNorm)); //false
        Console.WriteLine("a".Equals(stra.Normalize())); //false
        Console.WriteLine("a".Equals(strFormC)); //false
        Console.WriteLine("a".Equals(strFormKC)); //false
        Console.WriteLine("a".Equals(strFormKD)); //false
Aderbal Farias
Ronald Abellano
    Does https://stackoverflow.com/questions/3288114/what-does-nets-string-normalize-do or https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html help your understanding? – mjwills Apr 06 '19 at 11:03

2 Answers

10

You can use string.Compare() with CultureInfo.InvariantCulture and CompareOptions.IgnoreNonSpace. Below, I have created a method called CompareStrings(string str1, string str2) that returns a boolean:

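// Requires: using System.Globalization;
public bool CompareStrings(string str1, string str2)
{
    // IgnoreNonSpace makes the comparison skip nonspacing combining characters (diacritics).
    return string.Compare(str1, str2, CultureInfo.InvariantCulture, CompareOptions.IgnoreNonSpace) == 0;
}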

Calling the method to compare strings:

Console.WriteLine(CompareStrings("a", "á"));
Console.WriteLine(CompareStrings("a", "a"));
Console.WriteLine(CompareStrings("a", "b"));

Results:

True
True
False

The documentation defines CompareOptions.IgnoreNonSpace as follows: it "indicates that the string comparison must ignore nonspacing combining characters, such as diacritics. The Unicode Standard defines combining characters as characters that are combined with base characters to produce a new character. Nonspacing combining characters do not occupy a spacing position by themselves when rendered."

You can find out more about CompareOptions in the official .NET documentation.
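As a hedged extension of the method above, the sketch below shows that the same comparison also handles a decomposed input (a base letter followed by a combining accent) and can be combined with CompareOptions.IgnoreCase. The class name AccentInsensitive and the ignoreCase parameter are hypothetical, introduced only for illustration. It assumes the runtime is not running in globalization-invariant mode, where culture-aware comparison options may behave differently.

```csharp
using System;
using System.Globalization;

public static class AccentInsensitive
{
    // Hypothetical helper: culture-invariant comparison that ignores diacritics,
    // and optionally letter case as well.
    public static bool EqualsIgnoringAccents(string a, string b, bool ignoreCase = false)
    {
        CompareOptions options = CompareOptions.IgnoreNonSpace;
        if (ignoreCase)
        {
            options |= CompareOptions.IgnoreCase;
        }
        return string.Compare(a, b, CultureInfo.InvariantCulture, options) == 0;
    }

    public static void Main()
    {
        // \u00E1 is the precomposed "á"; a + \u0301 is the decomposed spelling.
        Console.WriteLine(EqualsIgnoringAccents("a", "\u00E1"));                   // True
        Console.WriteLine(EqualsIgnoringAccents("a", "a\u0301"));                  // True
        Console.WriteLine(EqualsIgnoringAccents("A", "\u00E1", ignoreCase: true)); // True
        Console.WriteLine(EqualsIgnoringAccents("a", "b"));                        // False
    }
}
```

Writing the accented characters as \u escapes keeps the example independent of the source file's encoding.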

Aderbal Farias
5

After normalization to form D or KD, the string contains two characters: the base letter followed by a combining diacritical mark. The comparison therefore has to be made against the base letter only.

string stra = "á";

string strFormC = stra.Normalize(NormalizationForm.FormC);
string strFormD = stra.Normalize(NormalizationForm.FormD);
string strFormKC = stra.Normalize(NormalizationForm.FormKC);
string strFormKD = stra.Normalize(NormalizationForm.FormKD);

Console.WriteLine("C {0}", strFormC.Length); // 1
Console.WriteLine("D {0}", strFormD.Length); // 2
Console.WriteLine("KC {0}", strFormKC.Length); // 1
Console.WriteLine("KD {0}", strFormKD.Length); // 2

Console.WriteLine("a".Equals(strFormD[0].ToString())); // True
Console.WriteLine("a".Equals(strFormKD[0].ToString())); // True
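To see why the lengths differ, it can help to print the code units of each form. The following is a minimal sketch; the class NormalizationDemo and the helper CodePoints are hypothetical names. It also illustrates what Normalize() is actually meant for: making two different spellings of the same character ("á" as one code point versus "a" plus a combining accent) compare equal, rather than making "á" equal to "a".

```csharp
using System;
using System.Linq;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: formats each UTF-16 code unit of s as U+XXXX.
    public static string CodePoints(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));

    public static void Main()
    {
        string precomposed = "\u00E1"; // "á" as a single code point
        string decomposed = "a\u0301"; // "a" followed by a combining acute accent

        Console.WriteLine(CodePoints(precomposed.Normalize(NormalizationForm.FormC))); // U+00E1
        Console.WriteLine(CodePoints(precomposed.Normalize(NormalizationForm.FormD))); // U+0061 U+0301

        // An ordinal comparison sees different code units...
        Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal)); // False

        // ...but once both sides are normalized to the same form, they match.
        Console.WriteLine(string.Equals(
            precomposed.Normalize(NormalizationForm.FormC),
            decomposed.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal)); // True
    }
}
```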

We can remove all diacritical marks with a regular expression.

\p{M} is the Unicode category that matches all combining marks (diacritics).

string stra = "á";

string strFormD = stra.Normalize(NormalizationForm.FormD);

var result = Regex.Replace(strFormD, @"\p{M}", string.Empty);

Console.WriteLine("a".Equals(result)); // True
Console.WriteLine("a" == result); // True
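As an alternative to the regular expression, the same idea can be sketched with CharUnicodeInfo. This is only an illustration: DiacriticStripper and RemoveDiacritics are hypothetical names, and unlike \p{M}, this version filters only the NonSpacingMark category (Mn), not spacing or enclosing marks.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class DiacriticStripper
{
    // Hypothetical helper: decomposes the text, drops nonspacing marks,
    // then recomposes whatever is left.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var kept = decomposed.Where(c =>
            CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
        return new string(kept.ToArray()).Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveDiacritics("\u00E1") == "a");               // True
        Console.WriteLine(RemoveDiacritics("cr\u00E8me br\u00FBl\u00E9e")); // creme brulee
    }
}
```

Stripping marks is lossy, so it suits search or matching keys better than text that will be shown back to the user.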
Alexander Petrov