What does .NET's String.Normalize do?

Question

The MSDN article on String.Normalize states simply:

Returns a new string whose binary representation is in a particular Unicode normalization form.

It also sometimes refers to "Unicode normalization form C."

I'm just wondering, what does that mean? How is this function useful in real-life situations?

Hans Keﬆing · Answer 1 · 2023-06-02T08:57:40.897

One difference between form C and form D is how letters with accents are represented: form C uses a single letter-with-accent codepoint, while form D separates that into a letter and an accent.

For instance, an "à" can be codepoint 224 ("Latin small letter A with grave"), or codepoint 97 ("Latin small letter A") followed by codepoint 768 ("Combining grave accent"). A char-by-char comparison would see these as different. Normalisation lets the comparison succeed.
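To make the comparison point concrete, here is a minimal sketch that builds both representations of "à" and compares them before and after normalization. It assumes a C# version that supports top-level statements, and the variable names are illustrative only.

```csharp
using System;
using System.Text;

// Two ways to write "à": one precomposed code point, or base letter + combining mark.
string composed = "\u00E0";     // U+00E0 (224), Latin small letter a with grave
string decomposed = "a\u0300";  // U+0061 (97) followed by U+0300 (768), combining grave accent

// An ordinal (char-by-char) comparison treats them as different strings.
Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal)); // False

// After normalizing both to the same form, the ordinal comparison succeeds.
string left = composed.Normalize(NormalizationForm.FormC);
string right = decomposed.Normalize(NormalizationForm.FormC);
Console.WriteLine(string.Equals(left, right, StringComparison.Ordinal)); // True

// IsNormalized reports whether a string is already in a given form.
Console.WriteLine(decomposed.IsNormalized(NormalizationForm.FormC)); // False
Console.WriteLine(decomposed.IsNormalized(NormalizationForm.FormD)); // True
```

Which form you normalize to matters less than normalizing both sides to the same one before comparing.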

A side-effect is that this makes it possible to easily create a "remove accents" method.

using System.Globalization;
using System.Linq;
using System.Text;

public static string RemoveAccents(string input)
{
    // Normalizing to FormD splits accented letters into letter + combining accent(s).
    // The filter then drops those accents (and any other non-spacing marks),
    // and a new string is built from the remaining chars.
    return new string(input
        .Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray());
}
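As a usage sketch (the method is repeated so the snippet stands alone, and the sample inputs are my own), the following shows what this approach does and does not strip. Letters such as "Ø" have no canonical decomposition in Unicode, so FormD leaves them intact and the filter has nothing to remove.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Same approach as above: decompose, then drop the non-spacing marks.
static string RemoveAccents(string input) =>
    new string(input
        .Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray());

// Accents that decompose into letter + combining mark are removed.
Console.WriteLine(RemoveAccents("Crème brûlée")); // Creme brulee

// "Ø" (U+00D8) does not decompose, so it passes through unchanged.
Console.WriteLine(RemoveAccents("Øresund")); // Øresund
```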

Or have the "highly secure" ROT13 encoding work with accents:

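// Needs: using System; using System.Linq; using System.Text;
string Rot13(string input)
{
    // Decompose first, so each accented letter becomes a plain letter plus combining marks.
    var v = input.Normalize(NormalizationForm.FormD)
        .Select(c => {
            // Rotate only the basic Latin letters; accents and everything else pass through.
            if ((c>='a' && c<='m') || (c>='A' && c<='M'))
                return (char)(c+13);
            if ((c>='n' && c<='z') || (c>='N' && c<='Z'))
                return (char)(c-13);
            return c;
        });
    // Recompose letter + accent pairs wherever a precomposed code point exists.
    return new String(v.ToArray()).Normalize(NormalizationForm.FormC);
}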

This will turn "Crème brûlée" into "Per̀zr oeĥyŕr" (and vice versa, of course), by first splitting "character with accent" codepoints into separate "character" and "accent" codepoints (FormD), then performing the ROT13 translation on just the letters and afterwards trying to recombine them (FormC).

In the `RemoveAccents` method, you do not really need `.ToCharArray()` since the `string` class is `IEnumerable` in itself (which you also take advantage of in the `Rot13` method). — Jeppe Stig Nielsen, May 31 '23 at 09:37

score 55 · Accepted Answer · answered Jul 20 '10 at 08:22

It makes sure that Unicode strings can be compared for equality (even if they use different code point sequences to represent the same characters).

From Unicode Standard Annex #15:

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
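The "specified order" part of that quote can be illustrated with a small sketch. It assumes top-level statements and uses a sample of my own choosing: the letter "a" with a dot below and an acute accent, where the two combining marks are typed in either order.

```csharp
using System;
using System.Text;

// The same letter with the same two marks, entered in different orders.
string dotThenAcute = "a\u0323\u0301"; // a + combining dot below + combining acute accent
string acuteThenDot = "a\u0301\u0323"; // a + combining acute accent + combining dot below

// The raw code unit sequences differ, so an ordinal comparison fails.
Console.WriteLine(string.Equals(dotThenAcute, acuteThenDot, StringComparison.Ordinal)); // False

// Normalization puts the combining marks into one canonical order.
string n1 = dotThenAcute.Normalize(NormalizationForm.FormD);
string n2 = acuteThenDot.Normalize(NormalizationForm.FormD);
Console.WriteLine(string.Equals(n1, n2, StringComparison.Ordinal)); // True
```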

Excellent answer. Provided link is great! — GeReV, Jul 20 '10 at 08:54

score 6 · Answer 3 · answered Jul 20 '10 at 08:33

In Unicode, a (composed) character can either have a unique code point, or a sequence of code points consisting of the base character and its accents.

Wikipedia lists as example Vietnamese ế (U+1EBF) and its decomposed sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).

string.Normalize() converts a string to one of the four Unicode normalization forms (C, D, KC, or KD).
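In .NET the four forms are the `NormalizationForm` enum values `FormC`, `FormD`, `FormKC` and `FormKD`. The sketch below (top-level statements assumed, sample string chosen by me) prints the UTF-16 code units that each form produces for the Vietnamese "ế" followed by the "ﬁ" ligature, which only the compatibility forms take apart.

```csharp
using System;
using System.Linq;
using System.Text;

// Illustrative input: precomposed "ế" (U+1EBF) followed by the "ﬁ" ligature (U+FB01).
string sample = "\u1EBF\uFB01";

var forms = new[]
{
    NormalizationForm.FormC,   // canonical decomposition, then canonical composition
    NormalizationForm.FormD,   // canonical decomposition only
    NormalizationForm.FormKC,  // compatibility decomposition, then canonical composition
    NormalizationForm.FormKD,  // compatibility decomposition only
};

foreach (var form in forms)
{
    string normalized = sample.Normalize(form);
    // Print each UTF-16 code unit as U+XXXX to make the differences visible.
    string codeUnits = string.Join(" ", normalized.Select(c => $"U+{(int)c:X4}"));
    Console.WriteLine($"{form}: {codeUnits}");
}
```

FormC leaves this sample unchanged, FormD splits "ế" into U+0065 U+0302 U+0301, and the two K forms additionally replace the ligature with the plain letters "f" and "i".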

score 5 · Answer 4 · answered Jul 20 '10 at 08:22

This link has a good explanation:

http://unicode.org/reports/tr15/#Norm_Forms

From what I can surmise, it's so you can compare two Unicode strings for equality.

Adam Houldsworth
