Where can I get examples of unicode that normalizes differently?

Question

I'm adding yet another unicode normalization question because I've spent quite a bit of time looking and can't find what I need. I have a situation where I need to normalize unicode to check whether strings are equivalent, but I don't understand the consequences of choosing different normal forms. What I would like is some example valid unicode input that normalizes differently so I can play around with the different options, but I don't know how to make it or where to find it. This answer has some example data, but the examples are focused on malformed or invalid unicode strings (I think? Maybe I don't know what I'm looking at). I need a set of strings that users will expect to be equivalent, that an interface will accept as valid, and that are not equal until normalized. Let's say UTF-8 to be specific, but I'd appreciate examples for multiple encodings. I'm working with Python if there are answers that depend on implementation, but I imagine others might appreciate answers that are not limited to Python.

Where can I get example unicode strings that are equivalent under some normal forms and not others, preferably demonstrating how all the normalizations differ?

https://stackoverflow.com/questions/15985888/when-to-use-unicode-normalization-forms-nfc-and-nfd has some good examples (including a deleted answer with a rant about the compatibility forms, but it was deleted because the question is about the canonical forms). https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c has some examples from Japanese but I think the summary promises more than the content delivers (or maybe I don't understand enough about the examples, though I can vaguely read the katakana). — tripleee, Mar 30 '22 at 16:50
Also https://unicode.org/reports/tr15/#Norm_Forms has a good number of examples. — tripleee, Mar 30 '22 at 16:53
somehow didn't find annex 15 before. If you post an answer I can accept it — shortorian, Mar 30 '22 at 17:13
For the secondary question: the canonical normalized forms are equivalent to each other, so there is not much to argue for one over the other. In practice, Apple prefers the canonical decomposed form (as was the original Unicode intent), and Microsoft the canonical composed form. Fonts may have a different preference, but the font library will take care of changing the form. — Giacomo Catenazzi, Mar 31 '22 at 07:11

score 2 · Accepted Answer · answered Mar 30 '22 at 17:48

https://unicode.org/reports/tr15/#Norm_Forms has a good number of examples, and a significant amount of explanation around them.
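As a rough illustration of the kind of example that annex contains, here is a minimal Python sketch using the standard `unicodedata` module. The sample strings are my own selection of well-known cases (the A-ring variants, the fi ligature, a superscript digit, and a long s with two dots), and the helper names `code_points` and `forms` are made up for this sketch. The exact results depend on the Unicode version bundled with your Python build, although these particular characters have been stable for a long time.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def code_points(s):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


def forms(s):
    """Return {form name: normalized string} for all four normal forms."""
    return {form: unicodedata.normalize(form, s) for form in FORMS}


# Hand-picked samples (an assumption of this sketch, not an official test set).
SAMPLES = {
    "A-ring, precomposed": "\u00c5",
    "A + combining ring above": "A\u030a",
    "angstrom sign": "\u212b",
    "fi ligature": "\ufb01",
    "2 + superscript five": "2\u2075",
    "long s with dot above + combining dot below": "\u1e9b\u0323",
}

for label, text in SAMPLES.items():
    print(f"{label}: {code_points(text)}")
    for form, result in forms(text).items():
        print(f"  {form:4} -> {code_points(result)}")
```

The first three samples are canonically equivalent, so every form maps them to the same string. The ligature and the superscript only change under the compatibility forms (NFKC/NFKD). The last sample is the interesting one: all four forms produce a different code point sequence.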

tripleee


the tds post you gave in comments above is also useful although it does require a free account to view: https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c – shortorian Mar 30 '22 at 17:53
Hmm, I was able to visit it without anything like that, but perhaps they have a limit on how many articles you can view before they paywall you. – tripleee Mar 30 '22 at 17:55
In addition, there is also the UCD (Unicode Character Database), which holds the official normalization data: https://www.unicode.org/Public/UCD/latest/ucd/ It is in `UnicodeData.txt`. Note: e.g. U+2000 is canonically equivalent to (and normalizes to) U+2002, which in turn is compatibility-equivalent to U+0020 – Giacomo Catenazzi Mar 31 '22 at 07:01
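The U+2000 chain can be checked from Python without downloading the file, since the standard `unicodedata` module exposes the decomposition field from the UCD. This is only a sketch: the constant and variable names are mine, and the module reflects whichever Unicode version your Python build ships with rather than the latest UCD.

```python
import unicodedata

EN_QUAD, EN_SPACE = "\u2000", "\u2002"

# The raw decomposition mapping: a bare code point means a canonical
# mapping, while a tag such as <compat> marks a compatibility mapping.
for ch in (EN_QUAD, EN_SPACE):
    mapping = unicodedata.decomposition(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: decomposition = {mapping!r}")

# Canonical normalization stops at EN SPACE; compatibility normalization
# continues on to a plain SPACE.
canonical = unicodedata.normalize("NFC", EN_QUAD)
compat = unicodedata.normalize("NFKC", EN_QUAD)
print(f"NFC : U+{ord(canonical):04X}")
print(f"NFKC: U+{ord(compat):04X}")
```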
