
I'm adding yet another Unicode normalization question because I've spent quite a bit of time looking and can't find what I need. I need to normalize Unicode strings to check whether they are equivalent, but I don't understand the consequences of choosing one normal form over another. I would like some valid example Unicode input that normalizes differently under the different forms so I can experiment with the options, but I don't know how to construct it or where to find it. This answer has some example data, but the examples seem to focus on malformed or invalid Unicode strings (I think; maybe I don't know what I'm looking at).

I need a set of strings that users will expect to be equivalent, that an interface will accept as valid, and that are not equal until normalized. Let's say UTF-8 to be specific, but I'd appreciate examples for multiple encodings. I'm working in Python if any answers depend on the implementation, but I imagine others would appreciate answers that are not limited to Python.

Where can I get example Unicode strings that are equivalent under some normal forms and not others, preferably demonstrating how all the normalization forms differ?

shortorian
  • https://stackoverflow.com/questions/15985888/when-to-use-unicode-normalization-forms-nfc-and-nfd has some good examples (including a deleted answer with a rant about the compatibility forms, but it was deleted because the question is about the canonical forms). https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c has some examples from Japanese but I think the summary promises more than the content delivers (or maybe I don't understand enough about the examples, though I can vaguely read the katakana). – tripleee Mar 30 '22 at 16:50
  • Also https://unicode.org/reports/tr15/#Norm_Forms has a good number of examples. – tripleee Mar 30 '22 at 16:53
  • Somehow I didn't find Annex 15 before. If you post an answer, I can accept it. – shortorian Mar 30 '22 at 17:13
  • For the secondary question: the canonical normal forms are equivalent to one another, so there is not much to argue for or against either. Apple prefers the canonical decomposed form (NFD, as in the original Unicode intent) and Microsoft the canonical composed form (NFC). Fonts may have a different preference, but the font library will take care of converting between forms. – Giacomo Catenazzi Mar 31 '22 at 07:11

1 Answer


Unicode Standard Annex #15 (https://unicode.org/reports/tr15/#Norm_Forms) has a good number of examples, with a significant amount of explanation around them.
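To experiment with strings of the kind that report describes, here is a minimal sketch using Python's standard `unicodedata` module. The sample strings (A with ring, the angstrom sign, the "fi" ligature, and long s with dots) are ones I picked for illustration; the names `SAMPLES`, `codepoints`, and `normalize_all` are hypothetical helpers, not part of any library.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Illustrative inputs: all are valid Unicode, and several look identical on screen.
SAMPLES = {
    "precomposed A with ring": "\u00C5",
    "A + combining ring above": "A\u030A",
    "angstrom sign": "\u212B",
    "fi ligature": "\uFB01",
    "long s with dot above + combining dot below": "\u1E9B\u0323",
}


def codepoints(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def normalize_all(text):
    """Return the result of every normalization form for one input string."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


for label, text in SAMPLES.items():
    print(f"{label}: {codepoints(text)}  (UTF-8: {text.encode('utf-8')!r})")
    for form, result in normalize_all(text).items():
        print(f"  {form:<4} -> {codepoints(result)}")
```

The first three samples are unequal as raw strings but become equal under any of the four forms. The ligature is left alone by NFC and NFD and only becomes "fi" under the compatibility forms NFKC and NFKD. The last sample gives four different results, one per form. Normalization operates on code points, so the same examples carry over to other encodings: normalize first, then encode.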

tripleee
  • the tds post you gave in comments above is also useful although it does require a free account to view: https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c – shortorian Mar 30 '22 at 17:53
  • Hmm, I was able to visit it without anything like that, but perhaps they have a limit on how many articles you can view before they paywall you. – tripleee Mar 30 '22 at 17:55
  • In addition, there is the UCD (Unicode Character Database), which holds the official normalization data: https://www.unicode.org/Public/UCD/latest/ucd/ It is in `UnicodeData.txt`. Note: e.g. U+2000 is canonically equivalent to (and normalizes to) U+2002, which in turn is compatibility-equivalent to U+0020. – Giacomo Catenazzi Mar 31 '22 at 07:01
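The U+2000 chain mentioned in that comment can be checked from Python, since `unicodedata.decomposition` exposes the decomposition field from the UCD data bundled with the interpreter. This is only a sketch; the variable names are mine, and the bundled UCD version (see `unicodedata.unidata_version`) depends on the Python release.

```python
import unicodedata

# Decomposition field as recorded in the bundled UCD data.
# A "<compat>" tag marks a compatibility (not canonical) mapping.
for ch in ("\u2000", "\u2002"):
    mapping = unicodedata.decomposition(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping!r}")

en_quad = "\u2000"
canonical = unicodedata.normalize("NFC", en_quad)      # canonical mapping only
compatible = unicodedata.normalize("NFKC", en_quad)    # canonical + compatibility

print(f"NFC : U+{ord(canonical):04X}")    # U+2002
print(f"NFKC: U+{ord(compatible):04X}")   # U+0020
```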