Unicode emphasizes that software should be as forward-compatible as possible by defaulting to treating unassigned characters as if they were private use code points. This works well in most cases, since most new characters do not change when normalized, case folded, etc.

However, I want to analyze normalization "breaking" changes in Unicode: characters whose properties cause them to change when applying NFx, NFKx, casefold, or NFKC_Casefold normalization. I'm not 100% confident in my understanding of the NFC or NFKC algorithms, and I believe there have been some stability changes that limit the number of special cases. I can limit my analysis to Unicode 4, 5, or even 6 if it means not having to deal with special cases.
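To make "breaking" concrete, here is a small illustration using Python's standard `unicodedata` module (my own examples, not from any particular Unicode version's delta): a compatibility character altered by NFKC, a character altered by casefolding, and a combining mark that participates in composition.

```python
import unicodedata

# U+FB01 (fi ligature) has a compatibility decomposition, so NFKC changes it:
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"

# U+1E9E (capital sharp s) is changed by casefolding:
assert "\u1e9e".casefold() == "ss"

# A combining mark (ccc != 0) participates in composition: A + U+030A
# (combining ring above) composes to U+00C5 under NFC.
assert unicodedata.normalize("NFC", "A\u030a") == "\u00c5"
```

Characters like these are exactly what the property-based filter below is meant to catch.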

My initial stab at this parses the XML Unicode Character Database and selects code points based on the canonical combining class (ccc != 0), the NFx quick-check properties (NFC_QC != 'Y', NFD_QC != 'Y', etc.), and the casefolding/NFKC_Casefold properties (CWKCF = 'Y' or CWCF = 'Y').
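A minimal sketch of that filter over the flat UCD XML (`ucd.all.flat.xml`; element and attribute names follow the UAX #42 schema, but the default values and file handling here are my assumptions):

```python
import xml.etree.ElementTree as ET

UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

def unstable_code_points(source):
    """Yield code points whose UCD properties suggest they are not inert
    under NFx/NFKx normalization or casefolding.  `source` is a path or
    file object for ucd.all.flat.xml (UAX #42 schema)."""
    for _, elem in ET.iterparse(source):
        if elem.tag == UCD_NS + "char" and "cp" in elem.attrib:
            a = elem.attrib
            if (a.get("ccc", "0") != "0"             # non-zero combining class
                    or a.get("NFC_QC", "Y") != "Y"   # quick check not "Yes"
                    or a.get("NFD_QC", "Y") != "Y"
                    or a.get("NFKC_QC", "Y") != "Y"
                    or a.get("NFKD_QC", "Y") != "Y"
                    or a.get("CWCF", "N") == "Y"     # changes when casefolded
                    or a.get("CWKCF", "N") == "Y"):  # changes when NFKC_Casefolded
                yield int(a["cp"], 16)
        elem.clear()  # keep memory bounded over the full UCD
```

Note that the flat UCD also records some blocks as ranges via `first-cp`/`last-cp` attributes rather than `cp`; this sketch skips those records.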

Is this the best approach, or should I just be looking at the decomposition mapping and type?

Indolering