Unicode emphasizes that software should be as forward-compatible as possible by defaulting to treating unassigned characters as if they were private use code points. This works well in most cases, since most new characters do not change when normalized, case folded, etc.

However, I want to analyze normalization "breaking" changes in Unicode: characters whose properties cause them to change when applying NFx, NFKx, casefold, or NFKC_Casefold normalization. I'm not 100% confident in my understanding of the NFC or NFKC algorithms, and I believe there have been some stability changes that limit the number of special cases. I can limit my analysis to Unicode 4, 5, or even 6 if it means not having to deal with special cases.
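To make "breaking" concrete, here is a small illustration using Python's standard `unicodedata` module (my own examples, not from any particular Unicode version's delta): a compatibility character altered by NFKC, a character altered by casefolding, and a combining mark that participates in composition.

```python
import unicodedata

# U+FB01 (fi ligature) has a compatibility decomposition, so NFKC changes it:
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"

# U+1E9E (capital sharp s) is changed by casefolding:
assert "\u1e9e".casefold() == "ss"

# A combining mark (ccc != 0) participates in composition: A + U+030A
# (combining ring above) composes to U+00C5 under NFC.
assert unicodedata.normalize("NFC", "A\u030a") == "\u00c5"
```

Characters like these are exactly what the property-based filter below is meant to catch.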

My initial stab at this parses the XML Unicode Character Database and selects code points based on the canonical combining class (ccc != 0), the NFx quick-check properties (NFC_QC != 'Y', NFD_QC != 'Y', etc.), and the casefolding/NFKC_Casefold properties (CWKCF = 'Y' or CWCF = 'Y').
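A minimal sketch of that filter over the flat UCD XML (`ucd.all.flat.xml`; element and attribute names follow the UAX #42 schema, but the default values and file handling here are my assumptions):

```python
import xml.etree.ElementTree as ET

UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

def unstable_code_points(source):
    """Yield code points whose UCD properties suggest they are not inert
    under NFx/NFKx normalization or casefolding.  `source` is a path or
    file object for ucd.all.flat.xml (UAX #42 schema)."""
    for _, elem in ET.iterparse(source):
        if elem.tag == UCD_NS + "char" and "cp" in elem.attrib:
            a = elem.attrib
            if (a.get("ccc", "0") != "0"             # non-zero combining class
                    or a.get("NFC_QC", "Y") != "Y"   # quick check not "Yes"
                    or a.get("NFD_QC", "Y") != "Y"
                    or a.get("NFKC_QC", "Y") != "Y"
                    or a.get("NFKD_QC", "Y") != "Y"
                    or a.get("CWCF", "N") == "Y"     # changes when casefolded
                    or a.get("CWKCF", "N") == "Y"):  # changes when NFKC_Casefolded
                yield int(a["cp"], 16)
        elem.clear()  # keep memory bounded over the full UCD
```

Note that the flat UCD also records some blocks as ranges via `first-cp`/`last-cp` attributes rather than `cp`; this sketch skips those records.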

Is this the best approach, or should I just be looking at the decomposition mapping and type?

Indolering