
Autocad DXF and DWG files use Unicode strings to identify layers. I've determined experimentally that Autocad must employ some sort of case folding and normalisation (Autocad considers 'groß' and 'GROSS' to be the same, and 'Am\U+00e9lie' and 'Ame\U+0301lie' to be the same). I'd like to know in my own software whether two layer names are the same according to Autocad. The Default Caseless Matching algorithm from the Unicode standard seems to give me the right answer, but I'd like to be sure.

  1. Can anyone confirm that Default Caseless Matching is the algorithm used by Autocad? If it isn't, what is?

  2. Are there test inputs I can use to distinguish between different caseless matching algorithms?

Peter Graham
  • it depends on locale, not algorithm. [`ß` will be equal to `SS` in some German locales](https://blogs.msdn.microsoft.com/oldnewthing/20030905-00/?p=42643) – phuclv Feb 12 '18 at 03:40
  • @LưuVĩnhPhúc The blog post is about case mapping but the question is about case folding. `ß` will typically be case-folded to `ss` under any locale. Also, the question asks about the Unicode Default Caseless Matching algorithm which isn't locale dependent. – nwellnhof Feb 12 '18 at 12:06
  • @LưuVĩnhPhúc Case operations are different in different languages for sure. The Unicode standard defines default case operations that are supposed to be good general defaults for all languages but I don't know how true that is. My question is whether Autocad is using the default case folding algorithm defined by Unicode or some other system. There are a large number of potential case folding algorithms Autocad might use and I'd like to be sure I have the right one. – Peter Graham Feb 12 '18 at 20:59

2 Answers


I don't have a definite answer, but the Unicode standard defines four algorithms for caseless matching:

  1. Default Caseless Matching (D144): This only uses (full) case folding but no normalization. Since you mentioned that Am\U+00e9lie and Ame\U+0301lie match, this variant can definitely be ruled out.

  2. Canonical caseless matching (D145): This uses (standard NFC or NFD) normalization in addition to case folding.

  3. Compatibility caseless matching (D146): This uses the "compatibility" (NFKC or NFKD) normalization form in addition to case folding.

  4. Identifier caseless matching (D147): Like compatibility caseless matching but also ignores Default Ignorable characters.

So I'd suggest the following additional tests:

  • If \U+0133 (LATIN SMALL LIGATURE IJ with a compatibility mapping) and ij match, then Autocad seems to use compatibility normalization and canonical caseless matching (D145) can be ruled out.

  • If A\U+00adB (SOFT HYPHEN with property Default_Ignorable_Code_Point) and AB match, then Autocad seems to ignore Default Ignorable characters and compatibility caseless matching (D146) can be ruled out.
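The four definitions and the probe strings above can be tried out offline with a rough Python sketch. It assumes `str.casefold()` as the full case folding step and `unicodedata.normalize` for the normalization steps. The standard library does not expose the Default_Ignorable_Code_Point property or NFKC_Casefold, so `IGNORABLE_SUBSET` and `key_identifier` are hand-rolled approximations of D147, not the real thing. All function and variable names here are made up for illustration.

```python
import unicodedata

# ASSUMPTION: a small hand-picked subset of Default_Ignorable_Code_Point.
# The full property is not available in the standard library.
IGNORABLE_SUBSET = {0x00AD, 0x034F, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}


def nfd(s):
    return unicodedata.normalize("NFD", s)


def nfkd(s):
    return unicodedata.normalize("NFKD", s)


def key_default(s):
    # D144: case folding only, no normalization.
    return s.casefold()


def key_canonical(s):
    # D145: NFD(toCasefold(NFD(X)))
    return nfd(nfd(s).casefold())


def key_compat(s):
    # D146: NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
    return nfkd(nfkd(nfd(s).casefold()).casefold())


def key_identifier(s):
    # Approximation of D147: compatibility key with (some) default
    # ignorable code points stripped.
    return "".join(c for c in key_compat(s) if ord(c) not in IGNORABLE_SUBSET)


MATCHERS = {
    "default (D144)": key_default,
    "canonical (D145)": key_canonical,
    "compatibility (D146)": key_compat,
    "identifier (D147, approx.)": key_identifier,
}

# Probe pairs taken from the question and from the tests suggested above.
PROBES = [
    ("gro\u00df", "GROSS"),            # full case folding
    ("Am\u00e9lie", "Ame\u0301lie"),   # canonical equivalence
    ("\u0133", "ij"),                  # compatibility mapping
    ("A\u00adB", "AB"),                # default ignorable (soft hyphen)
]


def run_probes(key):
    """Return one boolean per probe pair: do the two strings match?"""
    return tuple(key(a) == key(b) for a, b in PROBES)


if __name__ == "__main__":
    for name, key in MATCHERS.items():
        print(f"{name:28} {run_probes(key)}")
```

Each matcher produces a different pattern of matches across the four probes, so creating the same pairs as layer names in Autocad and comparing the pattern should narrow down which definition, if any, it follows.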

It's of course possible that Autocad uses none of the Unicode algorithms, but the tests above should help to narrow it down. Please consider posting any additional findings to help other users.

nwellnhof

I intercepted the API calls and discovered that Autocad 2018 on Windows uses `CompareStringW(LOCALE_USER_DEFAULT, NORM_IGNORECASE | SORT_STRINGSORT, ...)` to check layer names for equality.
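For anyone who wants to reproduce that comparison from their own code, here is a minimal sketch using `ctypes`. It only works on Windows. The function name `same_layer_name` is hypothetical, and passing `-1` for both length arguments (null-terminated strings) is my assumption, since the intercepted call above does not show the remaining arguments. Because the call uses `LOCALE_USER_DEFAULT`, results may in principle vary with the user's locale settings.

```python
import ctypes
import sys

# Constants from the Windows SDK headers.
LOCALE_USER_DEFAULT = 0x0400
NORM_IGNORECASE = 0x00000001
SORT_STRINGSORT = 0x00001000
CSTR_EQUAL = 2  # CompareStringW return value meaning "strings are equal"


def same_layer_name(a: str, b: str) -> bool:
    """Compare two names with the flags observed in the intercepted call."""
    if sys.platform != "win32":
        raise OSError("CompareStringW is only available on Windows")

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    result = kernel32.CompareStringW(
        LOCALE_USER_DEFAULT,
        NORM_IGNORECASE | SORT_STRINGSORT,
        ctypes.c_wchar_p(a), -1,  # ASSUMPTION: -1 = null-terminated string
        ctypes.c_wchar_p(b), -1,
    )
    if result == 0:
        # 0 signals failure; surface the Win32 error code.
        raise ctypes.WinError(ctypes.get_last_error())
    return result == CSTR_EQUAL


if __name__ == "__main__" and sys.platform == "win32":
    print(same_layer_name("gro\u00df", "GROSS"))
    print(same_layer_name("Am\u00e9lie", "Ame\u0301lie"))
```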

Peter Graham