Do canonically equivalent Unicode strings collate equal? Sometimes.

#include <iostream>
#include <cstring>
#include <clocale>
int main()
{
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    // Precomposed U+00E9 versus its canonical decomposition U+0065 U+0301.
    if (std::strcoll("\xc3\xa9", "e\xcc\x81"))
      std::cout << "FAIL: No Unicode normalization here" << std::endl;
    else
      std::cout << "WIN: Unicode normalization is performed" << std::endl;
}

This program results in a WIN on my Cygwin-ized Windows machine, and FAIL on every Linux system I can get my hands on.
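One caveat when reproducing this elsewhere: the program above does not check the return value of `std::setlocale`. If `en_US.UTF-8` is not installed, the call returns a null pointer, the collation locale stays as it was (normally `C`), and the byte-wise comparison reports FAIL no matter what the real locale would do. The sketch below separates those cases. `CollationCheck` and `check_equivalents` are hypothetical names introduced here for illustration, not part of any library.

```cpp
#include <clocale>
#include <cstring>
#include <string>

// Hypothetical result type, introduced for illustration only.
enum class CollationCheck { LocaleUnavailable, Equal, Different };

// Hypothetical helper: tries to switch LC_COLLATE to locale_name and reports
// whether precomposed and decomposed "e with acute" collate as equal there.
CollationCheck check_equivalents(const char* locale_name)
{
    // Remember the current LC_COLLATE setting so it can be restored afterwards.
    const char* current = std::setlocale(LC_COLLATE, nullptr);
    const std::string saved = current ? current : "C";

    // setlocale returns a null pointer when the request cannot be honored.
    if (std::setlocale(LC_COLLATE, locale_name) == nullptr)
        return CollationCheck::LocaleUnavailable;

    // U+00E9 versus U+0065 U+0301, both encoded as UTF-8.
    const int cmp = std::strcoll("\xc3\xa9", "e\xcc\x81");

    std::setlocale(LC_COLLATE, saved.c_str());
    return cmp == 0 ? CollationCheck::Equal : CollationCheck::Different;
}
```

Calling `check_equivalents("en_US.UTF-8")` on each system then distinguishes a genuine FAIL (`Different`) from a missing locale (`LocaleUnavailable`). Note that some C libraries accept locale names they do not actually implement, so a non-null return is necessary but not sufficient evidence that the intended collation rules are in effect.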

Is this expected behaviour? Are there Linux systems that produce a WIN? What about Mac OS X? FreeBSD?

I know I can normalize and do canonical equivalence with third-party libraries. I'm interested in standard collation rules of UTF-8 locales.

This question is inspired by this one.

n. m. could be an AI

1 Answer

To the best of my knowledge, neither the C standard, the C++ standard, nor POSIX mentions Unicode normalization.

Therefore, implementations may leave normalization as something to be done explicitly by the programmer.

More specifically, glibc's European locales apparently use ISO 14651 as their collation algorithm. The Unicode Collation FAQ implies that ISO 14651 doesn't do normalization: uniform handling of canonical equivalents is listed as a difference between the UCA and ISO 14651.
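To make "done explicitly by the programmer" concrete, here is a minimal sketch of normalizing both strings before handing them to `std::strcoll`. The composition step is a toy: a hand-written table covering three letters with U+0301, not real NFC. `ToyMapping`, `compose_acute_toy` and `strcoll_composed_toy` are hypothetical names made up for this example; real code would replace the toy with a proper normalization library.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical mapping entry: decomposed UTF-8 sequence -> precomposed form.
struct ToyMapping { const char* decomposed; const char* composed; };

// Toy table for illustration only. This is NOT a full NFC implementation.
static const ToyMapping toy_table[] = {
    {"a\xcc\x81", "\xc3\xa1"},  // a + U+0301 -> U+00E1
    {"e\xcc\x81", "\xc3\xa9"},  // e + U+0301 -> U+00E9
    {"o\xcc\x81", "\xc3\xb3"},  // o + U+0301 -> U+00F3
};

// Replaces every decomposed sequence listed in toy_table with its
// precomposed form and copies all other bytes through unchanged.
std::string compose_acute_toy(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        for (const ToyMapping& m : toy_table) {
            const std::size_t len = std::strlen(m.decomposed);
            if (in.compare(i, len, m.decomposed) == 0) {
                out.append(m.composed);
                i += len;
                matched = true;
                break;
            }
        }
        if (!matched)
            out.push_back(in[i++]);
    }
    return out;
}

// Collates two UTF-8 strings in the current locale after applying the toy
// composition to both, so the covered equivalents reach strcoll as
// identical byte sequences.
int strcoll_composed_toy(const std::string& a, const std::string& b)
{
    const std::string na = compose_acute_toy(a);
    const std::string nb = compose_acute_toy(b);
    return std::strcoll(na.c_str(), nb.c_str());
}
```

With this in place, `strcoll_composed_toy("\xc3\xa9", "e\xcc\x81")` compares two identical byte strings, so the result no longer depends on whether the locale's collation tables handle canonical equivalents.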

ninjalj
  • Neither standard mentions `*.UTF-8` locales either. The language standards mention `C` locale, POSIX adds `POSIX` locale. I'm interested in `*.UTF-8` locales, yes, in the implementation-specific ones. – n. m. could be an AI Nov 27 '13 at 11:18
  • Then you can only get implementation-specific answers, which you cannot rely on. In any case, I expect C/Unix-y behavior to be to not do normalization implicitly. – ninjalj Nov 27 '13 at 11:26
  • An implementation-specific answer is precisely what I'm expecting in response to an implementation-specific question. It says "linux" and "cygwin" and "en_US.UTF-8" quite openly. These are all implementation-specific things. Don't see anything wrong with that. – n. m. could be an AI Nov 27 '13 at 12:30