Do canonically equivalent Unicode strings collate equal? Sometimes.
#include <clocale>
#include <cstring>
#include <iostream>

int main()
{
    // Select the collation rules of a UTF-8 locale.
    std::setlocale(LC_COLLATE, "en_US.UTF-8");

    // Compare precomposed U+00E9 with decomposed U+0065 U+0301 (both UTF-8 encoded).
    if (std::strcoll("\xc3\xa9", "e\xcc\x81") != 0)
        std::cout << "FAIL: No Unicode normalization here" << std::endl;
    else
        std::cout << "WIN: Unicode normalization is performed" << std::endl;
}
This program prints WIN on my Cygwin-ized Windows machine and FAIL on every Linux system I can get my hands on.
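For anyone reproducing this: the program does not check the return value of setlocale, so a FAIL could also mean that en_US.UTF-8 is simply not installed. In that case the process stays in the "C" locale, where strcoll compares like strcmp. Below is a minimal sketch of a check that separates the two cases; the names `CollateResult` and `collate_check` are made up for illustration and are not part of any library.

```cpp
#include <clocale>
#include <cstring>
#include <string>

// Hypothetical helper type: the three outcomes worth telling apart.
enum class CollateResult { LocaleUnavailable, Equal, Different };

// Hypothetical helper: collate a and b under the named LC_COLLATE locale,
// then restore whatever collation locale was active before the call.
CollateResult collate_check(const char* locale_name, const char* a, const char* b)
{
    // Copy the current locale name first; a later setlocale call may
    // invalidate the pointer returned here.
    const char* current = std::setlocale(LC_COLLATE, nullptr);
    const std::string saved = current ? current : "C";

    // setlocale returns a null pointer when it cannot honor the request.
    if (std::setlocale(LC_COLLATE, locale_name) == nullptr)
        return CollateResult::LocaleUnavailable;

    const int r = std::strcoll(a, b);
    std::setlocale(LC_COLLATE, saved.c_str());

    return r == 0 ? CollateResult::Equal : CollateResult::Different;
}
```

Calling `collate_check("en_US.UTF-8", "\xc3\xa9", "e\xcc\x81")` then distinguishes "locale missing" from "locale present, but the two strings collate differently". Note that some C libraries accept any locale name, so a non-null return from setlocale is not proof that real collation tables were loaded.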
Is this expected behaviour? Are there Linux systems that produce a WIN? What about Mac OS X? FreeBSD?
I know I can normalize strings and test canonical equivalence with third-party libraries. I'm interested in the standard collation rules of UTF-8 locales.
This question is inspired by this one.