In Python 3.x, the function locale.strxfrm(s) internally uses the POSIX C function wcsxfrm(), whose result depends on the current LC_COLLATE setting. The POSIX standard defines the transformation this way:
The transformation shall be such that if wcscmp() is applied to two transformed wide strings, it shall return a value greater than, equal to, or less than 0, corresponding to the result of wcscoll() applied to the same two original wide-character strings.
This definition can be implemented in multiple ways, and doesn't even require that the resulting string is readable.
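As a rough illustration of that property from the Python side, here is a minimal sketch. It assumes the portable "C" locale and plain ASCII words (both my choices, so that it behaves the same on any platform), and the helper names sort_by_strcoll and sort_by_strxfrm are made up for this example. Sorting with locale.strcoll() as the comparison should give the same order as sorting on the keys returned by locale.strxfrm():

```python
import functools
import locale

# Assumption: use the portable "C" locale so the sketch does not depend
# on which locales are installed on the machine.
locale.setlocale(locale.LC_COLLATE, "C")


def sort_by_strcoll(words):
    """Sort by comparing the original strings pairwise with strcoll()."""
    return sorted(words, key=functools.cmp_to_key(locale.strcoll))


def sort_by_strxfrm(words):
    """Sort by comparing the transformed keys returned by strxfrm()."""
    return sorted(words, key=locale.strxfrm)


words = ["pear", "Apple", "apple", "banana"]
print(sort_by_strcoll(words) == sort_by_strxfrm(words))
```

Only the relative order of the keys matters here; the keys themselves are never meant to be displayed.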
I've created a little C code example to demonstrate how it works:
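#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    wchar_t buf[10];
    const wchar_t *in = L"\x10fefd";
    size_t i, n;

    /* Select the collation rules used by wcsxfrm(). */
    if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL) {
        fprintf(stderr, "locale en_US.UTF-8 not available\n");
        return 1;
    }

    /* Print the input as hexadecimal wide-character values. */
    printf("in : ");
    for (i = 0; i < 10 && in[i]; i++)
        printf(" 0x%x", (unsigned int)in[i]);
    printf("\n");

    /* Transform; a return value >= 10 means buf was too small. */
    n = wcsxfrm(buf, in, 10);
    if (n >= 10) {
        fprintf(stderr, "transformed string does not fit\n");
        return 1;
    }

    /* Print the transformed string the same way. */
    printf("out: ");
    for (i = 0; i < n; i++)
        printf(" 0x%x", (unsigned int)buf[i]);
    printf("\n");
    return 0;
}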
It prints the string before and after the transformation.
Running it on Linux (Debian Jessie) gives this result:
in : 0x10fefd
out: 0x1 0x1 0x1 0x1 0x552
while running it on OSX (10.11.1) the result is:
in : 0x10fefd
out: 0x103 0x1 0x110000
You can see that the output of wcsxfrm() on OSX contains the value 0x110000. That is one past U+10FFFF, the highest valid Unicode code point, so it is not permitted in a Python string, and this is the source of the error.
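To make the limit concrete, here is a small sketch using only built-ins. The helper names is_valid_code_point and to_python_str are hypothetical, and chr() is just a stand-in for illustration: CPython converts the wide-character buffer internally rather than through chr(), but the range restriction is the same one.

```python
import sys


def is_valid_code_point(value):
    """Return True if the integer can be stored as a character in a str."""
    return 0 <= value <= sys.maxunicode


def to_python_str(wide_chars):
    """Build a str from integer code points.

    Raises ValueError if any value is outside the Unicode range.
    """
    return "".join(chr(c) for c in wide_chars)


print(hex(sys.maxunicode))
print(is_valid_code_point(0x10FEFD))
print(is_valid_code_point(0x110000))
```

Calling to_python_str([0x103, 0x1, 0x110000]) on the OSX output above raises a ValueError, while the Linux output converts without problems.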
On Python 2.7 the error is not raised because its locale.strxfrm() implementation is based on the strxfrm() C function, which works on byte strings rather than wide characters.
UPDATE:
Investigating further, I see that the LC_COLLATE definition for en_US.UTF-8 on OSX is a symbolic link to the la_LN.US-ASCII definition.
$ ls -l /usr/share/locale/en_US.UTF-8/LC_COLLATE
lrwxr-xr-x 1 root wheel 28 Oct 1 14:24 /usr/share/locale/en_US.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
I found the actual definition in Apple's sources. The content of the file la_LN.US-ASCII.src is the following:
order \
\x00;...;\xff
2nd UPDATE:
I've further tested the wcsxfrm() function on OSX. Using the la_LN.US-ASCII collation, given a sequence of wide characters C1..Cn as input, the output is a string of this form:
W1..Wn \x01 U1..Un
where
Wx = 0x103 if Cx > 0xFF else Cx+0x3
Ux = Cx+0x103 if Cx > 0xFF else Cx+0x3
Using this algorithm, \x10fefd becomes 0x103 0x1 0x110000, because 0x10fefd + 0x103 = 0x110000.
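The rule above can be written down as a short Python model. This is only a sketch of the behaviour I observed, not Apple's actual implementation, and the function name osx_wcsxfrm_model is made up for this example. It works on integer code points because the result may not be representable as a Python string.

```python
def osx_wcsxfrm_model(code_points):
    """Model of the observed output: W1..Wn, 0x01, U1..Un (as integers)."""
    # W part: every character above 0xFF collapses to the same value 0x103.
    w_part = [0x103 if c > 0xFF else c + 0x3 for c in code_points]
    # U part: characters above 0xFF are shifted by 0x103, the rest by 0x3.
    u_part = [c + 0x103 if c > 0xFF else c + 0x3 for c in code_points]
    return w_part + [0x01] + u_part


result = osx_wcsxfrm_model([0x10FEFD])
print(" ".join(hex(v) for v in result))  # 0x103 0x1 0x110000
```

Any input character from U+10FEFD upwards is pushed past U+10FFFF by the 0x103 shift, which is how the out-of-range value appears.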
I've checked, and every UTF-8 locale on OSX uses this collation definition, so I'm inclined to say that collation support for UTF-8 on Apple systems is broken. The resulting ordering is almost the same as the one obtained with a plain byte comparison, with the added "bonus" that the transformation can produce illegal Unicode characters.