I am experiencing an odd behavior when using the locale library with unicode input. Below is a minimum working example:

>>> x = '\U0010fefd'
>>> ord(x)
1113853
>>> ord('\U0010fefd') == 0X10fefd
True
>>> ord(x) <= 0X10ffff
True
>>> import locale
>>> locale.strxfrm(x)
'\U0010fefd'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+110000 is not in range [U+0000; U+10ffff]

I have seen this on Python 3.3, 3.4 and 3.5. I do not get an error on Python 2.7.

As far as I can see, my unicode input is within the valid Unicode range, so it seems that something internal to strxfrm, when using the 'en_US.UTF-8' locale, is moving the input out of range.

I am running Mac OS X, and this behavior may be related to http://bugs.python.org/issue23195... but I was under the impression this bug would only manifest as incorrect results, not a raised exception. I cannot replicate on my SLES 11 machine, and others confirm they cannot replicate on Ubuntu, Centos, or Windows. It may be instructive to hear about other OS's in the comments.

Can someone explain what may be happening here under the hood?

SethMMorton
  • I can't reproduce it on Ubuntu. `locale.strxfrm(x)` returns `'\x01\x01\x01\x01Ւ'` in `en_US.UTF-8` locale. – jfs Nov 01 '15 at 20:45
  • you could use [`icu.Collator.createInstance(icu.Locale('en_US')).getSortKey` instead](http://stackoverflow.com/a/32178778/4279) – jfs Nov 01 '15 at 20:46
  • @J.F.Sebastian Yes, I have used PyICU and confirm there is no problem there. I was more concerned about this behavior in the stdlib `locale` module and if this was some sort of user error (i.e. I did something wrong) or if there is something more nefarious going on. – SethMMorton Nov 01 '15 at 22:01
  • @J.F.Sebastian I am on Mac OS X. I have found other issues in the past with the built-in `locale` library on OSX (see for example http://stackoverflow.com/q/3412933/1399279 and http://bugs.python.org/issue23195). In the past, the problems had always just been incorrect results. I can deal with incorrect results, but when some built-in bug causes my program to halt I raise red flags. – SethMMorton Nov 01 '15 at 22:05
  • No error also on Centos 7 / Python 3.4. – VPfB Nov 11 '15 at 07:44
  • I wonder if this is similar to [MacOSX backend unicode problems in python 3.3](https://github.com/matplotlib/matplotlib/issues/1737/). They were getting a similar error `ValueError: character U+55002f is not in range [U+0000; U+10ffff]`. Quoting from that discussion: "...it appears as if the macosx.m assumes that unichar (from Apple's libraries) and Py_UNICODE (from Python) are the same size. This was true for all versions of Python prior to 3.2, but with 3.3, Python went 4-bytes across the board (at least at the API level)." – Michelle Welcks Nov 11 '15 at 09:10
  • @VPfB Thanks for trying to replicate. If this is in fact related to the [linked bug report](http://bugs.python.org/issue23195) (which only affects BSD-like systems) then I am not surprised that Centos has no issues. – SethMMorton Nov 11 '15 at 15:11
  • I gave it a try on Windows using Anaconda3 distribution (Python 3.4). The locale settings are different `locale.setlocale(locale.LC_ALL, 'English_United States.1252')`, there is no error, the output is `'ÿ\x81·û\x01>?\x01>?\x01\x01'` – rll Nov 13 '15 at 15:23

1 Answer

In Python 3.x, locale.strxfrm(s) internally calls the POSIX C function wcsxfrm(), whose behavior depends on the current LC_COLLATE setting. The POSIX standard defines the transformation this way:

The transformation shall be such that if wcscmp() is applied to two transformed wide strings, it shall return a value greater than, equal to, or less than 0, corresponding to the result of wcscoll() applied to the same two original wide-character strings.

This definition can be implemented in multiple ways, and doesn't even require that the resulting string is readable.
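
As a rough illustration of that contract from the Python side, the sketch below checks that comparing two transformed keys gives the same ordering as locale.strcoll() on the original strings. The helper names `sign` and `same_order` are made up for this example, and it only uses ASCII input under whatever locale is currently active, so it is not a test of any particular platform's collation tables.

```python
import locale


def sign(n):
    """Reduce any integer to -1, 0 or 1."""
    return (n > 0) - (n < 0)


def same_order(a, b):
    """Check that comparing strxfrm() keys agrees with strcoll() for a and b."""
    key_a, key_b = locale.strxfrm(a), locale.strxfrm(b)
    key_order = (key_a > key_b) - (key_a < key_b)
    return key_order == sign(locale.strcoll(a, b))


pairs = [("apple", "banana"), ("banana", "apple"), ("fig", "fig")]
print(all(same_order(a, b) for a, b in pairs))
```

Nothing in the contract says what the keys themselves must look like, only how they compare, which is why different C libraries are free to return very different values.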

I've created a little C code example to demonstrate how it works:

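#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
  wchar_t buf[10];
  const wchar_t *in = L"\x10fefd";
  size_t i, n;

  /* Collation rules come from LC_COLLATE; warn if the locale is missing */
  if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL)
    fprintf(stderr, "warning: en_US.UTF-8 is not available\n");

  /* Print the input code points */
  printf("in :");
  for (i = 0; i < 10 && in[i]; i++)
    printf(" 0x%x", (unsigned int)in[i]);
  printf("\n");

  /* n is the length of the transformed string, excluding the terminator */
  n = wcsxfrm(buf, in, 10);
  if (n >= 10) {
    fprintf(stderr, "wcsxfrm failed or buffer too small\n");
    return 1;
  }

  /* Print the transformed wide characters */
  printf("out:");
  for (i = 0; i < n; i++)
    printf(" 0x%x", (unsigned int)buf[i]);
  printf("\n");
  return 0;
}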

It prints the string before and after the transformation.

Running it on Linux (Debian Jessie) this is the result:

in : 0x10fefd
out: 0x1 0x1 0x1 0x1 0x552

while running it on OSX (10.11.1) the result is:

in : 0x10fefd
out: 0x103 0x1 0x110000

You can see that the output of wcsxfrm() on OSX contains the character U+110000 which is not permitted in a Python string, so this is the source of the error.
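
To make the limit concrete, here is a small Python sketch of why that value cannot be stored in a str. The helper `fits_in_str` is a name invented for this example; it relies only on the built-in chr(), which rejects anything above U+10FFFF.

```python
import sys


def fits_in_str(value):
    """Return True if value is a code point that a Python str can hold."""
    try:
        chr(value)
    except ValueError:
        return False
    return True


print(hex(sys.maxunicode))    # 0x10ffff on Python 3.3 and later
print(fits_in_str(0x10FEFD))  # True: the original input character
print(fits_in_str(0x110000))  # False: the value wcsxfrm() produced on OSX
```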

On Python 2.7 the error is not raised because its locale.strxfrm() implementation is based on the C function strxfrm(), which operates on byte strings rather than wide characters.

UPDATE:

Investigating further, I see that the LC_COLLATE definition for en_US.UTF-8 on OSX is a symbolic link to the la_LN.US-ASCII definition.

$ ls -l /usr/share/locale/en_US.UTF-8/LC_COLLATE
lrwxr-xr-x 1 root wheel 28 Oct  1 14:24 /usr/share/locale/en_US.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE

I found the actual definition in the sources from Apple. The content of file la_LN.US-ASCII.src is the following:

order \
    \x00;...;\xff

2nd UPDATE:

I've further tested the wcsxfrm() function on OSX. Using the la_LN.US-ASCII collation, given a sequence of wide characters C1..Cn as input, the output is a string of this form:

W1..Wn \x01 U1..Un

where

Wx = 0x103 if Cx > 0xFF else Cx+0x3
Ux = Cx+0x103 if Cx > 0xFF else Cx+0x3

Using this algorithm, \x10fefd becomes 0x103 0x1 0x110000.
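
The same rules can be written as a short Python sketch. This is only a model of the behavior observed above, not Apple's implementation, and the function name `osx_wcsxfrm_model` is made up for this example. It works on lists of integers because the result may contain values that a Python str cannot hold.

```python
def osx_wcsxfrm_model(code_points):
    """Model the observed OSX output: W1..Wn, a 0x01 separator, then U1..Un."""
    first_pass = [0x103 if c > 0xFF else c + 0x3 for c in code_points]
    second_pass = [c + 0x103 if c > 0xFF else c + 0x3 for c in code_points]
    return first_pass + [0x01] + second_pass


key = osx_wcsxfrm_model([0x10FEFD])
print([hex(v) for v in key])            # ['0x103', '0x1', '0x110000']
print(any(v > 0x10FFFF for v in key))   # True: the key leaves the Unicode range

# Under this model, the smallest input whose shifted value exceeds U+10FFFF:
first_bad = 0x110000 - 0x103
print(hex(first_bad))                   # 0x10fefd
```

Under this model, every input character from U+10FEFD upward ends up outside the valid range, and U+10FEFD is exactly the character used in the question.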

I've checked, and every UTF-8 locale on OSX uses this collation definition, so I'm inclined to say that collation support for UTF-8 on Apple systems is broken. The resulting ordering is almost the same as the one obtained with plain byte comparison, with the added "bonus" of being able to produce illegal Unicode characters.

mnencia
  • Huh. So it sounds like there is not much I can do to prevent the `ValueError` since this is coming from the underlying C library, outside of Python's control. – SethMMorton Nov 14 '15 at 06:01
  • I wonder if this would be considered a bug. Assuming that `0x110000` is a valid return value for `wcsxfrm()` then Python should internally be able to handle it, correct? However, if `0x110000` is not valid then I suppose what Python is doing would be "correct". – SethMMorton Nov 14 '15 at 06:11
  • It seems as though this came up 4 years ago: https://mail.python.org/pipermail/python-dev/2011-December/114759.html and http://bugs.python.org/issue13441. From my eyes it doesn't look like they found a solution to the errors for values >= `0x110000`, but the consensus was they definitely don't want them. – SethMMorton Nov 14 '15 at 06:42
  • I've updated the answer with further information on how the `wcsxfrm()` works. My conclusion is that OSX collate support is definitely broken for UTF-8 encodings. – mnencia Nov 14 '15 at 23:06
  • I did a test, and I can use `strcoll` instead of `strxfrm` on these characters without any `ValueError`; this is because `wcscoll` just returns an `int` rather than a transformed unicode string so this unicode outside of range issue is internal to the C library only. It's too bad though because using `strcoll` has a performance hit for large datasets. – SethMMorton Nov 15 '15 at 19:08
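
The strcoll workaround from the last comment could look something like the sketch below. The helper name `locale_sorted` is invented for this example. It sorts with functools.cmp_to_key(locale.strcoll), so no transformed key is ever turned into a Python string, at the cost of one C-level comparison per pair instead of one transformation per item.

```python
import functools
import locale


def locale_sorted(strings):
    """Sort by the current LC_COLLATE rules without building strxfrm() keys."""
    return sorted(strings, key=functools.cmp_to_key(locale.strcoll))


words = ["pear", "apple", "\U0010fefd", "fig"]
# ascii() keeps the output printable regardless of the terminal encoding
print(ascii(locale_sorted(words)))
```

Call locale.setlocale() beforehand to choose the collation rules. The sketch leaves the locale untouched so that it runs the same way on systems where 'en_US.UTF-8' is not installed.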