Unicode, regular expressions and PyPy

Question

I wrote a program to add (limited) Unicode support to Python regexes. It works fine on CPython 2.5.2, but not on PyPy (~~1.5.0-alpha0~~ 1.8.0, implementing Python ~~2.7.1~~ 2.7.2), both running on Windows XP (Edit: as seen in the comments, @dbaupp could run it fine on Linux). I have no idea why, but I suspect it has something to do with my use of u"..." and ur"..." string literals. The full source is here, and the relevant bits are:

# -*- coding:utf-8 -*-
import re

# Regexps to match characters in the BMP according to their Unicode category.
# Extracted from Unicode specification, version 5.0.0, source:
# http://unicode.org/versions/Unicode5.0.0/
unicode_categories = {
    ur'Pi':ur'[\u00ab\u2018\u201b\u201c\u201f\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c]',
    ur'Sk':ur'[\u005e\u0060\u00a8\u00af\u00b4\u00b8\u02c2-\u02c5\u02d2-\u02df\u02...',
    ur'Sm':ur'[\u002b\u003c-\u003e\u007c\u007e\u00ac\u00b1\u00d7\u00f7\u03f6\u204...',
    ...
    ur'Pf':ur'[\u00bb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d]',
    ur'Me':ur'[\u0488\u0489\u06de\u20dd-\u20e0\u20e2-\u20e4]',
    ur'Mc':ur'[\u0903\u093e-\u0940\u0949-\u094c\u0982\u0983\u09be-\u09c0\u09c7\u0...',
}

def hack_regexp(regexp_string):
    for (k,v) in unicode_categories.items():
        regexp_string = regexp_string.replace((ur'\p{%s}' % k),v)
    return regexp_string

def regex(regexp_string,flags=0):
    """Shortcut for re.compile that also translates the pattern and adds the UNICODE flag

    Example usage:
        >>> from unicode_hack import regex
        >>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
        >>> print result.group(0)
        áÇñ
        >>> 
    """
    return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)

(On PyPy, the "Example usage" produces no match, so result is None.)
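To separate the regex logic from any encoding question, it can help to run a reduced version of the substitution that contains only ASCII source text. The sketch below is not the original module: `demo_categories`, `expand_categories` and `demo_regex` are hypothetical names, and the two character classes are deliberately tiny stand-ins for the real Unicode tables. The test string is written with escapes, so neither the source file encoding nor the console encoding is involved.

```python
import re

# Hypothetical, heavily truncated tables (Latin-1 letters only),
# standing in for the full Unicode 5.0.0 category data.
demo_categories = {
    u'Ll': u'[a-z\u00df-\u00f6\u00f8-\u00ff]',
    u'L': u'[A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff]',
}

def expand_categories(pattern):
    """Replace each \\p{Name} token with the matching character class."""
    for name, char_class in demo_categories.items():
        pattern = pattern.replace(u'\\p{%s}' % name, char_class)
    return pattern

def demo_regex(pattern, flags=0):
    """Compile the expanded pattern with the UNICODE flag added."""
    return re.compile(expand_categories(pattern), flags | re.UNICODE)

# The subject string is spelled with escapes instead of literal characters.
match = demo_regex(u'^\\p{Ll}\\p{L}*').match(u'\xe1\xc7\xf1123')
matched_text = match.group(0) if match else None
print(repr(matched_text))
```

If this reduced version matches on both interpreters, the substitution and the regex engine are probably not where the two diverge.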

Reiterating, the program works fine (on CPython): the Unicode data seems correct, the replace works as intended, the usage example runs ok (both via doctest and directly typing it in the command line). The source file encoding is also correct, and the coding directive in the header seems to be recognized by Python.

Any ideas about what PyPy does differently that is breaking my code? Many things came to mind (unrecognized coding header, different encodings in the command line, different interpretations of r and u), but as far as my tests go, CPython and PyPy seem to behave identically, so I'm clueless about what to try next.
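One way to compare the two interpreters on the "different encodings" suspicion is to print what each one reports for its streams and locale. This is only a diagnostic sketch using standard-library calls; `describe_encodings` is a hypothetical helper name, and the values it prints depend entirely on the machine and console it runs in.

```python
import locale
import sys

def describe_encodings():
    """Collect the encodings the interpreter reports for I/O and the locale."""
    return {
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        'default': sys.getdefaultencoding(),
        'filesystem': sys.getfilesystemencoding(),
        'locale': locale.getpreferredencoding(),
    }

# Run this under each interpreter and compare the output line by line.
for name, value in sorted(describe_encodings().items()):
    print('%-10s %s' % (name, value))
```

A difference in the `stdin` or `stdout` line between CPython and PyPy on the same console would point at console handling rather than at the regex code.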

Is there any particular reason you are using such an old unstable version of PyPy? (The latest stable version is 1.8.) — huon, May 06 '12 at 13:53
Also, the example given works fine for me using `[PyPy 1.8.0 with GCC 4.4.3] on linux2`. So it looks like the thing to try next is upgrade your PyPy. — huon, May 06 '12 at 13:59
@dbaupp uh... because that is what's installed on my machine? (hey, it was new when I installed it...) Now, seriously, I just upgraded it to 1.8.0 and I'm still getting the same results. Since you managed to make it work on Linux, maybe the problem is restricted to Windows then. I'll investigate further. – mgibsonbr, May 06 '12 at 14:05
Ah, drat. (For the record, it worked straight off for me, I just downloaded the file, ran your example and it was all fine.) — huon, May 06 '12 at 14:12
Doesn't PyPy have an open bug tracking system? Have you tried searching it? — Karl Knechtel, May 06 '12 at 15:00
@KarlKnechtel Will do. That's probably right (a bug in PyPy): after the problem was narrowed down to Windows systems, I made further tests (see my answer below) and noticed inconsistent behavior between Unicode data entered in the command line (not shown: same when printing to screen) and the same data read from a file, properly loaded using `codecs`. – mgibsonbr, May 06 '12 at 15:12

score 7 · Answer 1 · answered May 06 '12 at 22:13

Why aren’t you simply using Matthew Barnett’s super-recommended `regex` module instead?

It works on both Python 3 and legacy Python 2, is a drop-in replacement for re, handles all the Unicode stuff you could want, and a whole lot more.
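As a rough illustration of that suggestion, the pattern from the question can be passed to the third-party `regex` package unchanged, because it understands `\p{...}` natively. This sketch assumes the package is installed (`pip install regex`) and falls back to `None` when it is not; `leading_letters` is a hypothetical helper name.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None

def leading_letters(text):
    """Return the leading lowercase-letter-plus-letters run, or None."""
    if regex is None:
        return None  # package not installed; nothing to demonstrate
    match = regex.match(u'^\\p{Ll}\\p{L}*', text, regex.UNICODE)
    return match.group(0) if match else None

# Escapes keep the example independent of source and console encodings.
result = leading_letters(u'\xe1\xc7\xf1123')
print(repr(result))
```

No hand-built category tables are needed with this approach, though it would not by itself change how the console decodes typed characters.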

tchrist

Sure, I considered using other regex engines (like Ponyguruma for instance), I may go with your suggestion in the end, thanks! But the problem here turned out to be not about regexes, but unicode support on PyPy on Windows (of course, when I asked the question I didn't know what it was, so a problem with regexes was a possibility). BTW just saw that the [bug report](https://bugs.pypy.org/issue1139) has been confirmed. – mgibsonbr May 07 '12 at 01:45

mgibsonbr · Accepted Answer · 2012-05-06T16:28:52.947

It seems PyPy has some encoding problems, both when reading a source file (an unrecognized coding header, maybe) and when reading from or writing to the command line. I replaced my example code with the following:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
>>> print result.group(0) == u'áÇñ'
True
>>>

And it kept working on CPython and failing on PyPy. Replacing "áÇñ" with its escaped form, u'\xe1\xc7\xf1', on the other hand, did the trick:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'\xe1\xc7\xf1123')
>>> print result.group(0) == u'\xe1\xc7\xf1'
True
>>>

That worked fine on both. I believe the problem is restricted to these two scenarios (source loading and command line), since opening a UTF-8 file using codecs.open works fine. When I type the string "áÇñ" in the command line, or when I load the source code of "unicode_hack.py" using codecs, I get the same result on CPython:

>>> u'áÇñ'
u'\xe1\xc7\xf1'
>>> import codecs
>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'

but different results on PyPy:

>>>> u'áÇñ'
u'\xa0\u20ac\xa4'
>>>> import codecs
>>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'
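The PyPy console result above is consistent with a code-page mismatch, although that is an inference and not something confirmed here: if the console delivers the typed characters as bytes in the OEM code page cp850 and the interpreter decodes them as the ANSI code page cp1252, the exact same three code points come out. The sketch below reproduces that round trip with the standard codecs; both code-page names are assumptions about the Windows XP setup.

```python
# The three characters from the example, written as escapes.
typed = u'\xe1\xc7\xf1'

# Assumption: the console encodes keyboard input with the OEM code page.
console_bytes = typed.encode('cp850')

# Assumption: the interpreter decodes those bytes with the ANSI code page.
misread = console_bytes.decode('cp1252')

print(repr(console_bytes))
print(repr(misread))
```

The misread value is u'\xa0\u20ac\xa4', matching what PyPy echoed, which would explain why escaped literals and `codecs.open` are unaffected while typed input is not.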

Update: Issue 1139 has been submitted to the PyPy bug tracker; let's see how that turns out...