Using Python 2.7, re

I'm trying to compile Unicode character classes. I can get it to work with 4-digit ranges (u'\uxxxx') but not with 8-digit ranges (u'\Uxxxxxxxx').

The following works:

re.compile(u'[\u0010-\u0012]')

The following does not:

re.compile(u'[\U00010000-\U00010001]')

The resultant error is:

Traceback (most recent call last): File "", line 1, in File "C:\Python27\lib\re.py", line 190, in compile return _compile(pattern, flags) File "C:\Python27\lib\re.py", line 242, in _compile raise error, v # invalid expression error: bad character range

It appears to be an issue only with 8-digit ranges, since compiling a single 8-digit character works:

re.compile(u'\U00010000')

Separate question: I am new to Stack Overflow and I am really struggling with how to post questions. I would expect that Traceback to appear on multiple lines, not on one line. I would also like to be able to paste in content copied from the interpreter, but this UI makes a mess out of '>>>'.

I don't know how to add this in a comment, so I'm editing the question.

The expression I really want to compile is:

re.compile(u'[\U00010000-\U0010FFFF]')

Expanding it with list(u'[\U00010000-\U0010FFFF]') makes extending the suggested workaround look pretty intractable:

>>> list(u'[\U00010000-\U0010FFFF]')
[u'[', u'\ud800', u'\udc00', u'-', u'\udbff', u'\udfff', u']']
Ted Speers
  • As to your Stack Overflow issues... yes, copying and pasting is not as friendly as you might like. To paste in code from the interpreter, you need to place 4 spaces before each line to mark it as code. It annoyed me enough that I just wrote a Python script to add the spaces automatically: `def stackify_string(input_string): return '\n'.join(['    ' + x for x in input_string.split('\n')])` – Nick Bailey Feb 28 '15 at 15:38
  • @NickBailey You could just select the code and then click the `{ }` button on the toolbar to indent it. – kennytm Feb 28 '15 at 15:57
  • What is the value of `sys.maxunicode`? – Ignacio Vazquez-Abrams Feb 28 '15 at 16:09
  • Related: [Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"](http://stackoverflow.com/q/12636489) and [remove unicode emoji using re in python](http://stackoverflow.com/q/26568722) – Martijn Pieters Feb 28 '15 at 16:23
  • sys.maxunicode was 65535. I changed it to 4294967295L, but the expression still does not compile. – Ted Speers Feb 28 '15 at 16:45
  • @kennytm... and I just earned my embarrassing ignorance SO badge! – Nick Bailey Feb 28 '15 at 17:59
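
On the `sys.maxunicode` comments above: a minimal sketch, assuming only the standard `sys` module, of how to check which kind of build is in use. The helper name `is_narrow_build` is made up for illustration. The value reflects how the interpreter was compiled; assigning a new value to `sys.maxunicode` only rebinds the attribute and does not change how strings are stored, which would explain why the expression still failed to compile.

```python
import sys


def is_narrow_build():
    """Return True when unicode strings are stored as UTF-16 code units.

    Hypothetical helper name; it only reads sys.maxunicode.
    """
    return sys.maxunicode == 0xFFFF


# On a narrow build an astral character occupies two code units (a surrogate
# pair), so len() reports 2. On a wide build, and on Python 3.3+, it is 1.
astral_length = len(u'\U00010000')

print("narrow build: %s, len(u'\\U00010000') = %d"
      % (is_narrow_build(), astral_length))
```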

1 Answer

Depending on the compilation option, Python 2 may store Unicode strings as UTF-16 code units (a so-called narrow build), so \U00010000 is actually a string of two code units, a surrogate pair:

>>> list(u'[\U00010000-\U00010001]')
[u'[', u'\ud800', u'\udc00', u'-', u'\ud800', u'\udc01', u']']

The regex parser thus sees a character class containing the range \udc00-\ud800, whose end is lower than its start, hence the "bad character range" error. In this setting I can't think of a solution other than to match the surrogate pairs explicitly (after ensuring sys.maxunicode == 0xffff):

>>> r = re.compile(u'\ud800[\udc00-\udc01]')
>>> r.match(u'\U00010000')
<_sre.SRE_Match object at 0x10cf6f440>
>>> r.match(u'\U00010001')
<_sre.SRE_Match object at 0x10cf4ed98>
>>> r.match(u'\U00010002')
>>> r.match(u'\U00020000')
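
For larger ranges, the same idea can probably be generalized. The following is only a sketch under stated assumptions, not part of the standard library: `to_surrogates` and `astral_range_pattern` are hypothetical helper names, and the resulting pattern is meant for narrow builds, where text is already stored as surrogate pairs. It splits both endpoints into surrogate pairs and emits up to three alternatives: a partial first high surrogate, the full high surrogates in between, and a partial last high surrogate.

```python
import re

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3 fallback so the sketch runs on both


def to_surrogates(cp):
    """Split an astral code point (U+10000..U+10FFFF) into (high, low)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not an astral code point: %r" % (cp,))
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def astral_range_pattern(lo, hi):
    """Build a pattern matching the surrogate pairs for lo..hi inclusive."""
    if lo > hi:
        raise ValueError("empty range")
    h1, l1 = to_surrogates(lo)
    h2, l2 = to_surrogates(hi)
    if h1 == h2:
        # Both ends share a high surrogate: one class on the low surrogate.
        return u'%s[%s-%s]' % (unichr(h1), unichr(l1), unichr(l2))
    # First high surrogate: from the low end up to the last low surrogate.
    parts = [u'%s[%s-%s]' % (unichr(h1), unichr(l1), unichr(0xDFFF))]
    if h2 - h1 > 1:
        # High surrogates strictly in between accept every low surrogate.
        parts.append(u'[%s-%s][%s-%s]' % (unichr(h1 + 1), unichr(h2 - 1),
                                          unichr(0xDC00), unichr(0xDFFF)))
    # Last high surrogate: from the first low surrogate up to the high end.
    parts.append(u'%s[%s-%s]' % (unichr(h2), unichr(0xDC00), unichr(l2)))
    return u'(?:%s)' % u'|'.join(parts)


r = re.compile(astral_range_pattern(0x10000, 0x10001))
```

On a wide build, or on Python 3, an astral character is a single code point, so a pattern built this way would not match ordinary text there; a plain `[\U00010000-\U00010001]` class compiles directly in that case.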
kennytm
  • Thanks for your insight ... I edited the original question to include the expression I'm really trying to compile (u'[\U00010000-\U0010FFFF]') ... I'd be surprised if your workaround is scalable. – Ted Speers Feb 28 '15 at 17:52
  • @TedSpeers: This regex is the same as the one in http://stackoverflow.com/questions/12636489/python-convert-4-byte-char-to-avoid-mysql-error-incorrect-string-value. You could use `u'[\uD800-\uDBFF][\uDC00-\uDFFF]'` as stated in the answer of that question. – kennytm Mar 01 '15 at 06:52
  • Thanks kennytm ... I wouldn't have figured that out any time soon. – Ted Speers Mar 09 '15 at 21:27
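
Putting the thread together, one way to get a pattern for all characters above U+FFFF that works on either kind of build might look like the sketch below. The function name `astral_pattern` is hypothetical; the two regex strings are the ones quoted in the question and in the comment above, and the build check relies on `sys.maxunicode`.

```python
import re
import sys


def astral_pattern():
    """Compile a pattern for one character in U+10000..U+10FFFF.

    Hypothetical helper: picks the pattern form that suits the build.
    """
    if sys.maxunicode > 0xFFFF:
        # Wide build (or Python 3.3+): astral characters are single code
        # points, so the direct range compiles.
        return re.compile(u'[\U00010000-\U0010FFFF]')
    # Narrow build: match a high surrogate followed by a low surrogate.
    return re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


pattern = astral_pattern()
# Example use: replace every astral character with a placeholder.
cleaned = pattern.sub(u'?', u'a\U0001F600b')
```

The narrow-build branch matches any surrogate pair, which is exactly the range U+10000 to U+10FFFF, so both branches should accept the same set of characters.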