Is there a better way to do this?

$ python
Python 2.7.9 (default, Jul 16 2015, 14:54:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-55)] on linux2

Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub(u'[\U0001d300-\U0001d356]', "", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fast/services/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/home/fast/services/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range
chaimp
  • [This question](http://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode) seems similar but neither answer seems to work for this. – Barmar Jul 24 '15 at 05:56
  • You need to install wide build for that to work (probably). In narrow build, you have to write the regex like how you would do in JS (match 16-bit code units) – nhahtdh Jul 24 '15 at 06:02
  • The code works fine for me as is. I am also using Python 2.7.9 but mine is compiled for Debian (Stable/Jessie) with GCC 4.9.2, as opposed to the OP's RedHat with GCC 4.1.2. – John1024 Jul 24 '15 at 06:24

1 Answer

Python narrow and wide build (Python versions below 3.3)

The error suggests that you are using a "narrow" (UCS-2) build, which only supports Unicode code points up to 65535 (U+FFFF) as one "Unicode character"1. Characters whose code points are above 65535 are represented as surrogate pairs, which means that the Unicode string u'\U0001d300' consists of two "Unicode characters" in a narrow build.

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import sys; sys.maxunicode
65535
>>> len(u'\U0001d300')
2
>>> [hex(ord(i)) for i in u'\U0001d300']
['0xd834', '0xdf00']

In a "wide" (UCS-4) build, every code point up to 1114111 (U+10FFFF) is recognized as one "Unicode character", so the Unicode string u'\U0001d300' consists of exactly one "Unicode character" (one Unicode code point).

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import sys; sys.maxunicode
1114111
>>> len(u'\U0001d300')
1
>>> [hex(ord(i)) for i in u'\U0001d300']
['0x1d300']

1 I use "Unicode character" (in quotes) to refer to one character in a Python Unicode string, not one Unicode code point. The number of "Unicode characters" in a string is the len() of the string. In a "narrow" build, one "Unicode character" is a 16-bit code unit of UTF-16, so one astral character appears as two "Unicode characters". In a "wide" build, one "Unicode character" always corresponds to one Unicode code point.
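
The surrogate values seen in a narrow build follow from the UTF-16 encoding arithmetic. A minimal sketch, assuming a hypothetical helper named to_surrogates (it is not part of Python or of the answer above):

```python
def to_surrogates(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % code_point)
    offset = code_point - 0x10000      # 20-bit offset into the astral planes
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


# Same code point as in the sessions above
print([hex(unit) for unit in to_surrogates(0x1D300)])  # ['0xd834', '0xdf00']
```

This is only the arithmetic; a narrow build performs the split itself when it stores the string.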

Matching astral plane characters with regex

Wide build

The regex in the question compiles correctly in "wide" build:

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import re; re.compile(u'[\U0001d300-\U0001d356]', re.DEBUG)
in
  range (119552, 119638)
<_sre.SRE_Pattern object at 0x7f9f110386b8>

Narrow build

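However, the same regex won't work in a "narrow" build, since the regex engine does not recognize surrogate pairs. It treats every 16-bit code unit as a separate character, so inside the character class it reads \ud834 as a literal and then tries to create a character range from \udf00 to \ud834. That range is rejected because its start is greater than its end, hence "bad character range".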

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> [hex(ord(i)) for i in u'[\U0001d300-\U0001d356]']
['0x5b', '0xd834', '0xdf00', '0x2d', '0xd834', '0xdf56', '0x5d']

The workaround is the same one used in ECMAScript (ES5): construct the regex so that it matches the surrogate code units that represent the code points.

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import re; re.compile(u'\ud834[\udf00-\udf56]', re.DEBUG)
literal 55348
in
  range (57088, 57174)
<_sre.SRE_Pattern object at 0x6ffffe52210>
>>> input =  u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> input
u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> re.sub(u'\ud834[\udf00-\udf56]', '', input)
u'Sample . Another . Leave alone \U00011000'
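
If the same code has to run on both kinds of build, the pattern can be chosen at runtime from sys.maxunicode. This is a rough sketch rather than a recommended design; the names WIDE_PATTERN, NARROW_PATTERN and compile_for_build are made up for illustration, and the two patterns are the ones shown above.

```python
import re
import sys

# Hypothetical constant names; the patterns come from the sessions above.
WIDE_PATTERN = u'[\U0001d300-\U0001d356]'    # one code point per character
NARROW_PATTERN = u'\ud834[\udf00-\udf56]'    # high surrogate + low surrogate range


def compile_for_build():
    """Compile whichever pattern suits the running interpreter."""
    if sys.maxunicode > 0xFFFF:
        # Wide build, or Python 3.3 and above
        return re.compile(WIDE_PATTERN)
    # Narrow build: match the UTF-16 code units instead
    return re.compile(NARROW_PATTERN)


text = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
cleaned = compile_for_build().sub(u'', text)
```

Only the branch that matches the running build is ever compiled, so the wide pattern never reaches the regex engine of a narrow build.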

Using regexpu to derive astral plane regex for Python narrow build

Since the construction to match astral plane characters in a Python narrow build is the same as in ES5, you can use regexpu, a tool that converts ES6 RegExp literals to ES5, to do the conversion for you.

Just paste the equivalent regex in ES6 (note the u flag and \u{hh...h} syntax):

/[\u{1d300}-\u{1d356}]/u

and you get back a regex that can be used in both Python narrow builds and ES5:

/(?:\uD834[\uDF00-\uDF56])/

Remember to remove the / delimiters of the JavaScript RegExp literal when you use the regex in Python.

The tool is especially useful when the range spans multiple high surrogates (U+D800 to U+DBFF). For example, if we have to match the character range

/[\u{105c0}-\u{1cb40}]/u

The equivalent regex in Python narrow build and ES5 is

/(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])/

which is rather complex and error-prone to derive.
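
To show where the three alternatives come from, here is a minimal sketch of the same decomposition in Python. It is not regexpu's implementation; the function name astral_range_to_surrogate_regex is hypothetical, and it only handles a single range whose bounds are both astral code points.

```python
def astral_range_to_surrogate_regex(start, end):
    """Return regex text matching code points start..end as surrogate pairs.

    Hypothetical helper; both bounds must lie in U+10000..U+10FFFF.
    """
    if not 0x10000 <= start <= end <= 0x10FFFF:
        raise ValueError("expected 0x10000 <= start <= end <= 0x10FFFF")

    def split(code_point):
        offset = code_point - 0x10000
        return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

    def unit(value):
        return '\\u%04X' % value

    def span(first, last):
        if first == last:
            return unit(first)
        return '[%s-%s]' % (unit(first), unit(last))

    high1, low1 = split(start)
    high2, low2 = split(end)
    if high1 == high2:
        # Whole range sits under a single high surrogate
        parts = [unit(high1) + span(low1, low2)]
    else:
        parts, tail = [], None
        if low1 != 0xDC00:   # partial block under the first high surrogate
            parts.append(unit(high1) + span(low1, 0xDFFF))
            high1 += 1
        if low2 != 0xDFFF:   # partial block under the last high surrogate
            tail = unit(high2) + span(0xDC00, low2)
            high2 -= 1
        if high1 <= high2:   # high surrogates whose full block is covered
            parts.append(span(high1, high2) + span(0xDC00, 0xDFFF))
        if tail is not None:
            parts.append(tail)
    return '(?:' + '|'.join(parts) + ')'


print(astral_range_to_surrogate_regex(0x105C0, 0x1CB40))
```

The returned text uses \uXXXX escapes, like the regexpu output, so it is meant to be pasted into a u'...' literal rather than passed to re.compile directly.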

Python version 3.3 and above

Python 3.3 implements PEP 393, which eliminates the distinction between narrow and wide builds; from that version on, Python always behaves like a wide build. This eliminates the problem in the question altogether.

Compatibility issues

While it's possible to work around the limitation and match astral plane characters in Python narrow builds, going forward it's best to change the execution environment to a Python wide build, or to port the code to Python 3.3 or above.

The regex code for narrow build is hard to read and maintain for average programmers, and it has to be completely rewritten when porting to Python 3.
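
If the code may still end up on a narrow build, one option is to detect that at startup with the sys.maxunicode check shown earlier and warn about it. A small sketch, assuming hypothetical function names and warning text:

```python
import sys
import warnings


def is_narrow_build():
    """True when the interpreter stores astral characters as surrogate pairs."""
    return sys.maxunicode == 0xFFFF


def warn_if_narrow_build():
    """Emit a RuntimeWarning on narrow builds; return whether one was emitted."""
    if is_narrow_build():
        warnings.warn(
            "narrow (UCS-2) Python build: astral characters are stored "
            "as surrogate pairs, so wide-build regexes will not work",
            RuntimeWarning,
        )
        return True
    return False
```

Calling warn_if_narrow_build() once at import time of the module that holds the regexes is probably enough.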

nhahtdh
  • wow, I wish I could give more than one upvote for this incredibly thorough answer. This is *very* helpful! Thank you – chaimp Jul 24 '15 at 14:26
  • After considering the excellent work-arounds suggested, I ended up re-installing python with wide builds. For reference, this is what that looked like: `$ ./configure --enable-unicode=ucs4 --prefix=/my/path --exec-prefix=/my/path` `$ make && make install` I am going to include something that checks for the wide build version and warns the user, using your suggestion to check sys.maxunicode. Much appreciated! – chaimp Jul 24 '15 at 14:49
  • Thank you for this great answer. I had a similar issue and ended up adding some code inside the function that does the regex substitution to check `sys.maxunicode` and in case of narrow builds replace all surrogate pairs (regex `[\uD800-\uDBFF][\uDC00-\uDFFF]`) with whatever I wanted the replacement for the "actual" unicode character to be. – ShreevatsaR Nov 22 '16 at 19:21