Assuming your code is written for Python 3, it works correctly in Python 3.3 and above, and fails with the same error message in Python 3.2.x and below.
Solution
The simplest solution is to run your code in Python 3.3 or above, and add a version guard to prevent lower versions of Python from running your code.
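A version guard could look something like the sketch below. The helper name require_python and the wording of the error message are my own choices for illustration; only sys.version_info comes from the standard library.

```python
import sys

def require_python(minimum=(3, 3), current=None):
    """Raise RuntimeError if the interpreter is older than `minimum`.

    `current` is a hypothetical override used for testing; by default
    the running interpreter's version is checked.
    """
    if current is None:
        current = sys.version_info
    # Compare only (major, minor); tuples compare element by element.
    if tuple(current[:2]) < tuple(minimum):
        raise RuntimeError(
            'Python %d.%d or above is required, found %d.%d'
            % (minimum[0], minimum[1], current[0], current[1]))

# Call this before compiling any regex that relies on \u in a raw string.
require_python()
```

Placing the call at the top of the module makes the script stop with a clear message instead of silently compiling a different regex.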
The second solution is to use a regular Unicode string literal, in which Unicode escape sequences are recognized and processed. The drawback of this method is that you have to mind the escape sequences and double up the \ when necessary, especially in the case of \b, which is interpreted as a backspace character in a regular Unicode string literal before it reaches re.compile.
# Python 3.2.5 (default, Jul 25 2014, 14:13:17)
>>> print('[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])')
[ࠀ-龥_a-zA-Z0-9_]+(?=\s\[pixiv\])
>>> import re
>>> re.compile('[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])', re.DEBUG)
max_repeat 1 4294967295
in
range (2048, 40869)
literal 95
range (97, 122)
range (65, 90)
range (48, 57)
literal 95
assert 1
in
category category_space
literal 91
literal 112
literal 105
literal 120
literal 105
literal 118
literal 93
<_sre.SRE_Pattern object at 0x6001fad70>
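To illustrate the \b caveat mentioned above, here is a small sketch comparing a single and a doubled backslash in a regular string literal. The sample text and variable names are made up for the demonstration.

```python
import re

# In a regular string literal, '\b' is the backspace character U+0008,
# so the regex engine never sees a word-boundary assertion.
backspace = re.compile('\bpixiv\b')

# Doubling the backslash passes the two characters '\' and 'b' to re,
# which then treats them as a word boundary.
boundary = re.compile('\\bpixiv\\b')

text = 'title [pixiv]'  # hypothetical input
print('\b' == '\x08')                  # True
print(backspace.search(text))          # None
print(boundary.search(text).group())   # pixiv
```

The same doubling applies to any other backslash sequence that should reach the regex engine untouched.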
By the way, you might want to review your character range \u0800-\u9fa5, since it also matches Arabic, Devanagari, Thai, Lao, Box Drawing, Symbols, etc.
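A quick way to see how broad the range is: the sketch below tests a few sample characters against it. The narrower range \u4e00-\u9fa5 is only an assumption about what you might want (it starts at the CJK Unified Ideographs block and would exclude kana, for instance), so adjust it to your actual data.

```python
import re

broad = re.compile('[\u0800-\u9fa5]')
# Assumption: a range starting at the CJK Unified Ideographs block.
narrow = re.compile('[\u4e00-\u9fa5]')

# One sample character per script, chosen for illustration.
samples = {
    'Devanagari': '\u0915',
    'Thai': '\u0e01',
    'Box Drawing': '\u2500',
    'CJK ideograph': '\u4e2d',
}

for name, ch in samples.items():
    print(name, bool(broad.match(ch)), bool(narrow.match(ch)))
```

All four samples match the broad range, while only the CJK ideograph matches the narrower one.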
Explanation
Unicode escape sequences \u and \U in raw Unicode string
In Python 3, Unicode escape sequences \u and \U are not treated specially in raw Unicode string literals, as specified in Python 3.0. The specification of string literals was updated in Python 3.3 to add the u prefix for easier maintenance of Python 2 code, but it doesn't change the parsing behavior for raw Unicode strings:
# Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)] on win32
>>> r'[\u8000]'
'[\\u8000]'
>>> '[\u8000]'
'[耀]'
This is in contrast with Python 2, where Unicode escape sequences are processed into the corresponding Unicode characters even in a raw Unicode string literal:
# Python 2.7.8 (default, Jul 25 2014, 14:04:36)
>>> print(u'\u8000')
耀
>>> print(ur'\u8000')
耀
Therefore, this is the string containing the regex in the question, as seen by the regex engine in Python 3:
>>> print(r'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])')
[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])
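If you want to check the difference between the two kinds of literal programmatically in Python 3, comparing lengths is enough. This is just a sketch with variable names of my own choosing.

```python
# Raw literal: the backslash, 'u' and the four hex digits stay as six
# separate characters inside the brackets.
raw = r'[\u8000]'

# Regular literal: the escape is processed into a single character.
cooked = '[\u8000]'

print(len(raw), len(cooked))   # 8 3
print(raw == '[\\u8000]')      # True
print(raw == cooked)           # False
```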
Support for Unicode escape sequences \u and \U in the re package
Before Python 3.3, the re package doesn't support the \u and \U Unicode escape sequences, as seen in the documentation for Python 3.2. As a result, \u and \U are interpreted as matching literal u and U.
Adding the re.DEBUG flag, you can see the resulting structure of the compiled regex. I annotate part of the output for clarity:
# Python 3.2.5 (default, Jul 25 2014, 14:13:17)
>>> import re
>>> re.compile(r'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])', re.DEBUG)
max_repeat 1 4294967295
in
literal 117 # u (\u)
literal 48 # 0
literal 56 # 8
literal 48 # 0
range (48, 117) # 0-u (0-\u)
literal 57 # 9
literal 102 # f
literal 97 # a
literal 53 # 5
literal 95
range (97, 122)
range (65, 90)
range (48, 57)
literal 95
assert 1
in
category category_space
literal 91
literal 112
literal 105
literal 120
literal 105
literal 118
literal 93
<_sre.SRE_Pattern object at 0x600178850>
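If you don't have Python 3.2 at hand, the misparsed character class can be emulated on a newer version by removing the backslashes before u, which is effectively what the old engine did according to the debug output above. This is only an emulation under that assumption, not the original engine.

```python
import re

# Emulation of what Python 3.2 compiled: "\u" is just "u", so "0-\u"
# becomes the range "0-u" (code points 48 to 117).
misparsed = re.compile(r'[u0800-u9fa5_a-zA-Z0-9_]+')

# A CJK character is not in the class at all.
print(misparsed.match('\u4e2d'))                 # None

# Punctuation between '0' and 'u' is matched, which was never intended.
print(misparsed.fullmatch('<=>') is not None)    # True
```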
Python 3.3 finally added support for Unicode escape sequences in the re package, so the regex works correctly from that version onward:
# Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)] on win32
>>> re.compile(r'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])', re.DEBUG)
max_repeat 1 2147483647
in
range (2048, 40869) # \u0800-\u9fa5
literal 95
range (97, 122)
range (65, 90)
range (48, 57)
literal 95
assert 1
in
category category_space
literal 91
literal 112
literal 105
literal 120
literal 105
literal 118
literal 93
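As a final check on Python 3.3 or above, the raw-string pattern from the question can be run against some input. The sample titles below are made up; substitute your real data.

```python
import re

# Requires Python 3.3+, where re itself understands \uXXXX.
pattern = re.compile(r'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])')

# Hypothetical titles for illustration.
m = pattern.search('\u521d\u97f3\u30df\u30af [pixiv]')
print(m.group() == '\u521d\u97f3\u30df\u30af')   # True

# Without the trailing " [pixiv]" the lookahead never succeeds.
print(pattern.search('no marker here'))          # None
```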