Python 2 and 3 're.sub' inconsistency

Question

I am writing a Python function to split numbers and some other things out of text. The code looks something like this:

EN_EXTRACT_REGEX = '([a-zA-Z]+)'
NUM_EXTRACT_REGEX = '([0-9]+)'
AGGR_REGEX = EN_EXTRACT_REGEX + '|' + NUM_EXTRACT_REGEX

entry = re.sub(AGGR_REGEX, r' \1\2', entry)

This code works perfectly fine in Python 3, but under Python 2 it fails with an "unmatched group" error.

The problem is that I need to support both versions, and I could not get it to work in Python 2 despite trying various other approaches.

What is the root cause of this problem, and is there a workaround for it?

mhawke · Accepted Answer · 2017-08-15T12:09:07.213

The problem is that the pattern is an alternation: any single match comes from either EN_EXTRACT_REGEX or NUM_EXTRACT_REGEX, never both.

When re.sub() matches alphabetic characters via the first subpattern, it still tries to expand the \2 reference in the replacement. That fails because only the first group took part in the match; the second group is unmatched.

Similarly, when the digit subpattern matches, group 1 is unmatched, so expanding \1 fails as well.

You can see that this is the case with this test in Python 2:

>>> re.sub(AGGR_REGEX, r' \1', 'abcd')    # reference first pattern
' abcd'
>>> re.sub(AGGR_REGEX, r' \2', 'abcd')    # reference second pattern
Traceback (most recent call last):
....
sre_constants.error: unmatched group

The difference must lie in how the regex engine handles this case in Python 2 versus Python 3. I cannot give a definitive reason for the difference, but the documentation for re.sub() does record a change in version 3.5 regarding unmatched groups:

Changed in version 3.5: Unmatched groups are replaced with an empty string.

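This explains why the code works in Python 3.5 and later but not in earlier versions, including Python 2: from 3.5 on, a reference to an unmatched group simply expands to an empty string instead of raising an error.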
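
To make "unmatched group" concrete, here is a minimal sketch that inspects the match objects directly, using the two-group pattern from the question. The names m_alpha and m_digit are only illustrative. This part behaves the same in Python 2 and Python 3; only the expansion of the replacement template differs.

```python
import re

# The original pattern from the question: two groups joined by alternation
AGGR_REGEX = '([a-zA-Z]+)|([0-9]+)'

# Only one side of the alternation can take part in any single match,
# so the other group is reported as None (unmatched).
m_alpha = re.search(AGGR_REGEX, 'abcd')
m_digit = re.search(AGGR_REGEX, '1234')

print(m_alpha.groups())  # ('abcd', None)
print(m_digit.groups())  # (None, '1234')
```

A replacement template such as r' \1\2' therefore always refers to one group whose value is None.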

As a workaround, you can change your pattern so that both alternatives sit inside a single group, which then takes part in every match:

import re

EN_EXTRACT_REGEX = '[a-zA-Z]+'
NUM_EXTRACT_REGEX = '[0-9]+'
AGGR_REGEX = '(' + EN_EXTRACT_REGEX + '|' + NUM_EXTRACT_REGEX + ')'
# AGGR_REGEX is now '([a-zA-Z]+|[0-9]+)', a single group around the alternation

for s in '', '1234', 'abcd', 'a1b2c3', 'aa__bb__1122cdef', '_**_':
    print(re.sub(AGGR_REGEX, r' \1', s))

Output


 1234
 abcd
 a 1 b 2 c 3
 aa__ bb__ 1122 cdef
_**_
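
If changing the pattern is not an option, another possibility (not tested here on every interpreter version, so treat it as a sketch) is to avoid numbered groups in the replacement and refer to the whole match instead, either with \g<0> or with a function as the replacement. This assumes the goal is simply to put a space in front of every run of letters or digits; add_spaces and add_spaces_func are hypothetical helper names.

```python
import re

# The original two-group pattern from the question, left unchanged
AGGR_REGEX = '([a-zA-Z]+)|([0-9]+)'

def add_spaces(entry):
    # \g<0> refers to the entire match, which can never be "unmatched"
    return re.sub(AGGR_REGEX, r' \g<0>', entry)

def add_spaces_func(entry):
    # A callable replacement receives the match object;
    # group(0) is the whole matched substring
    return re.sub(AGGR_REGEX, lambda m: ' ' + m.group(0), entry)

print(add_spaces('a1b2c3'))                 # prints " a 1 b 2 c 3"
print(add_spaces_func('aa__bb__1122cdef'))  # prints " aa__ bb__ 1122 cdef"
```

Both variables rely only on the whole match, so they do not depend on how the interpreter treats unmatched groups.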
