I think the problem is that the regex pattern matches one or the other of the subpatterns EN_EXTRACT_REGEX and NUM_EXTRACT_REGEX, but never both.
When re.sub() matches alphabetic characters with the first subpattern, it tries to substitute the second group reference \2, which fails because only the first group matched; the second group took no part in the match.
Similarly, when the digit subpattern matches there is no \1 group to substitute, so that also fails.
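To make this concrete, here is a minimal sketch that lists the groups for each match. It assumes the original pattern wraps each subpattern in its own capturing group; the question's exact AGGR_REGEX is not shown here, so TWO_GROUP_REGEX and groups_per_match are illustrative names only.

```python
import re

# Assumption: the original pattern has one capturing group per alternative.
TWO_GROUP_REGEX = r'([a-zA-Z]+)|([0-9]+)'

def groups_per_match(text):
    """Return the (group 1, group 2) tuple for every match in text."""
    return [m.groups() for m in re.finditer(TWO_GROUP_REGEX, text)]

# Each match fills exactly one group; the other is None (unmatched).
print(groups_per_match('ab12'))
# [('ab', None), (None, '12')]
```

Whichever alternative matches, the other group is None, and that is the "unmatched group" the error message refers to.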
You can see that this is the case with this test in Python 2:
>>> re.sub(AGGR_REGEX, r' \1', 'abcd') # reference first pattern
abcd
>>> re.sub(AGGR_REGEX, r' \2', 'abcd') # reference second pattern
Traceback (most recent call last):
....
sre_constants.error: unmatched group
The difference must lie in the versions of the regex engine shipped with Python 2 and Python 3. I cannot give a definitive reason for it, but the documentation for re.sub() records a change in version 3.5 regarding unmatched groups:
Changed in version 3.5: Unmatched groups are replaced with an empty string.
This explains why it works in Python 3.5 and later but not in earlier versions: an unmatched group is now replaced with an empty string instead of raising an error.
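If you want to check which behaviour your interpreter has, a small probe along these lines may help. It is only a sketch: the two-group pattern is again an assumption about the original AGGR_REGEX, and sub_second_group is a hypothetical helper name.

```python
import re
import sys

# Assumption: the original pattern has one capturing group per alternative.
TWO_GROUP_REGEX = r'([a-zA-Z]+)|([0-9]+)'

def sub_second_group(text):
    """Substitute with a reference to group 2, reporting any regex error."""
    try:
        return re.sub(TWO_GROUP_REGEX, r' \2', text)
    except re.error as exc:
        # Expected on versions before 3.5 when group 2 did not match.
        return 'error: %s' % exc

print(sys.version_info[:2])
print(repr(sub_second_group('abcd')))  # group 2 is unmatched for this input
print(repr(sub_second_group('1234')))  # group 2 matched for this input
```

On Python 3.5 or later the first call should return a single space, because the unmatched \2 becomes an empty string; on earlier versions it should report the unmatched group error instead.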
As a workaround you can change your pattern to handle both matches as a single group:
import re
EN_EXTRACT_REGEX = '[a-zA-Z]+'
NUM_EXTRACT_REGEX = '[0-9]+'
AGGR_REGEX = '(' + EN_EXTRACT_REGEX + '|' + NUM_EXTRACT_REGEX + ')'
# ([a-zA-Z]+|[0-9]+)
for s in '', '1234', 'abcd', 'a1b2c3', 'aa__bb__1122cdef', '_**_':
    print(re.sub(AGGR_REGEX, r' \1', s))
Output (the first line, for the empty string, is blank; each matched run gains a leading space):

 1234
 abcd
 a 1 b 2 c 3
 aa__ bb__ 1122 cdef
_**_
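Two other options may be worth considering; both are sketches rather than tested fixes for your exact code. The first drops the capturing groups and refers to the whole match with \g<0>. The second keeps two groups (assumed to resemble your original pattern) and uses a function as the replacement, so no unmatched group reference is ever expanded. The function and constant names are made up for illustration.

```python
import re

# Option 1: no capturing groups; \g<0> stands for the entire match.
WHOLE_MATCH_REGEX = r'[a-zA-Z]+|[0-9]+'

def space_with_whole_match(text):
    return re.sub(WHOLE_MATCH_REGEX, r' \g<0>', text)

# Option 2: keep two groups (assumed original shape) and pick the one that matched.
TWO_GROUP_REGEX = r'([a-zA-Z]+)|([0-9]+)'

def space_with_callable(text):
    # Exactly one group is non-None per match, and it is never empty.
    return re.sub(TWO_GROUP_REGEX,
                  lambda m: ' ' + (m.group(1) or m.group(2)),
                  text)

for s in 'a1b2c3', '_**_':
    print(repr(space_with_whole_match(s)), repr(space_with_callable(s)))
```

Both functions should give the same results as the single-group pattern above, without depending on how a particular version treats unmatched groups.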