12

From my understanding,

(.)(?<!\1)

should never match. Actually, php's preg_replace even refuses to compile this and so does ruby's gsub. The python re module seems to have a different opinion though:

import re
test = 'xAAAAAyBBBBz'
print (re.sub(r'(.)(?<!\1)', r'(\g<0>)', test))

Result:

(x)AAAA(A)(y)BBB(B)(z)

Can anyone provide a reasonable explanation for this behavior?

Update

This behavior appears to be a limitation in the re module. The alternative regex module seems to handle groups in assertions correctly:

import regex

test = 'xAAAAAyBBBBz'

print (regex.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
## xAAAAAyBBBBz

print (regex.sub(r'(.)(.)(?<!\1)', r'(\g<0>)', test))
## (xA)AAA(Ay)BBB(Bz)

Note that unlike pcre, regex also allows variable-width lookbehinds:

print (regex.sub(r'(.)(?<![A-Z]+)', r'(\g<0>)', test))
## (x)AAAAA(y)BBBB(z)

Eventually, regex is going to be included in the standard library, as mentioned in PEP 411.

Community
  • 1
  • 1
georg
  • 211,518
  • 52
  • 313
  • 390

1 Answers1

6

This does look like a limitation (nice way of saying "bug", as I learned from a support call with Microsoft) in the Python re module.

I guess it has to do with the fact that Python does not support variable-length lookbehind assertions, but it's not clever enough to figure out that \1 will always be fixed-length. Why it doesn't complain about this when compiling the regex, I can't say.

Funnily enough:

>>> print (re.sub(r'.(?<!\0)', r'(\g<0>)', test))
(x)(A)(A)(A)(A)(A)(y)(B)(B)(B)(B)(z)
>>>
>>> re.compile(r'(.*)(?<!\1)') # This should trigger an error but doesn't!
<_sre.SRE_Pattern object at 0x00000000026A89C0>

So better don't use backreferences in lookbehind assertions in Python. Positive lookbehind isn't much better (it also matches here as if it was a positive lookahead):

>>> print (re.sub(r'(.)(?<=\1)', r'(\g<0>)', test))
x(A)(A)(A)(A)Ay(B)(B)(B)Bz

And I can't even guess what's going on here:

>>> print (re.sub(r'(.+)(?<=\1)', r'(\g<0>)', test))
x(AA)(A)(A)Ay(BB)(B)Bz
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • Thanks, this confirms my feelings about this being a bug. – georg Apr 24 '12 at 11:56
  • Python is apparently not the only language that has problems with backreference in lookbehind assertions: http://stackoverflow.com/questions/2734977/backreferences-in-lookbehind/2735611#2735611 – Tim Pietzcker May 02 '12 at 06:06