A Python script of mine started misbehaving in recent versions. I tracked it down to an re substitution that behaves differently in Python <= 3.6 vs. >= 3.7: newer Python versions make the substitution twice.

Did something break in Python's re module, or have I been doing something wrong all along and finally got caught?

As I understand it, the regex r'[^_]*$' in the example code below should match everything after the last underscore, or the whole string if there is no underscore.

In the following example, Python 3.6 produces s == 'a_Z', whereas Python 3.7 produces 'a_ZZ':

$ docker run --rm  python:3.6-alpine python -c "import re;s=re.sub(r'[^_]*$','Z','a_b');assert s == 'a_Z',s"

$ docker run --rm  python:3.7-alpine python -c "import re;s=re.sub(r'[^_]*$','Z','a_b');assert s == 'a_Z',s"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: a_ZZ

The same error occurs with 3.8-alpine and 3.9-rc-buster.

Mark Borgerding
  • There are several *"Changed in version 3.7"* in https://docs.python.org/3/library/re.html, did you review those? – jonrsharpe Apr 16 '20 at 11:57

1 Answer

Per re.sub:

Changed in version 3.7: Empty matches for the pattern are replaced when adjacent to a previous non-empty match.

Your pattern has two matches in 'a_b' because * allows it to match the empty string: the b, and an empty match immediately after it. You can see this in e.g. Regex101, or using re.findall:

>>> re.findall(r'[^_]*$', 'a_b')
['b', '']

If you switch from * to + (i.e. r'[^_]+$'), the pattern can no longer match the empty string and you'll get the expected result.
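As a rough illustration of the two matches and of the + fix, here is a minimal sketch. The variable names (text, spans, fixed, unchanged) are only for this example, and the commented outputs assume Python 3.7 or later.

```python
import re

text = 'a_b'

# List each match of the original pattern together with its (start, end) span.
spans = [(m.group(), m.span()) for m in re.finditer(r'[^_]*$', text)]
print(spans)  # on Python 3.7+: [('b', (2, 3)), ('', (3, 3))]

# With + the pattern needs at least one character, so the empty match
# at the end of the string disappears and only one substitution is made.
fixed = re.sub(r'[^_]+$', 'Z', text)
print(fixed)  # a_Z

# Caveat: because + cannot match the empty string, a string that ends
# in an underscore has no match at all and is returned unchanged.
unchanged = re.sub(r'[^_]+$', 'Z', 'a_')
print(unchanged)  # a_
```

Note that the last case differs from the original * pattern, which would still replace the empty tail after a trailing underscore. Whether that matters depends on the inputs the script has to handle.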

jonrsharpe
  • @MarkBorgerding that would make sense, anything that removes or ignores the empty match. – jonrsharpe Apr 16 '20 at 12:07
  • So it looks like re.sub is now quite error prone with patterns that can match the empty string. – Mark Borgerding Apr 16 '20 at 12:08
  • Adding `count=1` to re.sub limits the substitution, but is there any way to know that the one substitution will be the longest (i.e. not the empty string)? – Mark Borgerding Apr 16 '20 at 12:09
  • If you don't want to replace empty matches, write patterns that don't give empty matches (as I suggest in the answer, `r'[^_]+$'` would work). I wouldn't describe it as error prone, that change was a bugfix. – jonrsharpe Apr 16 '20 at 12:16
  • The rationale is at https://bugs.python.org/issue32308 – Mark Borgerding Apr 16 '20 at 12:20
  • Interestingly, this change also broke an example in the Python docs. Look for "abxd" in https://docs.python.org/2/howto/regex.html#search-and-replace and https://docs.python.org/3/howto/regex.html#search-and-replace – Mark Borgerding Apr 16 '20 at 12:23
  • I tend to side with the naysayers at https://bugs.python.org/issue32308 : Anders Hovmöller ("those others are broken") and David Barnett ("baffled by the new behavior"). – Mark Borgerding Apr 16 '20 at 12:31
  • @MarkBorgerding please take that up on the bug tracker if you want; SO isn't a place to relitigate language changes, and I don't want to get into a discussion about whether that was a "correct" change. – jonrsharpe Apr 16 '20 at 12:33
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/211801/discussion-between-mark-borgerding-and-jonrsharpe). – Mark Borgerding Apr 16 '20 at 12:39
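
On the `count=1` question raised in the comments, a minimal sketch follows. For this particular pattern the first match should be the long one, because re.sub scans from the left, * is greedy, and the empty match only occurs at the very end of the string. The helper names replace_tail and replace_tail_no_regex are hypothetical, and the str.rpartition variant is an alternative technique that was not suggested in the thread.

```python
import re


def replace_tail(text, repl='Z'):
    """Replace everything after the last underscore, or the whole string."""
    # count=1 stops after the first (leftmost) match. The greedy * takes the
    # full tail there, so the trailing empty match is never substituted.
    return re.sub(r'[^_]*$', repl, text, count=1)


def replace_tail_no_regex(text, repl='Z'):
    """Same idea without a regex, using str.rpartition."""
    # rpartition splits on the last underscore; if there is none,
    # head and sep are both empty strings.
    head, sep, _tail = text.rpartition('_')
    return head + sep + repl


for sample in ['a_b', 'abc', 'a_', 'a_b_c', '']:
    print(repr(sample), replace_tail(sample), replace_tail_no_regex(sample))
```

Unlike the + version of the pattern, both helpers still replace an empty tail, so 'a_' becomes 'a_Z', which matches what the original pattern did on Python 3.6.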