
On Python 3.7 (tested on 64-bit Windows), replacing a string using the regex .* produces the replacement string twice!

On Python 3.7.2:

>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)(replacement)'

On Python 3.6.4:

>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)'

On Python 2.7.5 (32 bits):

>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)'

What is wrong? How can I fix it?

Tomerikoo
Laurent LAPORTE
  • Clearly a bug. Not sure what kind of answer you're hoping to get. – Aran-Fey Feb 15 '19 at 16:42
  • Only happens with `.*` (or `.*$`), not with `.+` or `^.*`. And, well, you have an *infinite* number of zero-byte strings at the end of your match, so you might as well be glad that you get only one repetition. :) – Charles Duffy Feb 15 '19 at 16:43
  • There's not much SO can do about this, take a look at https://bugs.python.org to see if it's reported yet. – jonrsharpe Feb 15 '19 at 16:43
  • Looks like it's an intentional change: [*"Yes, this is an intended change. Your pattern matches an empty string at the end of the input string. It was a bug in earlier Python versions that re.sub() didn't replace empty matches adjacent to a previous non-empty match."*](https://bugs.python.org/issue34982) – jonrsharpe Feb 15 '19 at 16:46
  • @jonrsharpe Wow, that's *intentional*?! That's *ridiculous*. Guess it's best to avoid using regex in python from now on... – Aran-Fey Feb 15 '19 at 16:50
  • @Aran-Fey, ...how's it ridiculous? An empty string legitimately matches `.*`. – Charles Duffy Feb 15 '19 at 16:51
  • @CharlesDuffy So why does it only match twice? An empty string could match any number of times. Makes no sense whatsoever. The two logical choices would be 1 match or an infinite number of matches. Making it match twice is completely arbitrary and not logical at all. – Aran-Fey Feb 15 '19 at 16:53
  • @Aran-Fey, since `.*` is greedy, I expect to get '(replacement)' only once. Why two? – Laurent LAPORTE Feb 15 '19 at 16:56
  • @Aran-Fey, ...maybe this comes from spending too much time writing C, but I'm a big believer that developers should work from the language specification, not from observed behavior. If you're writing code that depends on implementation artifacts rather than documented guarantees, you're Doing It Wrong (and future versions of the language runtime are within their rights to cause surprises). – Charles Duffy Feb 15 '19 at 16:56
  • Is it really a bug in Python though? Even the [PCRE regex](https://regex101.com/r/X1Fvuy/2/) behaves exactly the same. Also note the [Python variant](https://regex101.com/r/X1Fvuy/1/) behaves the same on regex101 as well, but they could be using `3.7.2` also... oddly enough if you had *nothing* in there, the replacement only happens once. I'm guessing beginning of string `^` and end of string `$` counts as two empty space characters? – r.ook Feb 15 '19 at 17:14
  • @Aran-Fey it's matching twice because the first match is the entire string, and the second match is the empty "end of string" string. https://regex101.com/r/X1Fvuy/1/ – r.ook Feb 15 '19 at 17:22
  • @Idlehands Why does the first match, against the entire string, not include the zero-length end of string? I know that some other engines (not all, not even all that are “PCRE”) do the same but it’s really unintuitive and, I claim, virtually *never* the intended behaviour. – Konrad Rudolph Feb 15 '19 at 17:52
  • @KonradRudolph Good question, I was merely the observer of this behaviour but due to your question I looked deeper. I believe it has more to do with how the engine was designed; [two relevant (Looking inside the Regex Engine)](https://www.regular-expressions.info/anchors.html) searches [I've found (The Greedy Trap)](https://www.rexegg.com/regex-quantifiers.html#greedytrap) illustrate how the engine backtracks upon the match. Since `.*` literally matches any empty space, it backtracked on the empty space and therefore gave one last match at the end. That's how I understood it anyhow. – r.ook Feb 15 '19 at 18:05
  • @Idlehands This is not about backtracking, but about how non-zero-width matches never consume the zero-width slot that follows the match. – blhsing Feb 15 '19 at 18:05
  • @blhsing Thanks for the clarification, shows how little I really understand about regex :) – r.ook Feb 15 '19 at 18:06

2 Answers


This is not a bug, but a bug fix in Python 3.7 from the commit fbb490fd2f38bd817d99c20c05121ad0168a38ee.

In regex, a non-zero-width match moves the pointer to the end of the match, so that the next match attempt, zero-width or not, continues from the position following the match. In your example, after .* greedily matches and consumes the entire string, the pointer sits at the end of the string, which still leaves "room" for a zero-width match at that position. This is evident from the following code, which behaves the same in Python 2.7, 3.6 and 3.7:

>>> re.findall(".*", 'sample text')
['sample text', '']

So the bug fix, which is about replacement of a zero-width match right after a non-zero-width match, now correctly replaces both matches with the replacement text.
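To see where the two matches sit, here is a minimal sketch that prints the span of each match and records which matches re.sub() actually hands to a replacement callback. The names `tag` and `calls` are illustrative and not part of the original answer; the number of recorded calls depends on the interpreter version, as described above.

```python
import re

text = "sample text"

# Each match of ".*" with its (start, end) span: the whole string,
# then a zero-width match at the very end.
spans = [m.span() for m in re.finditer(".*", text)]
print(spans)  # [(0, 11), (11, 11)]

# Record every match that re.sub() passes to the replacement callback.
calls = []

def tag(match):
    calls.append(match.span())
    return "(replacement)"

result = re.sub(".*", tag, text)
print(result)
print(calls)  # one span per replacement actually performed
```

If `calls` contains both spans, the interpreter replaces the trailing empty match as well, which is the behaviour the question observed on 3.7.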

blhsing

This is a common regex issue; it affects a lot of regex flavors.

There are several ways to fix the issue:

  • Add anchors on both sides of .*: re.sub("^.*$", "(replacement)", "sample text")
  • Since you only want to match the line once, add the count=1 argument: re.sub(".*", "(replacement)", "sample text", count=1)
  • In case you want to replace any non-empty line, replace * with +: re.sub(".+", "(replacement)", "sample text")

See the Python demo:

import re
# Adding anchors:
print( re.sub("^.*$", "(replacement)", "sample text") ) # => (replacement)
# Using the count=1 argument
print( re.sub(".*", "(replacement)", "sample text", count=1) ) # => (replacement)
# If you want to replace non-empty lines:
print( re.sub(".+", "(replacement)", "sample text") ) # => (replacement)
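The three fixes agree on "sample text" but are not interchangeable on other inputs. The sketch below compares them on an empty string and on a two-line string; the helper names (`with_anchors`, `with_count`, `with_plus`) and the extra sample inputs are assumptions added for illustration, not part of the original answer.

```python
import re

REPLACEMENT = "(replacement)"

def with_anchors(text):
    # Without re.MULTILINE, ^ matches only at the start of the string and
    # $ only at the end (or just before a single trailing newline).
    return re.sub("^.*$", REPLACEMENT, text)

def with_count(text):
    # count=1 stops after the first match, so the trailing empty match
    # is never replaced.
    return re.sub(".*", REPLACEMENT, text, count=1)

def with_plus(text):
    # .+ needs at least one character, so it cannot match an empty string.
    return re.sub(".+", REPLACEMENT, text)

for sample in ["sample text", "", "line one\nline two"]:
    print(repr(sample))
    print("  anchors:", repr(with_anchors(sample)))
    print("  count=1:", repr(with_count(sample)))
    print("  plus:   ", repr(with_plus(sample)))
```

On the empty string only `.+` leaves the input untouched. On the two-line string the anchored pattern matches nothing (since `.` does not cross the newline), count=1 replaces only the first line, and `.+` replaces each line separately.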
Wiktor Stribiżew