Why does re.sub replace the entire pattern, not just a capturing group within it?

Question

re.sub('a(b)','d','abc') yields dc, not adc.

Why does re.sub replace the entire pattern, instead of just the capturing group '(b)'?

You do not use it in the substitution part, so what do you expect? If you want to replace a "b" preceded by an "a", you need either `re.sub('ab','ad','abc')` or `re.sub('(a)b',r'\1d','abc')`, where `"\1"` refers to the capturing group. — DYZ, Feb 08 '17 at 04:18
Thanks! Expected that capturing group is replaced by default. The right approach looks less intuitive, but probably more flexible. — Nick, Feb 08 '17 at 04:24
@Nick: but the `re.sub` doc says it does exactly that, no mention of capturing groups: *"**replacing the leftmost non-overlapping occurrences of the pattern** in string"* — smci, Jul 20 '19 at 19:19

score 40 · Accepted Answer · answered Feb 08 '17 at 04:21

Because it's supposed to replace the whole occurrence of the pattern:

Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.

If it were to replace only some subgroup, then complex regexes with several groups wouldn't work. There are several possible solutions:

Specify pattern in full: re.sub('ab', 'ad', 'abc') - my favorite, as it's very readable and explicit.
Capture the groups that you want to preserve and then refer to them in the replacement (note that it should be a raw string to avoid escaping the backslash): re.sub('(a)b', r'\1d', 'abc')
Similar to the previous option: provide a callback function as the repl argument and make it process the Match object and return the required result.
Use lookbehinds/lookaheads, which are not included in the match but affect matching: re.sub('(?<=a)b', r'd', 'abxb') yields adxb. The ?<= at the beginning of the group says "it's a lookbehind".
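As a minimal sketch, the four options listed above can be compared side by side on the question's input. The helper name `swap_b` is only an illustration and is not part of the `re` module.

```python
import re

text = "abc"

# 1. Spell out the full pattern and the full replacement.
full = re.sub("ab", "ad", text)

# 2. Capture the part to keep and re-insert it with a backreference.
backref = re.sub("(a)b", r"\1d", text)

# 3. Callback: build the replacement from the Match object.
def swap_b(match):  # hypothetical helper name
    return match.group(1) + "d"

callback = re.sub("(a)b", swap_b, text)

# 4. Lookbehind: "a" must precede "b" but is not part of the match.
lookbehind = re.sub("(?<=a)b", "d", text)

print(full, backref, callback, lookbehind)  # adc adc adc adc
```

All four produce the same result here; they differ mainly in how much of the surrounding text has to be restated in the replacement.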

Just a quick tip: you can use `\1` **in your regex**: `re.match(r'([la]{2})-\1', 'la-la')`. It'll match what the referenced group (`1` in this case) **matched** (not its pattern), so this regex wouldn't match `la-al`, for example. — math2001, Feb 08 '17 at 05:21
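The tip in this comment can be checked with a short sketch; the variable names `same` and `different` are only illustrative.

```python
import re

pattern = re.compile(r"([la]{2})-\1")

# Group 1 matches "la", and the same text "la" follows the dash.
same = pattern.match("la-la")

# "al" would fit [la]{2}, but \1 requires the exact text group 1 matched ("la").
different = pattern.match("la-al")

print(same.group(0))  # la-la
print(different)      # None
```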

score 4 · smci · Answer 2 · 2021-01-07T16:23:31.933

Because that's exactly what re.sub() doc tells you it's supposed to do:

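the pattern 'a(b)' says "match 'a' immediately followed by 'b', and capture the 'b' as group 1". (The parentheses only capture; they do not restrict the replacement to the group, and there is no way the pattern could ever match 'b' on its own as you seem to expect. If you wanted the 'a' to be optional, you could write a non-greedy (a)??b.)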
the replacement-string is 'd'
hence on your string 'abc', it matches all of 'ab' and replaces it with 'd', thus result is 'dc'
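A small sketch of the steps above, inspecting the Match object directly: group 0 is the whole match, which is what `re.sub` replaces, while group 1 is only a labelled part of it. The variable names are illustrative.

```python
import re

m = re.search("a(b)", "abc")

whole, whole_span = m.group(0), m.span(0)   # the text re.sub will replace
inner, inner_span = m.group(1), m.span(1)   # the captured part inside it

print(whole, whole_span)  # ab (0, 2)
print(inner, inner_span)  # b (1, 2)

# The pattern cannot match a "b" that has no "a" in front of it.
no_match = re.search("a(b)", "bc")
print(no_match)  # None

result = re.sub("a(b)", "d", "abc")
print(result)  # dc
```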

Note that a non-greedy optional group '(a)??' does not give 'adc' on this input either: the leftmost match still starts at the 'a', so the whole 'ab' is replaced:

>>> re.sub('(a)??b','d','abc')
'dc'

edited Jan 07 '21 at 16:23

answered Jul 20 '19 at 19:22

smci

@Basj: we asked the OP several times, and as far as I can see they only want an explanation why the capture group isn't present in the output, not a fix. – smci Jan 07 '21 at 13:25
@Basj: as you can see from comments, several of us have been asking the OP what they want for 4 years. And they never said *BADOUTPUT/GOODOUTPUT*, that's your label. They asked for an explanation why it works the way it does. Which I answered. I even tell them how to get what they want with one possible regex - see my last line. – smci Jan 07 '21 at 14:28

Mr Buisson · Answer 3 · 2021-02-02T20:29:33.867

I'm aware that this does not strictly answer the OP's question, but this question can be hard to google (results are flooded by \1 explanations ...)

For those who, like me, came here because they'd like to actually replace a capture group that is not the first one with a string, without special knowledge of the string or of the regex:

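# assumes: regex is a compiled pattern (re.compile(...)) and groupNb is the index of the group to replace
# find offsets (start, end) of the captured group within the string
r = regex.search(oldText).span(groupNb)
# slice the old string and insert replacementText in the middle
newText = oldText[:r[0]] + replacementText + oldText[r[1]:]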

I know it's the wanted behavior, but I still do not understand why re.sub can't specify the actual capture group that it should substitute on...
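One possible way to get that behavior, sketched under the assumption that every match should have the given group replaced: wrap `re.sub` with a callback that rebuilds each match around the group's span. The function name `sub_group` and its parameters are hypothetical, not part of the `re` module, and the replacement is inserted literally (no `\1` processing).

```python
import re

def sub_group(pattern, group, replacement, text):
    """Replace only `group` inside each match, keeping the rest of the match."""
    def rebuild(match):
        start, end = match.span(group)
        if start == -1:  # the group did not participate in this match
            return match.group(0)
        offset = match.start()   # spans are relative to the whole string
        whole = match.group(0)
        return whole[:start - offset] + replacement + whole[end - offset:]
    return re.sub(pattern, rebuild, text)

print(sub_group("a(b)", 1, "d", "abc"))         # adc
print(sub_group(r"(a)(b)", 2, "d", "abc ab"))   # adc ad
```

Unlike the slicing snippet above, this handles every match in the string rather than only the first one.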

score 2 · Answer 4 · answered May 16 '18 at 07:49

import re

pattern = re.compile(r"I am (\d{1,2}) .*", re.IGNORECASE)

text = "i am 32 years old"

if re.match(pattern, text):
    print(
        re.sub(pattern, r"You are \1 years old.", text, count=1)
    )

As shown above, we first compile a regex pattern with the case-insensitive flag.

Then we check whether the text matches the pattern; if it does, we reference the only group in the pattern (the age) as \1 in the replacement.

answered May 16 '18 at 07:49

Zilong Li

Good example. However, you don't need the check for `if re.match(...)`. If there is no match, the `re.sub` call is essentially a no op. – Yu Chen Apr 12 '22 at 03:29
[Docs for re.sub](https://docs.python.org/3/library/re.html): "Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. **If the pattern isn’t found, string is returned unchanged**" – Yu Chen Apr 12 '22 at 03:30
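A short sketch of the point made in these two comments, reusing the pattern from the answer; the second input string is made up for illustration.

```python
import re

pattern = re.compile(r"I am (\d{1,2}) .*", re.IGNORECASE)
replacement = r"You are \1 years old."

# The pattern is found: the whole match is replaced.
matched = re.sub(pattern, replacement, "i am 32 years old", count=1)

# The pattern is not found: the input comes back unchanged.
unmatched = re.sub(pattern, replacement, "hello there", count=1)

print(matched)    # You are 32 years old.
print(unmatched)  # hello there
```

One difference to keep in mind: `re.match` only matches at the start of the string, while `re.sub` searches anywhere in it, so dropping the check changes behavior for inputs where the pattern occurs later in the text.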
