
I believe that re.sub() replaces the full match, but in this case I only want to replace the text matched by the capturing groups and leave the text matched by the non-capturing groups untouched. How can I do this?

import re

string = 'aBCDeFGH'

print(re.sub('(a)?(?:[A-Z]{3})(e)?(?:[A-Z]{3})', '+', string))

The output is:

+

The expected output is:

+BCD+FGH
Darwin
  • Try [`re.sub('[ae]([A-Z]{3})', r'+\1', 'aBCDeFGH')`](http://rextester.com/CUOY83316) – Wiktor Stribiżew Mar 28 '18 at 07:06
  • Try `re.sub('(a)?([A-Z]{3})(e)?([A-Z]{3})', r'+\2+\4', string)` – Sohaib Farooqi Mar 28 '18 at 07:08
  • That's the way `re.sub` works. If you want to keep portions of the original string, you can always put them in the replacement string using groups. – Giacomo Alzetta Mar 28 '18 at 07:09
  • Also, an alternative is to use a lookahead: `re.sub(r'[a-z](?=[A-Z]{3})', '+', string)`. This matches a single lowercase character only if it is followed by three uppercase ones, and replaces it with `+`, which is what you want. – Giacomo Alzetta Mar 28 '18 at 07:12

1 Answer

The general solution for such problems is to pass a function (for example, a lambda) as the replacement:

import re

string = 'aBCDeFGH'

print(re.sub('(a)?([A-Z]{3})(e)?([A-Z]{3})', lambda match: '+%s+%s' % (match.group(2), match.group(4)), string))

However, as bro-grammer has commented, you can use backreferences in this case:

print(re.sub('(a)?([A-Z]{3})(e)?([A-Z]{3})', r'+\2+\4', string))
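If you would rather keep the non-capturing groups from the question and replace only what the capturing groups matched, the function-as-replacement idea can be generalized. The following is only a sketch: `replace_groups` is a hypothetical helper name, not part of the `re` module, and it assumes the capturing groups are not nested and appear left to right in the pattern.

```python
import re

def replace_groups(pattern, replacement, text):
    """Replace only the spans matched by capturing groups.

    Hypothetical helper; assumes groups are non-nested and ordered.
    """
    def _repl(match):
        offset = match.start()      # where the full match begins in text
        whole = match.group(0)      # text of the full match
        pieces = []
        cursor = 0                  # position inside `whole` copied so far
        for index in range(1, match.re.groups + 1):
            g_start, g_end = match.span(index)
            if g_start == -1:       # optional group did not participate
                continue
            pieces.append(whole[cursor:g_start - offset])  # keep untouched text
            pieces.append(replacement)                     # swap the group
            cursor = g_end - offset
        pieces.append(whole[cursor:])  # keep the tail of the match
        return ''.join(pieces)

    return re.sub(pattern, _repl, text)

pattern = r'(a)?(?:[A-Z]{3})(e)?(?:[A-Z]{3})'
result = replace_groups(pattern, '+', 'aBCDeFGH')
print(result)  # +BCD+FGH
```

Unlike the `r'+\2+\4'` replacement, this version inserts a `+` only where an optional group actually matched, so `'BCDeFGH'` becomes `'BCD+FGH'` rather than `'+BCD+FGH'`. Which behavior is wanted depends on the use case.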
pts
  • Thanks! This solved my problem. The Python documentation never mentions being able to use a lambda function in re.sub(). – Darwin Mar 28 '18 at 07:18
  • @Darwin From [the docs](https://docs.python.org/3/library/re.html#re.sub): _"repl can be a string or a function"_. There's even an example. – Aran-Fey Mar 28 '18 at 07:25
  • For a fuller answer, another solution would be to use non-consuming groups (lookaheads and lookbehinds), as Giacomo stated. – Veltzer Doron Mar 28 '18 at 08:23
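The lookaround approach mentioned in the comments can be sketched as follows. Lookaheads and lookbehinds check the surrounding text without including it in the match, so `re.sub` replaces only the single character that was actually matched. The variable names here are illustrative, and the patterns assume the same `'aBCDeFGH'` input as the question.

```python
import re

text = 'aBCDeFGH'

# Lookahead: replace a lowercase letter only if three uppercase letters follow.
lookahead_result = re.sub(r'[a-z](?=[A-Z]{3})', '+', text)
print(lookahead_result)   # +BCD+FGH

# Lookbehind: replace a lowercase letter only if three uppercase letters precede it.
# Python requires a fixed-width lookbehind; [A-Z]{3} always has width 3.
lookbehind_result = re.sub(r'(?<=[A-Z]{3})[a-z]', '+', text)
print(lookbehind_result)  # aBCD+FGH
```

Note that the lookahead pattern is looser than the original one: it replaces any lowercase letter followed by three uppercase letters, not only `a` and `e` in that exact arrangement.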