Using re.sub with capture groups to replace only a portion of a match

Question

I have some text

>>> import re
>>> text = 'wo__RF**81@t=(181,810)'

and I want to replace the 'wo__RF' portion with '', explicitly using regular expressions. This pattern:

>>> pattern = '\A([\w]+)[@+-/*]*'

Will match and pull out the characters to remove

>>> re.findall(pattern, text)
Out[6]: ['wo__RF']

But re.sub also removes the trailing operators:

>>> re.sub(pattern, '', text)
Out[7]: '81@t=(181,810)'

How would I make this output look like this?

Out[7]: '**81@t=(181,810)'

----edit----

Modifying the pattern to:

>>> pattern = '\A([\w]+)[@+-/*]*'

produces the same output

Out[7]: '81@t=(181,810)'

---- edit 2 ----

Remove the capture groups

>>> pattern = '\A[\w]+[@+/*-]*'
>>> re.sub(pattern, '', text)
Out[11]: '81@t=(181,810)'

Actually, `[@+-/*]` must be written as `[@+/*-]` as the `-` is creating a range. However, `\w+` matches `wo__RF` and `[@+/*-]*` will match `**`. Remove `*` from the character class? `re.sub(r'^\w+[@+/-]*', '', text)`? See [this regex demo](https://regex101.com/r/CK8Jmt/1). — Wiktor Stribiżew, Aug 14 '19 at 20:03
You use capture groups for the parts you want to keep, not what you want to remove. — Barmar, Aug 14 '19 at 20:03
If the solution from the top comment does not work for you, please explain what exactly you need to remove and why. — Wiktor Stribiżew, Aug 14 '19 at 20:07
@WiktorStribiżew Ah yes, of course - this isn't the answer to my question but would probably come back to bite me later. Thanks. — CiaranWelsh, Aug 14 '19 at 20:09
@WiktorStribiżew It seems strange that `re.sub()` is replacing something different from what `re.findall` returns. — Barmar, Aug 14 '19 at 20:09
@Barmar This is not at all strange. It is [*fine*](https://stackoverflow.com/a/31915134/3832970). So, I guess this is the answer. — Wiktor Stribiżew, Aug 14 '19 at 20:11
Since you're not using back-references, get rid of the capturing group and you'll see why this is happening. — Barmar, Aug 14 '19 at 20:14
When you have capturing groups, `re.findall` just returns the captured text, not the whole match. But `re.sub` replaces the entire match. — Barmar, Aug 14 '19 at 20:15
Why do you have `[@+-/*]*` in the pattern if you don't want to replace that? — Barmar, Aug 14 '19 at 20:16
@Barmar Because I still need to match an @ symbol or a numerical operator to distinguish it from other strings. — CiaranWelsh, Aug 14 '19 at 20:18
Thanks for the help guys, I should be able to figure it out from the question you've pointed me to. — CiaranWelsh, Aug 14 '19 at 20:19
Since you have `*` after that part of the pattern, it won't actually require one of those characters. It will just include them in the match (and remove them) if they're there. — Barmar, Aug 14 '19 at 20:22
If you need to require one of those characters after the match, use a lookahead. — Barmar, Aug 14 '19 at 20:22
@Barmar Excellent, a lookahead was what I was looking for. Feel free to post an answer and I'll accept. Thanks. — CiaranWelsh, Aug 14 '19 at 20:25
Why use lookahead for an optional pattern like `[@+/*-]*`? It makes no sense. If it is not optional, please explain what the pattern is like, the real requirements. — Wiktor Stribiżew, Aug 14 '19 at 20:27
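
The difference discussed in the comments above (`re.findall` returns the captured group, while `re.sub` replaces the whole match) can be seen by inspecting a match object directly. This is a minimal sketch that reuses the question's text with the corrected character class `[@+/*-]`; the variable names are only illustrative.

```python
import re

text = 'wo__RF**81@t=(181,810)'
pattern = r'\A(\w+)[@+/*-]*'

match = re.search(pattern, text)
whole = match.group(0)   # the entire match: this is what re.sub replaces
group1 = match.group(1)  # capture group 1: this is what re.findall returns here

print(whole)   # wo__RF**
print(group1)  # wo__RF

found = re.findall(pattern, text)
print(found)   # ['wo__RF']

result = re.sub(pattern, '', text)
print(result)  # 81@t=(181,810)
```

Because the trailing `**` is part of the whole match but outside the group, it disappears from the `re.sub` result even though `re.findall` never shows it.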

score 0 · Accepted Answer · answered Aug 14 '19 at 20:29

Use a lookahead to match part of the string without replacing it.

pattern = r'\A\w+(?=[@+\-/*])'

You don't need a capture group when you're just removing the match; it's needed if you need to copy parts of the input text into the result. You also don't need [] around \w. And you should get rid of the * after [@+\-/*], since you want to require one of those characters.
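
As a rough illustration of the points above, the sketch below applies the lookahead pattern to the text from the question and, for comparison, shows the capture-group alternative in which the operator is consumed and then copied back with a back-reference. The names `lookahead`, `capture`, and the sample string `'plain_name'` are assumptions made for the example.

```python
import re

text = 'wo__RF**81@t=(181,810)'

# Lookahead: require '@' or an operator after the word, without consuming it.
lookahead = re.compile(r'\A\w+(?=[@+\-/*])')
removed = lookahead.sub('', text)
print(removed)    # **81@t=(181,810)

# Capture group: consume one operator, then copy it back with \1.
capture = re.compile(r'\A\w+([@+\-/*])')
restored = capture.sub(r'\1', text)
print(restored)   # **81@t=(181,810)

# No '@' or operator follows the leading word, so nothing is removed.
unchanged = lookahead.sub('', 'plain_name')
print(unchanged)  # plain_name
```

Both forms give the same result here; the lookahead simply avoids having to put anything back.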

You should generally use raw strings when creating regular expressions, so that the regexp escape sequences won't be confused for Python escape sequences. And you should escape - in a character set, otherwise it's used to create a range of characters.
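
Both points can be checked with a short sketch. The first part uses `\b` (not part of the original pattern, chosen only because it is an escape that Python and the regex engine interpret differently); the second part compares the unescaped class from the question with the escaped and trailing-hyphen forms.

```python
import re

# In a normal string '\b' is a single backspace character; in a raw string
# it stays as backslash + 'b', which the regex engine reads as a word boundary.
normal_len = len('\b')                    # 1
raw_len = len(r'\b')                      # 2
no_match = re.search('\bcat', 'a cat')    # looks for backspace + 'cat'
boundary = re.search(r'\bcat', 'a cat')   # looks for 'cat' at a word boundary
print(normal_len, raw_len, no_match, boundary.group(0))

# Unescaped '-' between '+' and '/' makes a range covering + , - . /
unescaped = re.compile(r'[@+-/*]')
escaped = re.compile(r'[@+\-/*]')   # '-' escaped: literal hyphen
trailing = re.compile(r'[@+/*-]')   # '-' placed last: literal hyphen

comma_hits = [bool(p.fullmatch(',')) for p in (unescaped, escaped, trailing)]
hyphen_hits = [bool(p.fullmatch('-')) for p in (unescaped, escaped, trailing)]
print(comma_hits)   # [True, False, False]
print(hyphen_hits)  # [True, True, True]
```

The unescaped class still matches a hyphen, which is why the bug is easy to miss: it only shows up when a comma or period appears in the input.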

Barmar

Do not use `\-` inside a character class, it is bad practice. Use `[@+/*-]` or `[-@+/*]` – Wiktor Stribiżew Aug 14 '19 at 20:30
I believe most regexp engines allow you to escape it, you don't have to use the old method of putting it in a special place. – Barmar Aug 14 '19 at 20:32
You won't be able to use it in C++ `std::regex` (I do not remember if the compiler here makes any difference) nor in any POSIX regex like `sed`, etc. Most does not mean all, that is why I say *best practice* is to use it at the start or end of the character class. – Wiktor Stribiżew Aug 14 '19 at 20:33
I don't expect a Python answer to be used for C++. – Barmar Aug 14 '19 at 20:37