Regular expression to match everything after two words

Question

I've been trying to use regular expressions to remove a part of a string.

Heroes Chapter 91 - Rescue

I need to remove everything after "Chapter -number-". I can't just remove everything after the "-", because I'm not sure the title is always going to be "Heroes": if the title is "-New- Spiderman", it would remove the wrong part. The cut has to be anchored on "Chapter -number-". I don't know if I explained it well.

However, I've tried doing it like this:

title = "Heroes Chapter 91 - Rescue"
title = re.sub('Chapter \d+ (\D+)', '', title)

but it returns Heroes.

title = "Heroes Chapter 91 - Rescue"
title = re.sub('Chapter (\d+).*', '', title)

but it returns Heroes, again.

Any ideas?

PS: Someone linked me to this question, but I can't find the solution there. If someone sees it, please point it out. I'm clearly not an expert :)

Final solution:

title = "Heroes Chapter 91 - Rescue"
title = re.sub('(Chapter \d+).*', '\\1', title)

hwnd · Accepted Answer · 2014-06-02T19:06:24.273

3

You can use a capturing group ( ) here and reference the captured group in your replacement.

>>> re.sub('(Chapter \d+).*', '\\1', title)
'Heroes Chapter 91'
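As a minimal sketch of the same idea, the substitution can be wrapped in a small helper. The function name `strip_after_chapter` and the second sample title are made up for illustration; raw strings (`r'...'`) are used here so that `\d` and `\1` reach the `re` module unchanged.

```python
import re

def strip_after_chapter(title):
    # Group 1 captures "Chapter <number>"; .* consumes the rest of the line.
    # Replacing the whole match with group 1 drops everything after the number.
    return re.sub(r'(Chapter \d+).*', r'\1', title)

print(strip_after_chapter("Heroes Chapter 91 - Rescue"))
print(strip_after_chapter("-New- Spiderman Chapter 7 - Home"))
```

A title that contains no "Chapter <number>" is returned unchanged, because `re.sub` only touches text that matches.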

edited Jun 02 '14 at 19:06

answered Jun 02 '14 at 18:59

hwnd


Luis Masuelli · Answer 2 · 2014-06-02T19:11:01.580

1

Of course it will. re.sub replaces the matched part of the string. The matched part is "Chapter 91 - Rescue", since it completely matches the pattern 'Chapter \d+ (\D+)', and you replace it entirely with '', so it's removed. The only unmatched part is 'Heroes '.

You can match the same text again, but instead of replacing it with '' you can replace it with a part of the matched string:

re.sub('(Chapter \d+).*', '\\1', title)

With that, you keep only the subpattern between the parentheses and discard the rest. 'Heroes ' is not matched, so it is left untouched. 'Chapter 91 - Rescue' is matched, with the trailing .* (a greedy star that matches any character except a newline, up to the end of the line) covering the part after the chapter number. From that match, only 'Chapter 91' is kept, because it is captured by the first (and only) group, and \1 puts it back in place of the whole match. That's how you end up with 'Heroes ' + 'Chapter 91', discarding the trailing part (the actual chapter title).
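To make the explanation above concrete, here is a small sketch that inspects the match object instead of substituting. The variable names are only illustrative.

```python
import re

title = "Heroes Chapter 91 - Rescue"
m = re.search(r'(Chapter \d+).*', title)

matched = m.group(0)            # the whole match: 'Chapter 91 - Rescue'
kept = m.group(1)               # the captured group: 'Chapter 91'
untouched = title[:m.start()]   # text before the match: 'Heroes '

# re.sub with r'\1' effectively rebuilds the string like this:
result = untouched + kept
print(repr(matched), repr(kept), repr(untouched), repr(result))
```

Replacing the match with '' leaves only `untouched`, which is why the original attempts returned just 'Heroes '.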

edited Jun 02 '14 at 19:11

answered Jun 02 '14 at 18:56

Luis Masuelli


I know. Read again please. – Luis Masuelli Jun 02 '14 at 19:04
Read again both the question AND the answer. He wants to keep "Heroes Chapter 91", discarding the rest. – Luis Masuelli Jun 02 '14 at 19:04
Thanks for the help, I tested it and it returns "Heroes" for some reason. Same happens with cheshircat's answer, unless I'm doing something wrong. >>> title = "Heroes Chapter 91 - Rescue" >>> import re >>> re.sub('(Chapter \d+).*', '\1', title) 'Heroes \x01' – Jesús León Jun 02 '14 at 19:08
Debug your string before and after the regex matching (i.e. use a breakpoint or a plain old print in the console). – Luis Masuelli Jun 02 '14 at 19:11

score 1 · Answer 3 · answered Jun 02 '14 at 18:56


Try

title = re.sub(r'(Chapter \d+) .*', r'\1', title)
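The replacement string needs some care. In an ordinary (non-raw) Python string literal, '\1' is an octal escape for the single character with code 1, not a backreference, which would explain a result like 'Heroes \x01'. The sketch below compares the three spellings; the variable names are illustrative.

```python
import re

title = "Heroes Chapter 91 - Rescue"
pattern = r'(Chapter \d+) .*'

# '\1' is chr(1): the match is replaced by a control character.
plain = re.sub(pattern, '\1', title)

# r'\1' and '\\1' both pass a backslash followed by 1 to re.sub,
# which it interprets as "insert group 1".
raw = re.sub(pattern, r'\1', title)
escaped = re.sub(pattern, '\\1', title)

print(repr(plain), repr(raw), repr(escaped))
```

Writing both the pattern and the replacement as raw strings avoids this class of surprise.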

cheshircat

score 1 · Answer 4 · answered Jun 02 '14 at 18:57


Try using a lookbehind:

re.sub('(?<=Chapter \d+) - .*', '', title)

If re doesn't support quantifiers in the lookbehind, go with cheshircat's solution.
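Python's standard `re` module only accepts fixed-width lookbehinds, so a lookbehind containing `\d+` is rejected when the pattern is compiled. The sketch below shows that, plus one lookbehind-free alternative (not from the answers here) that finds the end of "Chapter <number>" and slices the string there.

```python
import re

title = "Heroes Chapter 91 - Rescue"

try:
    re.compile(r'(?<=Chapter \d+) - .*')
    lookbehind_ok = True
except re.error as exc:
    # The standard re module refuses variable-width lookbehinds.
    lookbehind_ok = False
    print("lookbehind rejected:", exc)

# Alternative: locate "Chapter <number>" and cut right after it.
m = re.search(r'Chapter \d+', title)
result = title[:m.end()] if m else title
print(result)
```

The slicing version leaves the title unchanged when no chapter marker is found, which matches the behaviour of the capturing-group substitution.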

tenub

It returns an error, I guess it doesn't support it :( – Jesús León Jun 02 '14 at 19:03
Perhaps it doesn't support the quantifiers. – Luis Masuelli Jun 02 '14 at 19:05
