Get substring between strings. But the start occurs multiple times

Question

If I have the following string:

s = 'sdsdsdBetreft:ddddddBetreft:HOOOIIIIgagaga'

How do I get the HOOOIIII?

I tried the following:

import re
p = re.search(r'Betreft:(.*?)gagaga', s).group(1)
print(p)

But that gives me:

ddddddBetreft:HOOOIIII

This is because 'Betreft' occurs multiple times. I'm lost.

Any tips?

I'm also lost. What is the logic by which `HOOOIIII` gets targeted within your larger string? — Tim Biegeleisen, Sep 28 '18 at 02:17
@TimBiegeleisen Sorry if I confused you. The logic is that I need everything that is between Betreft: and gagaga, which is HOOOIIII in this case — , Sep 28 '18 at 02:18
Well...technically your current code is already doing this. How do we know to target the second `Betreft` ? — Tim Biegeleisen, Sep 28 '18 at 02:19
Maybe there is something to tell the regex to look where Betreft and gagaga are the closest? — , Sep 28 '18 at 02:24

score 0 · Answer 1 · answered Sep 28 '18 at 02:20

If you want to ensure that you don't capture anything before the last Betreft, then one option is to use lookarounds. Consider the following tempered dot:

(?:(?!Betreft:).)*

This consumes one character at a time, but only as long as a lookahead at that position does not see the string Betreft:. In the context of the pattern below, this is one way to avoid beginning the match at an earlier occurrence of Betreft:.

import re
s = 'sdsdsdBetreft:ddddddBetreft:HOOOIIIIgagaga'
p = re.search(r'(?<=Betreft:)(?:(?!Betreft:).)*(?=gagaga)', s).group(0)
print(p)

HOOOIIII
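
As a rough sketch of how the tempered dot could be wrapped for reuse, the snippet below builds the pattern from arbitrary start and end strings. The helper name `between_nearest` is made up for illustration and is not part of the `re` module. It uses a capture group instead of lookarounds, which is an equivalent alternative rather than the exact pattern above.

```python
import re

def between_nearest(text, start, end):
    """Return the text between `end` and the closest preceding `start`, or None.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Tempered dot: consume a character only if `start` does not begin there.
    pattern = (
        re.escape(start)
        + r'((?:(?!' + re.escape(start) + r').)*?)'
        + re.escape(end)
    )
    m = re.search(pattern, text, flags=re.DOTALL)
    return m.group(1) if m else None

s = 'sdsdsdBetreft:ddddddBetreft:HOOOIIIIgagaga'
print(between_nearest(s, 'Betreft:', 'gagaga'))  # HOOOIIII
```

Returning `None` instead of calling `.group()` directly avoids an `AttributeError` when either marker is missing.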

blhsing · Answer 2 · 2018-09-28T02:29:48.877

You can add a greedy .* in front of your regex so that it consumes all the preceding occurrences of Betreft: and the match starts at the last one:

re.search(r'.*Betreft:(.*?)gagaga', s).group(1)

This returns: HOOOIIII
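
A small sketch of what the greedy prefix does in practice, including one case worth knowing about. The second string `s2` is an invented example, not from the question: when several Betreft:...gagaga pairs exist, this pattern returns only the last one.

```python
import re

s = 'sdsdsdBetreft:ddddddBetreft:HOOOIIIIgagaga'
pattern = re.compile(r'.*Betreft:(.*?)gagaga')

# The greedy .* first runs to the end of the string, then backtracks until
# "Betreft:" can match, so the capture starts after the last occurrence.
m = pattern.search(s)
result = m.group(1) if m else None
print(result)  # HOOOIIII

# Invented input with two complete pairs: only the last pair is captured.
s2 = 'Betreft:AgagagaBetreft:Bgagaga'
m2 = pattern.search(s2)
result2 = m2.group(1) if m2 else None
print(result2)  # B
```

Note that `.` does not match newlines by default, so for multi-line input the `re.DOTALL` flag would probably be needed.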

edited Sep 28 '18 at 02:29

answered Sep 28 '18 at 02:24

blhsing

You probably want to use lazy dot `(.*?)` here, in case `gagaga` could occur more than once. – Tim Biegeleisen Sep 28 '18 at 02:29
Indeed. Edited as suggested then. – blhsing Sep 28 '18 at 02:30

score 0 · Accepted Answer · answered Sep 28 '18 at 02:51

The source of your problem is that expressions like .* or .*? can match more text than the regex author intended.

One possible solution is to match a sequence of characters other than :. The non-empty variant is probably the better choice, so the central part of the regex should be: [^:]+.

Since you have defined "border strings" (before and after the wanted text), use them as a positive lookbehind and a positive lookahead, so the whole regex can be:

(?<=Betreft:)[^:]+(?=gagaga)
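
Here is a minimal sketch of that regex applied to the string from the question. The second input, containing a colon inside the wanted text, is an invented example to show the main limitation of excluding : from the match.

```python
import re

s = 'sdsdsdBetreft:ddddddBetreft:HOOOIIIIgagaga'
pattern = re.compile(r'(?<=Betreft:)[^:]+(?=gagaga)')

# [^:]+ cannot cross the colon of the second "Betreft:", so the attempt
# after the first occurrence fails and the match starts after the second.
m = pattern.search(s)
found = m.group(0) if m else None
print(found)  # HOOOIIII

# Invented input: a colon inside the wanted text prevents any match.
m_colon = pattern.search('Betreft:12:30gagaga')
print(m_colon)  # None
```

This approach is simple, but it only fits if the text between the markers never contains a colon.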

Valdi_Bo

What does [^:]+ do exactly? – Sep 28 '18 at 08:25
It means: A sequence of chars other than ":". "+" means "1 or more". Note that you put ".*?" in your regex, where "*" means "0 or more", so also the **empty** text will be matched. Do you really accept empty string? – Valdi_Bo Sep 30 '18 at 10:39
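
To illustrate the difference between "+" and "*" described in the comment above, here is a short sketch using an invented input where nothing sits between the two markers:

```python
import re

# Invented input: the markers are adjacent, so the wanted text is empty.
s_empty = 'Betreft:gagaga'

# "*" means 0 or more, so the empty string counts as a match.
star = re.search(r'(?<=Betreft:)[^:]*(?=gagaga)', s_empty)
print(repr(star.group(0)) if star else None)  # ''

# "+" means 1 or more, so there is no match at all.
plus = re.search(r'(?<=Betreft:)[^:]+(?=gagaga)', s_empty)
print(plus)  # None
```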
