
If I have the following string:

s = 'sdsdsdBetreft:ddddddBetreft:HOOOIIIIgagaga'

How do I get the HOOOIIII?

I tried the following:

p = re.search(r'Betreft:(.*?)gagaga', s).group(1)
print(p)

But that gives me:

ddddddBetreft:HOOOIIII

This is because 'Betreft' occurs multiple times. I'm lost.

Any tips?

blhsing
  • I'm also lost. What is the logic by which `HOOOIIII` gets targeted within your larger string? – Tim Biegeleisen Sep 28 '18 at 02:17
  • @TimBiegeleisen Sorry if I confused you. But the logic is that I know I need everything that is between Betreft: and gagaga, which is HOOOIIII in this case. –  Sep 28 '18 at 02:18
  • Well...technically your current code is already doing this. How do we know to target the second `Betreft` ? – Tim Biegeleisen Sep 28 '18 at 02:19
  • That's my point. I have no idea –  Sep 28 '18 at 02:23
  • Maybe there is something to tell the regex to look where Betreft and gagaga are the closest? –  Sep 28 '18 at 02:24

3 Answers


If you want to ensure that you don't capture anything before the last Betreft:, one option is to use lookarounds. Consider the following tempered dot:

(?:(?!Betreft:).)*

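This consumes one character at a time, but only while the lookahead does not see the string Betreft: starting at the current position. In the context of the pattern below, a match that begins at an earlier occurrence of Betreft: can never run across a later one, so it fails to reach gagaga and the engine moves on to the next occurrence.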

import re

s = 'sdsdsdBetreft:ddddddBetreft:HOOOIIIIgagaga'
p = re.search(r'(?<=Betreft:)(?:(?!Betreft:).)*(?=gagaga)', s).group(0)
print(p)

HOOOIIII
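
As a rough sketch of how this behaves on other inputs, the same pattern can be wrapped in a small helper. The function name `extract_between` and the extra test strings are made up for illustration; only the pattern itself comes from this answer.

```python
import re

# Lookbehind for the opening marker, a tempered dot that refuses to step
# over another "Betreft:", and a lookahead for the closing "gagaga".
PATTERN = re.compile(r'(?<=Betreft:)(?:(?!Betreft:).)*(?=gagaga)')

def extract_between(text):
    """Hypothetical helper: return the matched text, or None if nothing matches."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

print(extract_between('sdsdsdBetreft:ddddddBetreft:HOOOIIIIgagaga'))  # HOOOIIII
print(extract_between('no markers here'))                             # None
```

Returning None instead of calling `.group(0)` directly avoids an AttributeError when the input has no match, which the one-liner above would raise.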


Tim Biegeleisen

You can add a greedy .* in front of your regex so that it consumes all the preceding occurrences of Betreft: and leaves only the last one to match:

re.search(r'.*Betreft:(.*?)gagaga', s).group(1)

This returns: HOOOIIII
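
To see why the leading .* changes the result, a side-by-side comparison may help. This is only a sketch using the string from the question. Note that `.` does not match newlines by default, so multi-line input would presumably need `re.DOTALL`; that is an assumption about your data, not something the question states.

```python
import re

s = 'sdsdsdBetreft:ddddddBetreft:HOOOIIIIgagaga'

# Without the prefix, the engine starts at the leftmost "Betreft:".
first = re.search(r'Betreft:(.*?)gagaga', s).group(1)

# The greedy ".*" grabs as much as it can and gives characters back only
# until the rest of the pattern fits, so the match uses the last
# "Betreft:" that is still followed by "gagaga".
last = re.search(r'.*Betreft:(.*?)gagaga', s).group(1)

print(first)  # ddddddBetreft:HOOOIIII
print(last)   # HOOOIIII
```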

blhsing

The source of your problem is that expressions like .* often match more text than the regex author intended.

One possible solution is to match a sequence of chars other than :, and probably the better choice is the non-empty variant, so the central part of the regex should be: [^:]+.

Since you have defined "border strings" (before and after the matched text), use them as a positive lookbehind and a positive lookahead, so the whole regex can be:

(?<=Betreft:)[^:]+(?=gagaga)
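
A minimal sketch of this regex in use follows. The second input string is made up to show one limitation: because [^:]+ cannot cross a colon, the pattern finds nothing when the wanted text itself contains ":".

```python
import re

pattern = re.compile(r'(?<=Betreft:)[^:]+(?=gagaga)')

s = 'sdsdsdBetreft:ddddddBetreft:HOOOIIIIgagaga'
m = pattern.search(s)
result = m.group(0) if m else None
print(result)  # HOOOIIII

# Made-up input: the text between the borders contains ":", so there is no match.
with_colon = pattern.search('Betreft:HOO:IIIIgagaga')
print(with_colon)  # None
```
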
Valdi_Bo
  • What does [^:]+ do exactly? –  Sep 28 '18 at 08:25
  • It means: a sequence of chars other than ":". "+" means "1 or more". Note that you put ".*?" in your regex, where "*" means "0 or more", so the **empty** text will also be matched. Do you really accept an empty string? – Valdi_Bo Sep 30 '18 at 10:39
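
The difference between "0 or more" and "1 or more" described in the comment above can be checked with a made-up input in which nothing sits between the two border strings. This is only an illustrative sketch; the input string is not from the question.

```python
import re

empty_case = 'Betreft:gagaga'

# ".*?" may match zero characters, so the group is an empty string.
star = re.search(r'Betreft:(.*?)gagaga', empty_case)
print(repr(star.group(1)))  # ''

# "[^:]+" needs at least one character before "gagaga", so there is no match.
plus = re.search(r'(?<=Betreft:)[^:]+(?=gagaga)', empty_case)
print(plus)  # None
```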