Regex Capture Multiple Phrases after One

Question

I am trying to figure out how to make a regex capture several items, but only those that come after one particular marker. I am using Python for this. For example, given the text B <4>.<5> <6> A <1> m<2> . <3>, I want to capture only 1, 2, and 3. I thought a regular expression like A.*?<(.+?)> would work, but it only captures the final 3 using Python's re.findall. Can I get any help with this?

Are you trying to capture the 1, 2, and 3 as separate groups or one group containing all of them? — BrenBarn, Oct 06 '13 at 18:30
possible duplicate of [Python regex multiple groups](http://stackoverflow.com/questions/4963691/python-regex-multiple-groups) — BrenBarn, Oct 06 '13 at 18:33
It doesn't matter to me, but I was originally trying to make them in separate groups. — Paul, Oct 06 '13 at 18:35

score 2 · Answer 1 · answered Oct 06 '13 at 18:49

The third-party regex module (proposed as a future replacement for re) supports variable-width lookbehinds, which makes it fairly easy:

s = "B <4>.<5> <6> A23 <1> m<2> . <3>"

import regex
print(regex.findall(r'(?<=A\d+.*)<.+?>', s))
# ['<1>', '<2>', '<3>']

(I'm using A\d+ instead of just A to make things interesting.) If you're bound to the stock re, you're forced into ugly workarounds like this:

import re
print(re.findall(r'(<[^<>]+>)(?=(?:.(?!A\d+))*$)', s))
# ['<1>', '<2>', '<3>']

or pre-splitting:

print(re.findall(r'<.+?>', re.split(r'A\d+', s)[-1]))
# ['<1>', '<2>', '<3>']
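
A minimal sketch of the pre-splitting idea as a reusable function, using only the stock re module. The name tags_after is hypothetical (it is not from the answer), and it anchors on the first match of the marker, whereas re.split(...)[-1] keeps what follows the last one and hands back the whole string when the marker is absent.

```python
import re

def tags_after(marker, text):
    """Return the contents of every <...> found after the first match of `marker`."""
    found = re.search(marker, text)
    if found is None:
        return []                      # no marker, so nothing qualifies
    tail = text[found.end():]          # everything after the marker
    return re.findall(r'<([^<>]+)>', tail)

s = "B <4>.<5> <6> A23 <1> m<2> . <3>"
print(tags_after(r'A\d+', s))          # ['1', '2', '3']
```

Returning an empty list when the marker is missing is a design choice of this sketch; raising an error instead may suit some callers better.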

Jerry · Answer 2 · 2013-10-06T18:38:14.220

It would be easier with a variable-width lookbehind, but an alternative is to make sure there's no A after the parts you're matching, using something like:

re.findall(r'<(.+?)>(?![^A]*A[^A]*$)', 'B <4>.<5> <6> A <1> m<2> . <3>')

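But there's a problem here: (.+?) accepts any character, including >, so when the lookahead rejects an early match the group can stretch across several tags and break what you're looking for. Use a negated class instead: [^>]+ in place of .+?.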

This means:

re.findall(r'<([^>]+)>(?![^A]*A[^A]*$)', 'B <4>.<5> <6> A <1> m<2> . <3>')
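
To see the difference concretely, here is a small comparison of the two patterns on the sample text (the variable names are only illustrative). With .+? the first result spans several tags, because the group is free to run past > until the lookahead finally succeeds:

```python
import re

text = 'B <4>.<5> <6> A <1> m<2> . <3>'
lookahead = r'(?![^A]*A[^A]*$)'        # same lookahead in both patterns

lazy = re.findall(r'<(.+?)>' + lookahead, text)
negated = re.findall(r'<([^>]+)>' + lookahead, text)

print(lazy)     # ['4>.<5> <6> A <1', '2', '3']
print(negated)  # ['1', '2', '3']
```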


(?![^A]*A[^A]*$) rejects a match when exactly one A lies between it and the end of the string. Since this text contains a single A, that excludes everything before it.

(?! ... ) is a negative lookahead which makes the match fail if what's inside is matched.

[^A]* matches any run of characters other than A (possibly empty).

$ matches the end of the string.
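
One thing worth checking before relying on this lookahead: it only rejects positions with exactly one A still ahead, so a second A changes the result. The sketch below uses a made-up input with two As and compares it with the shorter (?![^A]*A), a variation not taken from the answer, which rejects any position that still has an A ahead:

```python
import re

two_markers = 'x <0> A <4> A <1>'      # hypothetical input with two As

# Original lookahead: fails only when exactly one A remains ahead.
one_a_ahead = re.findall(r'<([^>]+)>(?![^A]*A[^A]*$)', two_markers)

# Variation: fails whenever any A remains ahead.
any_a_ahead = re.findall(r'<([^>]+)>(?![^A]*A)', two_markers)

print(one_a_ahead)  # ['0', '1']
print(any_a_ahead)  # ['1']
```

Note that the variation keeps only what follows the last A, which may or may not be what you want when the marker repeats.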

score 1 · Answer 3 · 2013-10-06T19:28:01.860


As it currently stands, your pattern matches text between < and > that comes after an A followed by as few characters as possible. Every match has to start at an A, and your text contains only one, so <1> is the only part that fulfills this condition (which is why that is all that gets returned).
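
A quick check of that behaviour, as a sketch on the sample text. The greedy variant A.*<(.+?)> is not from the question; it is included only because it is one way to end up with just the last item:

```python
import re

text = 'B <4>.<5> <6> A <1> m<2> . <3>'

# Lazy .*? stops at the first < after A; there is no second A to start another match.
lazy_result = re.findall(r'A.*?<(.+?)>', text)

# Greedy .* runs to the end of the string, then backtracks to the last <.
greedy_result = re.findall(r'A.*<(.+?)>', text)

# The full match (not just the group) shows how much text was consumed.
consumed = [m.group(0) for m in re.finditer(r'A.*?<(.+?)>', text)]

print(lazy_result)    # ['1']
print(greedy_result)  # ['3']
print(consumed)       # ['A <1>']
```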

There are many ways to fix this problem, but I think the most straightforward is to first split on A, then use <(.+?)>:

>>> from re import findall, split
>>> text = 'B <4>.<5> <6> A <1> m<2> . <3>'
>>> text = split('A', text)
>>> text
['B <4>.<5> <6> ', ' <1> m<2> . <3>']
>>> text = text[1]
>>> text
' <1> m<2> . <3>'
>>> text = findall('<(.+?)>', text)
>>> text
['1', '2', '3']
>>>

Above is a step-by-step demonstration. Below is the code you will want:

>>> text = 'B <4>.<5> <6> A <1> m<2> . <3>'
>>> findall('<(.+?)>', split('A', text)[1])
['1', '2', '3']
>>>
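
One caveat, sketched on a made-up input: if the text contains more than one A, split('A', text)[1] keeps only the piece between the first two As. Passing maxsplit=1 to re.split cuts at the first A only. The sample string below is hypothetical.

```python
from re import findall, split

text = 'B <4> A <1> A <2>'             # hypothetical input with two As

# Splitting on every A: index 1 is only the text between the two As.
between_first_two = findall('<(.+?)>', split('A', text)[1])

# Splitting once: index 1 is everything after the first A.
after_first = findall('<(.+?)>', split('A', text, maxsplit=1)[1])

print(between_first_two)  # ['1']
print(after_first)        # ['1', '2']
```

Either form raises IndexError when the text contains no A at all, so a guard may be needed for untrusted input.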

edited Oct 06 '13 at 19:28

answered Oct 06 '13 at 18:33

Isn't it the other way around? (?.+) instead of (.+?)? I think you are trying to make a "non-greedy" search. Am I right? EDIT: You are right. It's (.+?) according to Python's reference. – Robson França Oct 06 '13 at 18:36
No. The way I put it makes it a non-greedy match. – Oct 06 '13 at 18:37
@RobsonFrança `(?.+)` is not valid regex. `(?:.+)` maybe, but not `(?.+)`. – Jerry Oct 06 '13 at 18:39
