I am trying to figure out how to make a regex capture several items, but only those that come after one particular thing. I am using Python for this. One example would be the text `B <4>.<5> <6> A <1> m<2> . <3>`, with the intent of capturing only 1, 2, and 3. I thought a regular expression like `A.*?<(.+?)>` would work, but it only captures the final 3 using Python's `re.findall`. Can I get any help with this?

Are you trying to capture the 1, 2, and 3 as separate groups or one group containing all of them? – BrenBarn Oct 06 '13 at 18:30
possible duplicate of [Python regex multiple groups](http://stackoverflow.com/questions/4963691/python-regex-multiple-groups) – BrenBarn Oct 06 '13 at 18:33
It doesn't matter to me, but I was originally trying to make them in separate groups. – Paul Oct 06 '13 at 18:35
3 Answers
The `regex` module (a third-party alternative to `re`, available on PyPI) supports variable-width lookbehinds, which makes this fairly easy:
import regex  # third-party module: pip install regex
s = "B <4>.<5> <6> A23 <1> m<2> . <3>"
print(regex.findall(r'(?<=A\d+.*)<.+?>', s))
# ['<1>', '<2>', '<3>']
(I'm using `A\d+` instead of just `A` to make things interesting.) If you're bound to the stock `re`, you're forced into ugly workarounds like this:
import re
print(re.findall(r'(<[^<>]+>)(?=(?:.(?!A\d+))*$)', s))
# ['<1>', '<2>', '<3>']
or pre-splitting:
print(re.findall(r'<.+?>', re.split(r'A\d+', s)[-1]))
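Another option with the stock `re` module, sketched here as an illustration rather than as part of the original answer, is to locate the marker with `search` and then pass its end offset as the `pos` argument of a compiled pattern's `findall`. The names `marker`, `item`, and `found` are made up for this example. Note that this starts after the first marker, whereas `re.split(...)[-1]` starts after the last one; the two agree here because the text contains only one marker.

```python
import re

s = "B <4>.<5> <6> A23 <1> m<2> . <3>"

marker = re.compile(r"A\d+")    # the "one particular thing" to look for
item = re.compile(r"<[^<>]+>")  # a single <...> item

m = marker.search(s)
# Scan only from the end of the marker onward; empty list if no marker.
found = item.findall(s, m.end()) if m else []
print(found)
# ['<1>', '<2>', '<3>']
```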

It would be easier with a variable-width lookbehind, but an alternative is to make sure there's no `A` after the parts you're matching, so that you can use something like:
re.findall(r'<(.+?)>(?![^A]*A[^A]*$)', 'B <4>.<5> <6> A <1> m<2> . <3>')
But there's a problem here: `(.+?)` accepts anything, which can break what you're looking for. You can use a negated class, `[^>]+`, instead of `.+?`.
This means:
re.findall(r'<([^>]+)>(?![^A]*A[^A]*$)', 'B <4>.<5> <6> A <1> m<2> . <3>')
`(?![^A]*A[^A]*$)` makes sure there's no `A` ahead of the part you're capturing.
`(?! ... )` is a negative lookahead, which makes the match fail if what's inside it matches.
`[^A]*` matches zero or more characters other than `A`.
`$` matches the end of the string.
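To see why the negated class matters, here is a small comparison of the two patterns on the sample text (the variable names `lazy` and `negated` are just for this illustration). With `.+?`, the engine is free to stretch the first match across several `>` characters until the lookahead is satisfied, so it swallows everything up to `<1>`.

```python
import re

text = 'B <4>.<5> <6> A <1> m<2> . <3>'

# Lazy dot: the first match expands past '>' until no 'A' lies ahead.
lazy = re.findall(r'<(.+?)>(?![^A]*A[^A]*$)', text)
print(lazy)
# ['4>.<5> <6> A <1', '2', '3']

# Negated class: a match can never run past its own closing '>'.
negated = re.findall(r'<([^>]+)>(?![^A]*A[^A]*$)', text)
print(negated)
# ['1', '2', '3']
```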

As it currently stands, your pattern matches text between `<` and `>` that comes after `A` followed by zero or more characters. Because every match has to begin at an `A`, and your text contains only one, the only part of your text that fulfills this condition is `<1>` (which is why that is all that gets returned).
There are many ways to fix this problem, but I think the most straightforward is to first split on `A`, then use `<(.+?)>`:
>>> from re import findall, split
>>> text = 'B <4>.<5> <6> A <1> m<2> . <3>'
>>> text = split('A', text)
>>> text
['B <4>.<5> <6> ', ' <1> m<2> . <3>']
>>> text = text[1]
>>> text
' <1> m<2> . <3>'
>>> text = findall('<(.+?)>', text)
>>> text
['1', '2', '3']
>>>
Above is a step-by-step demonstration. Below is the code you will want:
>>> text = 'B <4>.<5> <6> A <1> m<2> . <3>'
>>> findall('<(.+?)>', split('A', text)[1])
['1', '2', '3']
>>>
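If the marker might be missing, indexing with `[1]` raises an `IndexError`. A minimal sketch of a more defensive variant, assuming the marker is a literal string and that you want everything after its first occurrence, could use `str.partition` instead of `split`. The function name `items_after` is hypothetical, and the negated class `[^>]+` is swapped in for `.+?`:

```python
import re

def items_after(text, marker="A"):
    """Return the contents of every <...> found after the first marker."""
    _, sep, tail = text.partition(marker)
    if not sep:  # marker not present: nothing qualifies
        return []
    return re.findall(r"<([^>]+)>", tail)

print(items_after('B <4>.<5> <6> A <1> m<2> . <3>'))
# ['1', '2', '3']
print(items_after('B <4>'))
# []
```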
Isn't it the other way around? (?.+) instead of (.+?)? I think you are trying to make a "non-greedy" search. Am I right? EDIT: You are right. It's (.+?) according to Python's reference. – Robson França Oct 06 '13 at 18:36
@RobsonFrança `(?.+)` is not valid regex. `(?:.+)` maybe, but not `(?.+)`. – Jerry Oct 06 '13 at 18:39