0

I am trying to figure out how to make a regex capture several items, but only those that come after one particular marker. I am using Python for this. For example, given the text B <4>.<5> <6> A <1> m<2> . <3>, the intent is to capture only 1, 2, and 3. I thought a regular expression like A.*?<(.+?)> would work, but it only captures the final 3 using Python re.findall. Can I get any help with this?

Paul
  • Are you trying to capture the 1, 2, and 3 as separate groups or one group containing all of them? – BrenBarn Oct 06 '13 at 18:30
  • possible duplicate of [Python regex multiple groups](http://stackoverflow.com/questions/4963691/python-regex-multiple-groups) – BrenBarn Oct 06 '13 at 18:33
  • It doesn't matter to me, but I was originally trying to make them in separate groups. – Paul Oct 06 '13 at 18:35

3 Answers

2

The third-party regex module (installable from PyPI, separate from the standard-library re) supports variable-width lookbehinds, which makes it fairly easy:

s = "B <4>.<5> <6> A23 <1> m<2> . <3>"

import regex
print(regex.findall(r'(?<=A\d+.*)<.+?>', s))
# ['<1>', '<2>', '<3>']

(I'm using A\d+ instead of just A to make things interesting.) If you're bound to the stock re, you're forced into ugly workarounds like this:

import re
print(re.findall(r'(<[^<>]+>)(?=(?:.(?!A\d+))*$)', s))
# ['<1>', '<2>', '<3>']

or pre-splitting:

print(re.findall(r'<.+?>', re.split(r'A\d+', s)[-1]))
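
Another way to stay within the stock re is to locate the marker first and start the search from that position. The following is only a sketch, assuming the items wanted are those after the first A\d+ marker (the pre-split above keeps what follows the last one); items_after, ITEM and MARKER are hypothetical names introduced for illustration:

```python
import re

ITEM = re.compile(r'<[^<>]+>')   # one <...> item without nested brackets
MARKER = re.compile(r'A\d+')     # the marker used in this answer

def items_after(text):
    """Return every <...> item that appears after the first marker."""
    m = MARKER.search(text)
    if m is None:
        return []
    # Compiled patterns accept a start position, so no slicing is needed.
    return ITEM.findall(text, m.end())

s = "B <4>.<5> <6> A23 <1> m<2> . <3>"
print(items_after(s))  # ['<1>', '<2>', '<3>']
```

Returning an empty list when the marker is missing is a design choice made for this sketch; raising an error may suit other callers better.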
georg
1

It would be easier with a variable-width lookbehind, but an alternative is to make sure there's no A after the parts you're matching, so that you can use something like:

re.findall(r'<(.+?)>(?![^A]*A[^A]*$)', 'B <4>.<5> <6> A <1> m<2> . <3>')

But there's a problem here: (.+?) accepts any character, including >, so a match can run past the closing bracket and break what you're looking for. You can use a negated class instead: [^>]+ in place of .+?.

This means:

re.findall(r'<([^>]+)>(?![^A]*A[^A]*$)', 'B <4>.<5> <6> A <1> m<2> . <3>')

regex101 demo

(?![^A]*A[^A]*$) makes sure there's no A ahead of the part you're capturing.

(?! ... ) is a negative lookahead which makes the match fail if what's inside is matched.

[^A]* matches zero or more characters other than A.

$ matches the end of the string.
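
One caveat worth checking: as written, the lookahead rejects a match only when exactly one A remains ahead of it, so an input with two or more A's can let earlier items through. Below is a small sketch comparing it with a shorter lookahead, (?![^A]*A), which rejects a match whenever any A is still ahead. It assumes that "after the last A" is the behaviour wanted, and the second sample string is made up for illustration:

```python
import re

original = re.compile(r'<([^>]+)>(?![^A]*A[^A]*$)')
shorter = re.compile(r'<([^>]+)>(?![^A]*A)')  # assumption: items after the last A

one_a = 'B <4>.<5> <6> A <1> m<2> . <3>'
two_a = 'X <9> A <0> A <1>'   # hypothetical input with two markers

print(original.findall(one_a))  # ['1', '2', '3']
print(shorter.findall(one_a))   # ['1', '2', '3']
print(original.findall(two_a))  # ['9', '1']
print(shorter.findall(two_a))   # ['1']
```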

Jerry
1

As it currently stands, your pattern matches an A, followed by as few characters as possible, followed by text between < and >. Every match therefore has to start at an A, and the text contains only one, so the only part of your text that fulfills this condition is <1> (which is why that is all that gets returned).
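
A quick check of this on the question's sample text, as a sketch. The greedy variant is not from the question; it is included only for comparison, since it is one pattern that would yield 3 instead:

```python
import re

text = 'B <4>.<5> <6> A <1> m<2> . <3>'

# Lazy .*? stops at the first <...> after the A.
lazy = re.findall(r'A.*?<(.+?)>', text)
print(lazy)    # ['1']

# Greedy .* runs on to the last <...> instead.
greedy = re.findall(r'A.*<(.+?)>', text)
print(greedy)  # ['3']
```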

There are many ways to fix this problem, but I think the most straightforward is to first split on A, then use <(.+?)>:

>>> from re import findall, split
>>> text = 'B <4>.<5> <6> A <1> m<2> . <3>'
>>> text = split('A', text)
>>> text
['B <4>.<5> <6> ', ' <1> m<2> . <3>']
>>> text = text[1]
>>> text
' <1> m<2> . <3>'
>>> text = findall('<(.+?)>', text)
>>> text
['1', '2', '3']
>>>

Above is a step-by-step demonstration. Below is the code you will want:

>>> text = 'B <4>.<5> <6> A <1> m<2> . <3>'
>>> findall('<(.+?)>', split('A', text)[1])
['1', '2', '3']
>>>
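
Note that split('A', text)[1] raises IndexError when the text contains no A, and keeps only the part between the first and second A when there are several. A hedged sketch of a small wrapper built on str.partition, assuming everything after the first A is wanted; capture_after and its marker parameter are hypothetical names:

```python
from re import findall

def capture_after(text, marker='A'):
    """Return the contents of every <...> after the first marker."""
    # partition always returns three parts; sep is '' if marker is absent.
    _, sep, tail = text.partition(marker)
    if not sep:
        return []
    return findall(r'<([^>]+)>', tail)

print(capture_after('B <4>.<5> <6> A <1> m<2> . <3>'))  # ['1', '2', '3']
print(capture_after('B <4>.<5> <6>'))                   # []
```
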
  • Isn't it the other way around? (?.+) instead of (.+?)? I think you are trying to make a "non-greedy" search. Am I right? EDIT: You are right. It's (.+?) according to Python's reference. – Robson França Oct 06 '13 at 18:36
  • No. The way I put it makes it a non-greedy match. –  Oct 06 '13 at 18:37
  • @RobsonFrança `(?.+)` is not valid regex. `(?:.+)` maybe, but not `(?.+)`. – Jerry Oct 06 '13 at 18:39