re.findall... finds too much! :)

Question

While experimenting with regexes in Python's re.findall, I came across this problem:

import re

line = "Lorem ipsum HELLO dolor sit amet, GOODBYE consectetuer adipiscing elit, HELLO sed diam nonummy nibh GOODBYE all"

x = re.findall("(HELLO)(.*)(GOODBYE)", line, flags=re.MULTILINE)

print(x)

This will output:

[('HELLO', ' dolor sit amet, GOODBYE consectetuer adipiscing elit, HELLO sed diam nonummy nibh ', 'GOODBYE')]

But what I wanted was more like...

[('HELLO', ' dolor sit amet', 'GOODBYE'), ('HELLO', 'sed diam nonummy nibh ', 'GOODBYE')]

So instead of taking the pairs one at a time, re.findall (given the way I have defined the pattern) seems to use the first HELLO and the last GOODBYE to delimit a single match, and it puts everything in between into the middle group.

Is there a way to get the result I was after? I thought maybe "serializing" the HELLO and GOODBYE pairs might help, sort of like this:

line = "Lorem ipsum HELLO_1 dolor sit amet, GOODBYE_1 consectetuer adipiscing elit, HELLO_2 sed diam nonummy nibh GOODBYE_2 all"

But that seems to make the problem harder.

Any helpful ideas most appreciated!

Thank you. That worked, although I'm not sure why. I have been using the W3Schools Python regex page, and they don't mention what the ? means. – Will, Oct 24 '20 at 06:11
help(re) covers it: "?" Matches 0 or 1 (greedy) of the preceding RE – nortally, Jan 12 '21 at 17:57

DYZ · Answer 1 · 2020-10-24T06:17:10.373

5

Your pattern uses the greedy .* quantifier, which matches as many characters as possible. Replace it with the non-greedy .*?, which matches as few as possible:

x = re.findall("(HELLO)(.*?)(GOODBYE)", line, flags=re.M)
# [('HELLO', ' dolor sit amet, ', 'GOODBYE'),
#  ('HELLO', ' sed diam nonummy nibh ', 'GOODBYE')]
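
To see why this works, it may help to run both patterns side by side. The following is only a small sketch built on the sample line from the question; the names greedy and lazy are made up for illustration. The re.M flag is left out because it only changes how ^ and $ behave, and this pattern uses neither.

```python
import re

line = ("Lorem ipsum HELLO dolor sit amet, GOODBYE consectetuer adipiscing "
        "elit, HELLO sed diam nonummy nibh GOODBYE all")

# Greedy: .* first runs to the end of the line, then backtracks
# until GOODBYE can match, so it settles on the LAST GOODBYE.
greedy = re.findall(r"(HELLO)(.*)(GOODBYE)", line)

# Non-greedy: .*? grows one character at a time and stops as soon
# as GOODBYE can match, so it settles on the NEAREST GOODBYE.
lazy = re.findall(r"(HELLO)(.*?)(GOODBYE)", line)

print(len(greedy))  # 1
print(len(lazy))    # 2

# Optional cleanup: trim the surrounding spaces and commas from the middle group.
print([middle.strip(" ,") for _, middle, _ in lazy])
# ['dolor sit amet', 'sed diam nonummy nibh']
```

Note that in .*? the ? does not mean "0 or 1" as it does on its own; placed directly after another quantifier (*, + or ?) it switches that quantifier to non-greedy matching.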

edited Oct 24 '20 at 06:17

answered Oct 24 '20 at 05:55

DYZ


Is that literally what it is called? An eager .* operator? What is the .*? operator called, if it has a name? :) – Will Oct 24 '20 at 06:13
It's called greedy, my bad. – DYZ Oct 24 '20 at 06:17
