Python, regex and html: match final tag on line

Question

I'm confused about Python's greedy/non-greedy quantifiers.

"Given multi-line html, return the final tag on each line."

I would think this would be correct:

re.findall('<.*?>$', html, re.MULTILINE)

I'm irked because I expected a list of single tags like:

"</html>", "<ul>", "</td>".

My O'Reilly Pocket Reference says that *? will "match 0 or more times, but as few times as possible."

So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?

You shouldn't be using RegEx to parse HTML. You should be using an (x)html parser like BeautifulSoup or minidom. — g.d.d.c, Nov 10 '11 at 20:37
See the top-voted answer to this question: http://stackoverflow.com/questions/1732348 — Jim Garrison, Nov 10 '11 at 20:41
In the interest of brevity, I didn't mention that I was just toying around to better understand regex. I didn't realize I accidentally asked one of the most commonly mal-framed questions on SO. — MockWhy, Nov 10 '11 at 21:51

Firstrock · Accepted Answer · 2011-11-10T20:52:54.297

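Your problem stems from the end-of-line anchor ('$'). A regex match always starts at the leftmost position where a match is possible, so the engine first locks onto the first '<' on the line. The non-greedy '.*?' then extends the match one character at a time until the rest of the pattern, '>$', succeeds. Because '.' also matches '<' and '>', and the only '>' that satisfies '$' is the one at the end of the line, the match runs from the first '<' through every intermediate tag to the final '>'. Non-greediness only limits how far a match extends from its starting point; it does not move the starting point to the right. So in this situation a non-greedy '*?' behaves no differently from a greedy '*'.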
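A minimal sketch of that behavior, assuming a made-up three-line `html` string (the asker's actual input was not posted, so the sample is hypothetical). It runs the lazy pattern from the question next to its greedy counterpart:

```python
import re

# Hypothetical sample input; not the asker's real html.
html = "<html><body>\n<ul><li>item</li></ul>\n</html>"

# Lazy version from the question, and the greedy equivalent.
lazy = re.findall(r'<.*?>$', html, re.MULTILINE)
greedy = re.findall(r'<.*>$', html, re.MULTILINE)

print(lazy)    # ['<html><body>', '<ul><li>item</li></ul>', '</html>']
print(greedy)  # same list: the '$' anchor forces both to reach the line end
```

Lines holding more than one tag come back whole, while the line with a single tag looks "correct", which matches the "some (but not all)" symptom in the question.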

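Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack; see @Mark's answer. '<[^><]*>$' will work: the character class '[^><]' cannot cross a tag boundary, so the match has to start at the last '<' on the line.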
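As a quick check of the negated-character-class pattern, here is a sketch run against a made-up `html` string (the sample and the name `last_tags` are assumptions for illustration only):

```python
import re

# Hypothetical sample input; not the asker's real html.
html = "<html><body>\n<ul><li>item</li></ul>\n</html>"

# [^><]* cannot consume '<' or '>', so each match is exactly one tag,
# and '$' (with re.MULTILINE) pins that tag to the end of its line.
last_tags = re.findall(r'<[^><]*>$', html, re.MULTILINE)

print(last_tags)  # ['<body>', '</ul>', '</html>']
```

Note that this assumes the tag is the very last thing on the line; trailing whitespace after the final '>' would make that line produce no match.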
