
I'm confused about Python's greedy/non-greedy quantifiers.

"Given multi-line html, return the final tag on each line."

I would think this would be correct:

re.findall('<.*?>$', html, re.MULTILINE)

I'm irked because I expected a list of single tags like:

"</html>", "<ul>", "</td>".

My O'Reilly's Pocket Reference says that *? will "match 0 or more times, but as few times as possible."

So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?

MockWhy
  • You shouldn't be using RegEx to parse HTML. You should be using an (x)html parser like BeautifulSoup or minidom. – g.d.d.c Nov 10 '11 at 20:37
  • See the top-voted answer to this question: http://stackoverflow.com/questions/1732348 – Jim Garrison Nov 10 '11 at 20:41
  • In the interest of brevity, I didn't mention that I was just toying around to better understand regex. I didn't realize I accidentally asked one of the most commonly mal-framed questions on SO. – MockWhy Nov 10 '11 at 21:51

1 Answer


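Your problem stems from the end-of-line anchor ('$'). Non-greedy matching only affects where a match ends, not where it starts. The engine scans left to right and starts the match at the first '<' on the line. The '.*?' then consumes as few characters as it can, but it has to keep expanding until the rest of the pattern, '>$', matches, and you have constrained that '>' to sit at the end of the line. Because '.' also matches '<' and '>', the match runs from the first '<' to the final '>' on the line. So a non-greedy * is no different from a greedy * in this situation, which is why only lines containing a single tag give you a single-tag match.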
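To see this concretely, here is a small sketch comparing the two quantifiers. The sample `html` string is made up for illustration, since the question did not include its input.

```python
import re

# Hypothetical sample input; the question did not show the actual HTML.
html = "<html><body>\n<ul><li>item</li></ul>\n</html>"

# Lazy and greedy versions of the same anchored pattern.
lazy = re.findall(r'<.*?>$', html, re.MULTILINE)
greedy = re.findall(r'<.*>$', html, re.MULTILINE)

print(lazy)            # ['<html><body>', '<ul><li>item</li></ul>', '</html>']
print(lazy == greedy)  # True
```

Both patterns start at the first '<' on each line and are forced by '$' to end at the last '>', so they return the same matches. Only the third line, which holds a single tag, yields a single tag.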

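Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack; see @Mark's answer. '<[^><]*>$' will work, because the negated character class cannot cross a '<' or '>', so the match can only begin at the last '<' on the line.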
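A quick check of that pattern, again on a made-up `html` string (an assumption, not the asker's data):

```python
import re

# Hypothetical sample input; the question did not show the actual HTML.
html = "<html><body>\n<ul><li>item</li></ul>\n</html>"

# [^><]* cannot consume '<' or '>', so each match is exactly one tag,
# and '$' (with re.MULTILINE) requires that tag to end the line.
last_tags = re.findall(r'<[^><]*>$', html, re.MULTILINE)

print(last_tags)  # ['<body>', '</ul>', '</html>']
```

Note that this only finds a tag when it is the very last thing on the line; a line with text or whitespace after its final tag produces no match.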

Firstrock