Python re.sub: non-greedy mode (.*?) with end of string ($) becomes greedy!

Question

Code:

import re
s = '<br><br />A<br />B'  # renamed from str to avoid shadowing the built-in
print(re.sub(r'<br.*?>\w$', '', s))

It is expected to return <br><br />A, but it returns an empty string ''!

Any suggestions?

Uh... hey... you're not parsing HTML with regular expressions, are you? — detly, Nov 25 '10 at 06:59
If you need to parse a lot of HTML then you'd be better off using something like http://www.crummy.com/software/BeautifulSoup/ instead of regex. — Matti Pastell, Nov 25 '10 at 07:00
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — ephemient, Nov 25 '10 at 08:50

score 7 · Accepted Answer · answered Nov 25 '10 at 05:57

Greediness (and laziness) works from left to right, but not the other way around. A lazy quantifier basically means "don't consume anything unless the rest of the pattern failed to match". Here's what's going on:

The regex engine matches <br at the start of the string.
.*? is skipped for now, because it is lazy.
The engine tries to match >, and succeeds.
It tries to match \w and fails, because the next character is <. Now it gets interesting: the engine starts backtracking and sees the .*? rule. In this case, . can match the first >, so there's still hope for that match.
This keeps happening until .*? has consumed everything up to and including the slash of the second tag. Then >\w can match (\w matches A), but $ fails. Again, the engine goes back to the lazy .*? rule and keeps extending it, until the match covers the whole string <br><br />A<br />B.
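
The walkthrough above can be checked directly. This is a minimal sketch using the string from the question (the names text and m are only for illustration): re.search reports where the match starts and ends, and it shows the match beginning at the very first <br, so the lazy quantifier has to stretch all the way to reach $.

```python
import re

text = '<br><br />A<br />B'

# The engine tries start positions from left to right and keeps the first
# one that can be made to work, even if the lazy part has to grow a lot.
m = re.search(r'<br.*?>\w$', text)

print(m.span())   # (0, 18): the match starts at the first <br
print(m.group())  # the entire input string
print(repr(re.sub(r'<br.*?>\w$', '', text)))  # ''
```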

Luckily, there's an easy solution: by using <br[^>]*>\w$ as the pattern, you don't allow the match to extend outside a single tag, so it replaces only the last occurrence.
Strictly speaking, this doesn't work well for HTML, because tag attributes can contain > characters, but I assume it's just an example.
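
As a quick check of the suggested pattern, here is a small sketch on the question's input (the names text and pattern are illustrative). Because [^>]* cannot cross a >, attempts starting at the first two tags fail outright and only the last tag matches.

```python
import re

text = '<br><br />A<br />B'

# [^>]* cannot consume a '>', so each attempt stays inside a single tag.
pattern = r'<br[^>]*>\w$'

print(re.search(pattern, text).span())  # (11, 18): only the last tag and B
print(re.sub(pattern, '', text))        # <br><br />A
```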

score 1 · Answer 2 · answered Nov 25 '10 at 05:56

Non-greediness doesn't make the match start later like that. The regex matches the first <br and then non-greedily matches the rest, which actually has to extend to the end of the string because you specified $.

To make it work the way you wanted, use

/<br[^<]*?>\w$/

but usually, it is not recommended to use regex to parse HTML, as an attribute's value can have < or > in it.
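
The pattern above is written with slash delimiters, as in Perl or JavaScript. A sketch of how it might be used from Python, assuming the string from the question (the names text and result are illustrative): the slashes are dropped, since Python's re module would otherwise try to match them literally.

```python
import re

text = '<br><br />A<br />B'

# Same pattern without the surrounding slashes. [^<]*? cannot cross a '<',
# so the match cannot run from one tag into the next.
result = re.sub(r'<br[^<]*?>\w$', '', text)

print(result)  # <br><br />A
```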
