Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then look at each of the following characters in turn, checking whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3 (or reaches the end of the input), and only then does it "notice" the rest of the regex (the </dd></dl>), backing up one character at a time until that part can match too.
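To see the overshoot concretely, here is a minimal sketch using Python's re module. The sample_html string is a made-up input (an assumption, not your actual page) containing two entries and no digits other than 3:

```python
import re

# Hypothetical sample input: two entries, no digit other than 3 anywhere.
sample_html = '<dl><dd><p>first entry</dd></dl><dl><dd><p>second entry</dd></dl>'

greedy_pattern = re.compile(r'<dd><p>([\D3]+)</dd></dl>')
greedy_match = greedy_pattern.search(sample_html)

# The greedy group runs to the end of the input, then backs up just far
# enough for the final </dd></dl> to match, so it swallows the first
# closing tags and the whole second entry.
print(greedy_match.group(1))
# prints: first entry</dd></dl><dl><dd><p>second entry
```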
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?), which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches; if not, it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
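The difference is easiest to see side by side. The following is only a sketch: sample_html is a hypothetical input, chosen so that one entry contains a 3 (which the class allows) and the input holds more than one </dd></dl>:

```python
import re

# Hypothetical sample input with two entries; "3" is allowed by [\D3].
sample_html = '<dl><dd><p>room 3</dd></dl><dl><dd><p>lobby</dd></dl>'

# Greedy: longest run of [\D3] that still lets the full regex match.
greedy_results = re.findall(r'<dd><p>([\D3]+)</dd></dl>', sample_html)

# Reluctant: shortest run of [\D3] that still lets the full regex match.
lazy_results = re.findall(r'<dd><p>([\D3]+?)</dd></dl>', sample_html)

print(greedy_results)  # ['room 3</dd></dl><dl><dd><p>lobby']
print(lazy_results)    # ['room 3', 'lobby']
```

With the greedy quantifier, findall returns a single oversized capture; with the reluctant one, each entry is captured separately.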
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.