How do I extract definitions from an HTML file?

Question

I'm trying to practice regular expressions by extracting function definitions from the built-in functions page of Python's standard library documentation. What I have so far is that the definitions generally sit between <dd><p> and </dd></dl>. When I try

import re
fhand = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fhand)

it doesn't actually stop at </dd></dl>. I'm probably missing something very silly here, but I've had a really hard time figuring this one out.

score 2 · Accepted Answer · answered Jun 15 '17 at 22:26

Regular expressions are evaluated left to right, in a sense. So in your regular expression,

r'<dd><p>([\D3]+)</dd></dl>'

the regex engine first looks for a <dd><p>, then examines each of the following characters in turn, checking whether it is a nondigit or 3, and if so adds it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It only stops when it reaches a digit other than 3 (or the end of the input). Only then does it "notice" the rest of the regex (the </dd></dl>), and it backs up just far enough to find one, which is the last </dd></dl> in the stretch it consumed rather than the first.
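To see this happen, here is a small sketch. The sample string is a made-up stand-in for the real page (two definitions, no digits), so treat it as an assumption rather than what functions.html actually contains.

```python
import re

# Hypothetical stand-in for the page: two definitions and no digits anywhere.
sample = (
    "<dl><dt>abs(x)</dt><dd><p>Return the absolute value.</p></dd></dl>"
    "<dl><dt>all(iterable)</dt><dd><p>Return True if all elements are true.</p></dd></dl>"
)

# Every character of the closing tags is a nondigit, so the class accepts all of them.
closer_in_class = re.fullmatch(r"[\D3]+", "</dd></dl>") is not None

# The greedy pattern starts at the first <dd><p> and runs to the LAST </dd></dl>.
match = re.search(r"<dd><p>([\D3]+)</dd></dl>", sample)
runs_to_end = match.end() == len(sample)

print(closer_in_class)                  # True
print(runs_to_end)                      # True
print("</dd></dl>" in match.group(1))   # True: the first closer was swallowed
```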

To fix this, you can use the reluctant quantifier like so:

r'<dd><p>([\D3]+?)</dd></dl>'

(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it now tries to match [\D3]+? against just one character and then checks whether the rest of the regex matches. If not, it tries two characters, and so on.

Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
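A side-by-side comparison of the two patterns, as a rough sketch. The sample string is invented for illustration and is much simpler than the real page.

```python
import re

# Hypothetical sample: two definitions, no digits other than 3.
sample = (
    "<dl><dt>abs(x)</dt><dd><p>Return the absolute value.</p></dd></dl>"
    "<dl><dt>all(iterable)</dt><dd><p>Return True if all elements are true.</p></dd></dl>"
)

# Greedy: longest possible run of [\D3] that still lets the whole regex match.
greedy = re.findall(r"<dd><p>([\D3]+)</dd></dl>", sample)

# Reluctant: shortest possible run of [\D3] that still lets the whole regex match.
lazy = re.findall(r"<dd><p>([\D3]+?)</dd></dl>", sample)

print(len(greedy))  # 1
print(lazy)
# ['Return the absolute value.</p>', 'Return True if all elements are true.</p>']
```

The captured text still ends in </p> because the pattern only excludes </dd></dl>. Adding </p> before </dd> in the pattern would drop it, assuming every definition on the page closes that way.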

Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
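For completeness, here is roughly what a parser-based version might look like using the standard library's html.parser module. This is only a sketch: the class name DDTextCollector and the sample string are made up for illustration, and the real page may be structured in ways it does not handle.

```python
from html.parser import HTMLParser


class DDTextCollector(HTMLParser):
    """Collect the text found inside each top-level <dd> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <dd> elements we are currently inside
        self.items = []   # one string per top-level <dd>

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self.depth += 1
            if self.depth == 1:
                self.items.append("")

    def handle_endtag(self, tag):
        if tag == "dd" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested tags such as <p> still arrives here.
        if self.depth > 0:
            self.items[-1] += data


# Hypothetical sample standing in for functions.html.
sample = (
    "<dl><dt>abs(x)</dt><dd><p>Return the absolute value.</p></dd></dl>"
    "<dl><dt>all(iterable)</dt><dd><p>Return True if all elements are true.</p></dd></dl>"
)

collector = DDTextCollector()
collector.feed(sample)
collector.close()
print(collector.items)
# ['Return the absolute value.', 'Return True if all elements are true.']
```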

Ah thank you so much for the explanation! Very helpful! – Rebecca Noel Jun 15 '17 at 23:55

score 1 · Answer 2 · answered Jun 15 '17 at 22:23

By default, all quantifiers are greedy, which means they match as many characters as possible. You can put ? after a quantifier to make it lazy, so that it matches as few characters as possible. For example, \d+? matches at least one digit, but as few as possible.
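A quick illustration of the \d+ versus \d+? difference, using an invented input string:

```python
import re

text = "order 12345 shipped"  # made-up input for illustration

greedy_digits = re.search(r"\d+", text).group()    # as many digits as possible
lazy_digits = re.search(r"\d+?", text).group()     # as few digits as possible
lazy_until_5 = re.search(r"\d+?5", text).group()   # lazy, but must still reach a 5

print(greedy_digits)  # 12345
print(lazy_digits)    # 1
print(lazy_until_5)   # 12345
```

The last line shows that a lazy quantifier still grows when it has to: it takes the fewest characters that let the rest of the pattern match.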

Try r'<dd><p>([\D3]+?)</dd></dl>'
