How to best extract the following content in html string in python?

Question

Assuming I have the following string with line breaks:

<table>
<tr>
<td valign="top"><a href="ABext.html">House Exterior:</a></td><td>Round</td>
</tr>
<tr>
<td>EF</td><td><a href="AB.html">House AB</a></td></tr>
<tr>
<td valign="top">Settlement Date:</td>
<td valign="top">2/3/2013</td>
</tr>
</table>

What is the best way to build a simple Python dictionary from this? In particular, I want to extract the Settlement Date into a dict, or at least capture it with some kind of regex match.

NOTE: A sample using some utility is fine, but I am looking for a better way than holding this text in a variable and stepping through a long chain of .next.next.next.next.next until I finally reach the settlement date, which is why I posted this question in the first place.

Check first answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Ruben Bermudez, Apr 01 '14 at 01:27
Why have you chosen regexes? They are the wrong tool for the job of parsing HTML. Better tools exist like ... the built-in [HTMLParser](https://docs.python.org/2/library/htmlparser.html) and the third party [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) — TessellatingHeckler, Apr 01 '14 at 01:29

score 1 · Accepted Answer · answered Apr 01 '14 at 01:48

If the data is highly regular, then a regex isn't a bad choice. Here's a straightforward approach:

import re

# `data` is the HTML string from the question.
regex = re.compile(r'>Settlement Date:</td>[^>]*>([^<]*)')
match = regex.search(data)
print(match.group(1))
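
If the markup is less regular, or you want every row in a dict rather than one value, the built-in HTML parser mentioned in the comments is an option. What follows is only a rough sketch, not part of the original answer: the names `TableRowParser` and `table_to_dict` are made up for illustration, and it assumes Python 3 and that each row of interest has exactly two `<td>` cells (label, then value).

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Hypothetical helper: collects the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>; nested <a> text counts too.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_dict(html):
    """Map the first cell of each two-cell row (minus a trailing colon) to the second."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return {row[0].rstrip(":"): row[1] for row in parser.rows if len(row) == 2}


# The sample string from the question.
data = """<table>
<tr>
<td valign="top"><a href="ABext.html">House Exterior:</a></td><td>Round</td>
</tr>
<tr>
<td>EF</td><td><a href="AB.html">House AB</a></td></tr>
<tr>
<td valign="top">Settlement Date:</td>
<td valign="top">2/3/2013</td>
</tr>
</table>"""

fields = table_to_dict(data)
print(fields["Settlement Date"])  # prints 2/3/2013
```

This avoids both the chain of .next calls and a pattern tied to the exact whitespace and attributes, at the cost of more code than the one-line regex.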

Benji York
