Parsing block of text using Python

Question

I am writing a script that will parse a web page and store the results in MySQL.

Here is an example of HTML content returned that I need to parse:

<TH ALIGN=center COLSPAN=6 BGCOLOR="#C0C0C0"><FONT SIZE="-1">Monthly Totals</FONT></TH>    </TR>
<TR><TH ALIGN=center BGCOLOR="#00805c"><FONT SIZE="-1">Hits</FONT></TH>
<TH ALIGN=center BGCOLOR="#0040ff"><FONT SIZE="-1">Files</FONT></TH>
<TH ALIGN=center BGCOLOR="#00e0ff"><FONT SIZE="-1">Pages</FONT></TH>
<TH ALIGN=center BGCOLOR="#ffff00"><FONT SIZE="-1">Visits</FONT></TH>
<TH ALIGN=center BGCOLOR="#ff8000"><FONT SIZE="-1">Sites</FONT></TH>
<TH ALIGN=center BGCOLOR="#ff0000"><FONT SIZE="-1">KBytes</FONT></TH>
<TH ALIGN=center BGCOLOR="#ffff00"><FONT SIZE="-1">Visits</FONT></TH>
<TH ALIGN=center BGCOLOR="#00e0ff"><FONT SIZE="-1">Pages</FONT></TH>
<TH ALIGN=center BGCOLOR="#0040ff"><FONT SIZE="-1">Files</FONT></TH>
<TH ALIGN=center BGCOLOR="#00805c"><FONT SIZE="-1">Hits</FONT></TH></TR>
<TR><TH HEIGHT=4></TH></TR>
<TR><TD NOWRAP><A HREF="usage_201105.html"><FONT SIZE="-1">May 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2529721</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">582503</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">490365</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">23301</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">17720</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">145942234</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">279618</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">5884390</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">6990042</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">30356654</FONT></TD></TR>
<TR><TD NOWRAP><A HREF="usage_201104.html"><FONT SIZE="-1">Apr 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2246629</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">517645</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">483787</FONT></TD>

How do I adapt the following to follow carriage returns and so on:

stats = re.findall("Apr(.*)", content)

What exactly do you want? _"and so on"_ is too vague to understand well. – eyquem May 13 '11 at 21:18

score 6 · Answer 1 · edited May 23 '17 at 12:13

Use BeautifulSoup, not regular expressions, to parse the HTML (cf. this famous answer).
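As a rough illustration of that advice, here is a minimal sketch using the `bs4` package (the current BeautifulSoup distribution). The sample markup is a shortened copy of the rows in the question, wrapped in a `<TABLE>` that I am assuming exists in the real page. The function name `parse_monthly_rows` and the dictionary layout are my own choices, not something from the question.

```python
from bs4 import BeautifulSoup

# Shortened sample of the markup from the question.
# The enclosing <TABLE> is an assumption about the real page.
html = """<TABLE>
<TR><TH HEIGHT=4></TH></TR>
<TR><TD NOWRAP><A HREF="usage_201105.html"><FONT SIZE="-1">May 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2529721</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">582503</FONT></TD></TR>
<TR><TD NOWRAP><A HREF="usage_201104.html"><FONT SIZE="-1">Apr 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2246629</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">517645</FONT></TD></TR>
</TABLE>"""


def parse_monthly_rows(markup):
    """Map each month label to the list of integers in the same table row."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = {}
    for tr in soup.find_all("tr"):
        # Header rows only contain <TH> cells, so they yield no <TD> text.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) < 2:
            continue
        # First cell is the month label; the rest are assumed to be integers.
        rows[cells[0]] = [int(value) for value in cells[1:]]
    return rows


stats = parse_monthly_rows(html)
print(stats["Apr 2011"])  # [2246629, 517645]
```

Line breaks between the cells do not matter here, because the parser works on the tag structure rather than on individual lines. Each list could then be passed to a parameterized MySQL `INSERT`.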

answered May 13 '11 at 21:03 by Will McCutchen · edited May 23 '17 at 12:13 by Community

score 1 · Answer 2 · answered May 14 '11 at 04:29

Use lxml, not regular expressions, to parse the HTML. This is the same advice Will gave, just with a different tool. In my experience, lxml is significantly more powerful and robust than BeautifulSoup.
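For comparison, a minimal sketch of the same extraction with `lxml.html` might look like the following. The sample markup is again a shortened copy of the question's rows inside an assumed `<TABLE>`, and `parse_rows_lxml` is a hypothetical helper name.

```python
import lxml.html

# Shortened sample of the markup from the question.
# The enclosing <TABLE> is an assumption about the real page.
fragment = """<TABLE>
<TR><TH HEIGHT=4></TH></TR>
<TR><TD NOWRAP><A HREF="usage_201105.html"><FONT SIZE="-1">May 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2529721</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">582503</FONT></TD></TR>
<TR><TD NOWRAP><A HREF="usage_201104.html"><FONT SIZE="-1">Apr 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2246629</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">517645</FONT></TD></TR>
</TABLE>"""


def parse_rows_lxml(markup):
    """Map each month label to the list of integers in the same table row."""
    root = lxml.html.fromstring(markup)
    result = {}
    # The HTML parser lowercases tag names, so "tr" matches <TR>.
    for tr in root.iter("tr"):
        cells = [td.text_content().strip() for td in tr.findall("td")]
        if len(cells) < 2:
            continue  # header or spacer row without data cells
        # First cell is the month label; the rest are assumed to be integers.
        result[cells[0]] = [int(value) for value in cells[1:]]
    return result


monthly = parse_rows_lxml(fragment)
print(monthly["May 2011"])  # [2529721, 582503]
```

`text_content()` collects the text from nested tags such as `<FONT>` and `<A>`, so the wrapper markup around each number needs no special handling.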

answered May 14 '11 at 04:29 by Henry

Ah. I haven't used lxml's HTML parsing. Is it as forgiving of bad markup as BeautifulSoup is? I usually recommend that people start with BeautifulSoup because a) it is a self-contained Python file and b) it does a decent job of parsing badly broken HTML. – Will McCutchen May 16 '11 at 20:53
@Will lxml can actually be *better* with HTML soup. You can learn more here: http://lxml.de/elementsoup.html. They say it depends on the input; I say that in general, lxml performs better. – Henry May 16 '11 at 21:25
