
I am writing a script that parses a web page and stores the results in MySQL.

Here is an example of the HTML content that I need to parse:

<TH ALIGN=center COLSPAN=6 BGCOLOR="#C0C0C0"><FONT SIZE="-1">Monthly Totals</FONT></TH>    </TR>
<TR><TH ALIGN=center BGCOLOR="#00805c"><FONT SIZE="-1">Hits</FONT></TH>
<TH ALIGN=center BGCOLOR="#0040ff"><FONT SIZE="-1">Files</FONT></TH>
<TH ALIGN=center BGCOLOR="#00e0ff"><FONT SIZE="-1">Pages</FONT></TH>
<TH ALIGN=center BGCOLOR="#ffff00"><FONT SIZE="-1">Visits</FONT></TH>
<TH ALIGN=center BGCOLOR="#ff8000"><FONT SIZE="-1">Sites</FONT></TH>
<TH ALIGN=center BGCOLOR="#ff0000"><FONT SIZE="-1">KBytes</FONT></TH>
<TH ALIGN=center BGCOLOR="#ffff00"><FONT SIZE="-1">Visits</FONT></TH>
<TH ALIGN=center BGCOLOR="#00e0ff"><FONT SIZE="-1">Pages</FONT></TH>
<TH ALIGN=center BGCOLOR="#0040ff"><FONT SIZE="-1">Files</FONT></TH>
<TH ALIGN=center BGCOLOR="#00805c"><FONT SIZE="-1">Hits</FONT></TH></TR>
<TR><TH HEIGHT=4></TH></TR>
<TR><TD NOWRAP><A HREF="usage_201105.html"><FONT SIZE="-1">May 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2529721</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">582503</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">490365</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">23301</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">17720</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">145942234</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">279618</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">5884390</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">6990042</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">30356654</FONT></TD></TR>
<TR><TD NOWRAP><A HREF="usage_201104.html"><FONT SIZE="-1">Apr 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2246629</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">517645</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">483787</FONT></TD>

How do I adapt the following so that the match continues across line breaks (carriage returns and so on)?

stats = re.findall ("Apr(.*)",content) 
halfer
Cmag

2 Answers


Use BeautifulSoup, not regular expressions, to parse the HTML (cf. this famous answer).
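As a rough illustration of that suggestion, here is a minimal sketch that pulls the per-month numbers out of the table rows. It assumes the `bs4` package (BeautifulSoup 4) with Python's built-in `html.parser` backend, which may differ from the version available when this was asked. The `html` sample is trimmed from the question's markup and wrapped in a `<TABLE>` tag, and `monthly_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the markup in the question (assumption: the real
# page wraps these rows in a <TABLE> element).
html = """
<TABLE>
<TR><TH ALIGN=center><FONT SIZE="-1">Hits</FONT></TH></TR>
<TR><TD NOWRAP><A HREF="usage_201105.html"><FONT SIZE="-1">May 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2529721</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">582503</FONT></TD></TR>
<TR><TD NOWRAP><A HREF="usage_201104.html"><FONT SIZE="-1">Apr 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2246629</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">517645</FONT></TD></TR>
</TABLE>
"""


def monthly_rows(markup):
    """Map each month label to the list of integers in the same table row."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = {}
    for tr in soup.find_all("tr"):
        # Header rows only contain <TH> cells, so they yield no <TD> text.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) > 1:
            rows[cells[0]] = [int(value) for value in cells[1:]]
    return rows


stats = monthly_rows(html)
print(stats["Apr 2011"])
```

Because the parser works on the element tree rather than on raw text, line breaks between the `<TD>` cells do not matter, which is the problem the regular expression ran into.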

Community
Will McCutchen

Use lxml, not regular expressions, to parse the HTML. This is the same advice Will gave, but with a different preferred tool. In my experience, lxml is significantly more powerful and robust than BeautifulSoup.
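For comparison, a minimal sketch of the same extraction with lxml might look like the following. It assumes the `lxml` package is installed and uses its `lxml.html` module. The `html` sample is trimmed from the question's markup and wrapped in a `<TABLE>` tag, and `rows_by_month` is a hypothetical helper name.

```python
import lxml.html

# Trimmed sample based on the markup in the question (assumption: the real
# page wraps these rows in a <TABLE> element).
html = """
<TABLE>
<TR><TH ALIGN=center><FONT SIZE="-1">Hits</FONT></TH></TR>
<TR><TD NOWRAP><A HREF="usage_201105.html"><FONT SIZE="-1">May 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2529721</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">582503</FONT></TD></TR>
<TR><TD NOWRAP><A HREF="usage_201104.html"><FONT SIZE="-1">Apr 2011</FONT></A></TD>
<TD ALIGN=right><FONT SIZE="-1">2246629</FONT></TD>
<TD ALIGN=right><FONT SIZE="-1">517645</FONT></TD></TR>
</TABLE>
"""


def rows_by_month(markup):
    """Map each month label to the list of integers in the same table row."""
    doc = lxml.html.fromstring(markup)
    rows = {}
    # The HTML parser lowercases tag names, so "tr" matches <TR>.
    for tr in doc.iter("tr"):
        cells = [td.text_content().strip() for td in tr.findall("td")]
        if len(cells) > 1:
            rows[cells[0]] = [int(value) for value in cells[1:]]
    return rows


lxml_stats = rows_by_month(html)
print(lxml_stats["May 2011"])
```

The resulting dictionary can then be passed to whatever MySQL insert logic the script uses, one row per month.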

Henry
  • Ah. I haven't used lxml's HTML parsing... is it as forgiving of bad markup as BeautifulSoup is? I usually recommend people start with BeautifulSoup because a) it is a self-contained Python file and b) it does a decent job parsing badly broken HTML. – Will McCutchen May 16 '11 at 20:53
  • @Will lxml can actually be *better* with tag soup; you can learn more here: http://lxml.de/elementsoup.html. They say it depends on the input, but I say that, in general, lxml performs better. – Henry May 16 '11 at 21:25