0

I have some <tr>s, like this:

<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>

I want to extract the cell contents without the HTML tags, like this:

yangfanhit
3155
Accepted
344K
219MS
C++
3940B
2012-10-02 16:42:45

Now I'm using the following code to deal with it:

response = urllib2.urlopen('http://poj.org/status', timeout=10)
html = response.read()
response.close()

pattern = re.compile(r'<tr align.*</tr>')
match = pattern.findall(html)
pat = re.compile(r'<td>.*?</td>')
p = re.compile(r'<[/]?.*?>')
for item in match:
    for i in pat.findall(item):
        print p.sub(r'', i)
    print '================================================='

I'm new to both regex and Python. Could you suggest a better way to process this?

abcdabcd987
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Chinmay Kanchi Oct 02 '12 at 12:38
  • Don't parse HTML with RegEx. Tony the Pony will eat you alive. Please use a proper parser instead. lxml comes built in to Python. – Chinmay Kanchi Oct 02 '12 at 12:40
  • possible duplicate of [Strip html from strings in python](http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python) – ekhumoro Dec 03 '12 at 00:27

4 Answers

1

You could use BeautifulSoup to parse the HTML. For example, to write the contents of the table in CSV format:

#!/usr/bin/env python
import csv
import sys
import urllib2
from bs4 import BeautifulSoup # $ pip install beautifulsoup4

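# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(urllib2.urlopen('http://poj.org/status'), 'html.parser')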

writer = csv.writer(sys.stdout)
for tr in soup.find('table', 'a')('tr'):
    writer.writerow([td.get_text() for td in tr('td')])

Output

Run ID,User,Problem,Result,Memory,Time,Language,Code Length,Submit Time
10876151,yangfanhit,3155,Accepted,344K,219MS,C++,3940B,2012-10-02 16:42:45
10876150,BandBandRock,2503,Accepted,16348K,2750MS,G++,840B,2012-10-02 16:42:25
jfs
1

Also take a look at PyQuery. It is very easy to pick up if you're familiar with jQuery. Here's an example that returns the table header and data as a list of dictionaries.

from pyquery import PyQuery as pq

# parse html
html = pq(url="http://poj.org/status")

# extract header values from the header row (class "in") of the table (class "a")
header = [pq(td).text() for td in html(".a").find(".in").find("td")]

# extract data values from the remaining rows as a nested list;
# pq(td).text() also picks up text inside nested <a>/<font> tags,
# whereas td.text is None for such cells
detail = [[pq(td).text() for td in tr] for tr in html(".a").children().not_(".in")]

# merge header and detail to create a list of dictionaries
result = [dict(zip(header, values)) for values in detail]
Bryan
0

You really don't need to work with regex directly to parse HTML; see the answer here.

Or see Dive into Python Chapter 8 about HTML Processing.
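
For reference, here is a minimal sketch of the parser-based approach using only the standard library. It is written for Python 3 (`html.parser`; in Python 2 the module is named `HTMLParser`), and the names `RowTextParser` and `extract_rows` are made up for this illustration. It assumes the rows look like the ones in the question and is not meant as a general-purpose table scraper.

```python
from html.parser import HTMLParser  # Python 3; the Python 2 module is HTMLParser


class RowTextParser(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (<a>, <font>) also arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


def extract_rows(html):
    """Return a list of rows, each a list of cell texts."""
    parser = RowTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# One row copied from the question, split across lines for readability.
sample = (
    '<tr align=center><td>10876151</td>'
    '<td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td>'
    '<td><a href=problem?id=3155>3155</a></td>'
    '<td><font color=blue>Accepted</font></td>'
    '<td>344K</td><td>219MS</td><td>C++</td><td>3940B</td>'
    '<td>2012-10-02 16:42:45</td></tr>'
)
rows = extract_rows(sample)
for row in rows:
    print(', '.join(row))
```

The parser tracks only `tr` and `td` boundaries, so any markup nested inside a cell is ignored and just its text is kept.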

oz123
0

Why do this by hand when there are already HTML/XML parsers that do the job easily for you?

Use BeautifulSoup. What you describe in the question can be done in 2-3 lines of code.

Example:

>>> from bs4 import BeautifulSoup as bs
>>> html = """
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
"""

>>> soup = bs(html, 'html.parser')
>>> soup.td
<td>10876151</td>
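
To get from a single `td` to the plain-text listing asked for in the question, something along these lines should work. It is a sketch that assumes `beautifulsoup4` is installed and uses the two sample rows from the question; the variable names `rows` and `cells` are only illustrative.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# The two sample rows from the question.
html = """
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
"""

soup = BeautifulSoup(html, 'html.parser')

# One list of cell texts per row; get_text() flattens nested <a>/<font> tags.
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in soup.find_all('tr')]

for cells in rows:
    # Skip the first cell (the run ID), as in the desired output of the question.
    print('\n'.join(cells[1:]))
    print('=' * 49)
```

Keeping the cells as a list per row, rather than printing them directly, makes it easy to pick individual columns later.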
Surya