How to get the content from certain <tr> elements using Python?

Asked Oct 02 '12 at 12:31

Active Oct 02 '12 at 15:45

Viewed 1,247 times

python regex

asked Oct 02 '12 at 12:31
abcdabcd987

Accepted answer by jfs, answered Oct 02 '12 at 12:51

Question

I have some <tr>s, like this:

<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>

I want to fetch the content without html tags, like:

yangfanhit
3155
Accepted
344K
219MS
C++
3940B
2012-10-02 16:42:45

Now I'm using the following code to deal with it:

import re
import urllib2

response = urllib2.urlopen('http://poj.org/status', timeout=10)
html = response.read()
response.close()

pattern = re.compile(r'<tr align.*</tr>')
match = pattern.findall(html)
pat = re.compile(r'<td>.*?</td>')
p = re.compile(r'<[/]?.*?>')
for item in match:
    for i in pat.findall(item):
        print p.sub(r'', i)
    print '================================================='

I'm new to both regex and Python. Could you suggest a better way to process this?
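One likely weak spot in the regex version is that `.*` is greedy: if several rows ever arrive on a single line, `<tr align.*</tr>` swallows them all as one match. Here is a minimal sketch of the difference, assuming Python 3 (the question uses Python 2) and a shortened, made-up two-row sample rather than the live page:

```python
import re

# Shortened, hypothetical sample: two rows on ONE line.
html = ('<tr align=center><td>10876151</td><td>Accepted</td></tr>'
        '<tr align=center><td>10876150</td><td>Accepted</td></tr>')

greedy = re.findall(r'<tr align.*</tr>', html)   # .* runs on to the LAST </tr>
lazy = re.findall(r'<tr align.*?</tr>', html)    # .*? stops at the FIRST </tr>

print(len(greedy))  # 1
print(len(lazy))    # 2

# Strip the tags from each cell of each lazily matched row.
rows = [[re.sub(r'<[^>]+>', '', cell) for cell in re.findall(r'<td>.*?</td>', row)]
        for row in lazy]
print(rows)
```

This only illustrates the greedy/lazy pitfall; it is not a recommendation to keep parsing HTML with regular expressions.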

possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Chinmay Kanchi, Oct 02 '12 at 12:38
Don't parse HTML with RegEx. Tony the Pony will eat you alive. Please use a proper parser instead. lxml is a good option (it is a third-party package that has to be installed separately). – Chinmay Kanchi, Oct 02 '12 at 12:40
possible duplicate of [Strip html from strings in python](http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python) – ekhumoro, Dec 03 '12 at 00:27
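As a rough illustration of the "use a proper parser" advice with nothing to install, here is a sketch built on the standard library's `html.parser` module. It assumes Python 3 (the thread itself uses Python 2), and `TableTextParser` is a hypothetical helper name, not something from the thread. The sample row is copied from the question.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the current row, or None outside <tr>
        self._cell = None   # text pieces of the current cell, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> or <font> also lands here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


html = ('<tr align=center><td>10876151</td>'
        '<td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td>'
        '<td><a href=problem?id=3155>3155</a></td>'
        '<td><font color=blue>Accepted</font></td>'
        '<td>344K</td><td>219MS</td><td>C++</td><td>3940B</td>'
        '<td>2012-10-02 16:42:45</td></tr>')

parser = TableTextParser()
parser.feed(html)
parser.close()
for row in parser.rows:
    print(' '.join(row[1:]))  # skip the run ID, as in the question's desired output
```

A dedicated library such as BeautifulSoup or lxml is shorter to use; the point here is only that even the built-in parser copes with the unquoted attributes and nested tags that trip up the regex approach.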

score 1 · Accepted Answer · answered Oct 02 '12 at 12:51

You could use BeautifulSoup to parse the HTML. To write the content of the table in CSV format:

#!/usr/bin/env python
import csv
import sys
import urllib2
from bs4 import BeautifulSoup # $ pip install beautifulsoup4

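# Fetch the page and parse it (Python 2; urllib2 does not exist in Python 3)
soup = BeautifulSoup(urllib2.urlopen('http://poj.org/status'))

writer = csv.writer(sys.stdout)
# find('table', 'a') selects the first <table class="a">;
# calling a tag, as in tr('td'), is shorthand for find_all()
for tr in soup.find('table', 'a')('tr'):
    writer.writerow([td.get_text() for td in tr('td')])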

Output

Run ID,User,Problem,Result,Memory,Time,Language,Code Length,Submit Time
10876151,yangfanhit,3155,Accepted,344K,219MS,C++,3940B,2012-10-02 16:42:45
10876150,BandBandRock,2503,Accepted,16348K,2750MS,G++,840B,2012-10-02 16:42:25
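To look at just the CSV-writing half of this answer without a network call or a third-party parser, here is a small sketch, assuming Python 3. The rows are copied from the output above rather than scraped, and the in-memory buffer stands in for `sys.stdout`.

```python
import csv
import io

# Rows copied from the output above; in the answer they come from the parsed table.
rows = [
    ['Run ID', 'User', 'Problem', 'Result', 'Memory', 'Time',
     'Language', 'Code Length', 'Submit Time'],
    ['10876151', 'yangfanhit', '3155', 'Accepted', '344K', '219MS',
     'C++', '3940B', '2012-10-02 16:42:45'],
]

buffer = io.StringIO()
# csv.writer ends rows with '\r\n' by default; '\n' keeps the output easy to compare.
writer = csv.writer(buffer, lineterminator='\n')
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text, end='')
```

Using `csv.writer` instead of `','.join(...)` means a field that happens to contain a comma or a quote gets quoted correctly.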

Bryan · Answer 2 · answered Oct 02 '12 at 14:02 · edited Oct 02 '12 at 15:45

Also take a look at PyQuery. It is very easy to pick up if you're familiar with jQuery. Here's an example that returns the table header and data as a list of dictionaries.

import itertools
from pyquery import PyQuery as pq

# parse html
html = pq(url="http://poj.org/status")

# extract header values from table
header = [header.text for header in html(".a").find(".in").find("td")]

# extract data values from table rows in nested list
detail = [[td.text for td in tr] for tr in html(".a").children().not_(".in")]

# merge header and detail to create list of dictionaries
result = [dict(itertools.izip(header, values)) for values in detail]
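The last step, merging the header with each row, can be tried on its own. `itertools.izip` is Python 2 only, so this sketch assumes Python 3 and uses the built-in `zip` instead; the header and row are copied from the sample output shown earlier in the thread rather than fetched with PyQuery.

```python
# Header and one data row, copied from the sample output earlier in the thread.
header = ['Run ID', 'User', 'Problem', 'Result', 'Memory', 'Time',
          'Language', 'Code Length', 'Submit Time']
detail = [
    ['10876151', 'yangfanhit', '3155', 'Accepted', '344K', '219MS',
     'C++', '3940B', '2012-10-02 16:42:45'],
]

# Pair each header name with the value in the same column, one dict per row.
result = [dict(zip(header, values)) for values in detail]
print(result[0]['User'], result[0]['Result'])
```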

score 0 · Answer 3 · edited May 23 '17 at 12:18

You really don't need to work with regex directly to parse HTML; see the answer here.

Alternatively, see Chapter 8 of Dive Into Python, which covers HTML processing.

answered Oct 02 '12 at 12:37

oz123

Surya · Answer 4 · answered Oct 02 '12 at 12:47 · edited Oct 02 '12 at 12:53

Why do all of this by hand when there are HTML/XML parsers that do the job easily for you?

Use BeautifulSoup. Given what you describe in the question, it can be done in 2-3 lines of code.

Example:

>>> from bs4 import BeautifulSoup as bs
>>> html = """
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
"""

>>> soup = bs(html)
>>> soup.td
<td>10876151</td>