BeautifulSoup HTML table parsing

Question

I am trying to parse information (html tables) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1

Currently I am using BeautifulSoup, and the code I have looks like this:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)

table = soup.find("table")

rows = table.findAll('tr')[3]

cols = rows.findAll('td')

roadtype = cols[0].string
start = cols[1].string
end = cols[2].string
condition = cols[3].string
reason = cols[4].string
update = cols[5].string

entry = (roadtype, start, end, condition, reason, update)

print entry

The issue is with the start and end columns: they just get printed as "None".

Output:

(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

I know that they get stored in the cols list, but it seems that the extra link tag is messing up the parsing. The original HTML looks like this:

<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>

So what should be printed is:

(u'Rt. 613N (Giles County)', u'Big Stony Ck Rd; Rt. 635E/W (Giles County)', u'Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)', u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

Any suggestions or help is appreciated, and thank you in advance.

You don't have to use Beautiful Soup for that. You could use python3 htmlparser: https://github.com/schmijos/html-table-parser-python3/blob/master/html_table_parser/parser.py — schmijos, Mar 11 '14 at 08:18
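
For illustration, here is a minimal sketch of the standard-library approach that comment points at. It is not the code from the linked project; the class name TableTextParser and the shortened sample markup are my own assumptions, and it ignores nested tables. It collects the text of each cell, including text inside nested tags such as <a>:

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Hypothetical helper: collect cell text, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> also arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened sample row modelled on the HTML in the question (href simplified).
sample = (
    '<table><tr>'
    '<td headers="road-type">Rt. 613N (Giles County)</td>'
    '<td headers="start"><a href="conditions.aspx#viewmap">'
    'Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>'
    '<td headers="condition">Moderate</td>'
    '</tr></table>'
)

parser = TableTextParser()
parser.feed(sample)
parser.close()
print(parser.rows)
```

Because the text is gathered in handle_data, the link tag inside the start cell does not get in the way.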

Antony Hatchkins · Answer 1 · 2012-04-03T06:39:26.450

33

start = cols[1].find('a').string

or simpler

start = cols[1].a.string

or better

start = str(cols[1].find(text=True))

and

entry = [str(col.find(text=True)) for col in cols]

edited Apr 03 '12 at 06:39

answered Jan 13 '10 at 18:56

I went with the str(cols...) method. Thank you. – Stephen Tanner Jan 14 '10 at 16:19

Welcome ) It'd be good if you accepted an answer if you find it helpful – Antony Hatchkins Jan 14 '10 at 17:05

I agree, @Stephon Tanner pls return and accept this as an answer – Neil Apr 11 '11 at 08:50

score 2 · Answer 2 · answered Jan 18 '14 at 14:05

I tried to reproduce your error, but the source HTML page has changed since then.

I did run into a similar problem while trying to reproduce the example here, with the proposed URL swapped for a Wikipedia table.

I fixed it by moving to BeautifulSoup4:

from bs4 import BeautifulSoup

and replacing .string with .get_text():

start = cols[1].get_text()

I couldn't test this with your example (as I said, I couldn't reproduce the error), but I think it could be useful for people who are looking for a solution to this problem.
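
As a rough sketch of that approach applied to the row quoted in the question: it assumes the bs4 package is installed and uses the built-in "html.parser" backend. The surrounding table and tr tags are added here only to make the snippet self-contained, and it runs against the pasted HTML rather than the live site.

```python
from bs4 import BeautifulSoup

# Row copied from the question, wrapped in <table><tr> for this sketch.
html = """
<table><tr>
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# get_text() joins all text under the cell, so the nested <a> is included.
entry = tuple(td.get_text(strip=True) for td in row.find_all("td"))
print(entry)
```

Here get_text(strip=True) is used so that stray whitespace around the cell text is dropped; plain get_text() works as well if the whitespace does not matter.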
