Here is what I am trying to scrape (shortened a lot to make it easier to read):
<table class="sortable row_summable stats_table" id="per_game">
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
<tr class="">
<th data-stat="season" align="center" class="tooltip sort_default_asc" tip="If listed as single number, the year the season ended.<br>★ - Indicates All-Star for league.<br>Only on regular season tables.">Season</th>
<th data-stat="age" align="center" class="tooltip sort_default_asc" tip="Age of Player at the start of February 1st of that season.">Age</th>
</tr>
</thead>
<tbody>
<tr class="full_table" id="per_game.2009">
<td align="left" ><a href="/players/r/rondora01/gamelog/2009/">2008-09</a></td>
<td align="right" >22</td>
</tr>
<tr class="full_table" id="per_game.2010">
<td align="left" ><a href="/players/r/rondora01/gamelog/2010/">2009-10</a><span class="bold_text" style="color:#c0c0c0"> ★</span></td>
<td align="right" >23</td>
</tr>
</tbody>
</table>
And here is the code that I am using:
from bs4 import BeautifulSoup
import requests
import mechanize
from mechanize import Browser
import csv
mech = Browser()
url = "http://www.basketball-reference.com/players/r/rondora01.html"
# url = "http://www.basketball-reference.com/players/r/rosede01.html"
RR = mech.open(url)
html = RR.read()
soup = BeautifulSoup(html)
table = soup.find(id="per_game")
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')
    season = col[0].string
    age = col[1].string
    team = col[2].string
    pos = col[3].string
    games_played = col[4].string
    record = (season, age, team, pos, games_played)
    print "|".join(record)
However, note that in the HTML the second row, unlike the first, has an additional span in the season cell. It creates a little star. My code runs fine up until any row that has that additional element, and then crashes. Any thoughts on making the code flexible enough to ignore the extra span chunk?
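
For context on the failure: in BeautifulSoup, `.string` returns `None` when a tag has more than one child, which is the case for the `<td>` that holds both the link and the star `<span>`, so `"|".join(record)` raises a `TypeError`. A minimal sketch of one possible way around this, assuming only the link text is wanted in such cells, is to read the text with `get_text()` instead. The helper names `cell_text` and `parse_rows` and the inline `HTML` sample are illustrative and not part of the original script; the sketch parses a trimmed copy of the table rather than fetching the live page, and uses `print()` so it runs on Python 3.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (assumption: the real page
# has more columns, which would be handled the same way).
HTML = """
<table id="per_game">
<thead><tr><th>Season</th><th>Age</th></tr></thead>
<tbody>
<tr><td><a href="/players/r/rondora01/gamelog/2009/">2008-09</a></td><td>22</td></tr>
<tr><td><a href="/players/r/rondora01/gamelog/2010/">2009-10</a><span class="bold_text"> ★</span></td><td>23</td></tr>
</tbody>
</table>
"""


def cell_text(td):
    """Return the link text if the cell contains a link, else all text in the cell."""
    link = td.find("a")
    node = link if link is not None else td
    # get_text() concatenates every string under the node, so it never returns None.
    return node.get_text(strip=True)


def parse_rows(html):
    """Return one tuple of cell strings per data row of the per_game table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="per_game")
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if not cells:  # header rows contain only <th> cells
            continue
        records.append(tuple(cell_text(td) for td in cells))
    return records


records = parse_rows(HTML)
for record in records:
    print("|".join(record))
```

Taking the text of the `<a>` tag drops the star, because the star lives in the sibling `<span>`. Calling `get_text()` on the whole `<td>` would instead keep it as part of the season string. Skipping rows with no `<td>` cells replaces the `[1:]` slice, so extra header rows do not need to be counted.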