
Here is what I am trying to scrape (shortened a ton to make it easily read):

<table class="sortable  row_summable stats_table" id="per_game">
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
<tr class="">
  <th data-stat="season" align="center"  class="tooltip sort_default_asc"  tip="If listed as single number, the year the season ended.<br>&#x2605; - Indicates All-Star for league.<br>Only on regular season tables.">Season</th>
  <th data-stat="age" align="center"  class="tooltip sort_default_asc"  tip="Age of Player at the start of February 1st of that season.">Age</th>
</tr>
</thead>
<tbody>
<tr  class="full_table" id="per_game.2009">
   <td align="left" ><a href="/players/r/rondora01/gamelog/2009/">2008-09</a></td>
   <td align="right" >22</td>
</tr>
<tr  class="full_table" id="per_game.2010">
   <td align="left" ><a href="/players/r/rondora01/gamelog/2010/">2009-10</a><span class="bold_text" style="color:#c0c0c0">&nbsp;&#x2605;</span></td>
   <td align="right" >23</td>
</tr>
</tfoot>
</table>

And here is the code that I am using:

from bs4 import BeautifulSoup
import requests
import mechanize
from mechanize import Browser
import csv

mech = Browser()
url = "http://www.basketball-reference.com/players/r/rondora01.html"
# url = "http://www.basketball-reference.com/players/r/rosede01.html"
RR = mech.open(url)

html = RR.read()
soup = BeautifulSoup(html)
table = soup.find(id="per_game")

for row in table.findAll('tr')[1:]: 
    col = row.findAll('td')
    season = col[0].string
    age = col[1].string
    team = col[2].string
    pos = col[3].string
    games_played = col[4].string
    record = (season, age, team, pos, games_played)
    print "|".join(record)

However, note in the HTML that the second row, compared to the first, has an additional span for the season; it creates a little star. My code runs fine up until any row that has that additional element, and then crashes. Thoughts on making the code flexible enough to ignore the extra span chunk?

Craig

2 Answers


I would suggest a couple of changes:

  1. Since you are interested in the text associated with the <a> element, change col[0].string to col[0].a.string. That takes care of the problem (see the sketch after this list for why .string comes up empty on those rows).

  2. After the first issue is fixed, you will hit an error at the last line of that table, since it is structured differently. To fix that, change for row in table.findAll('tr')[1:]: to for row in table.findAll('tr')[1:-1]: so that the last line is skipped.
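The reason the original line fails: in BeautifulSoup, tag.string is None whenever the tag has more than one child, which is exactly what the extra <span> causes, and "|".join() then raises a TypeError on the None. A minimal, standalone sketch of that behaviour (the HTML here is just a trimmed-down copy of one All-Star row):

from bs4 import BeautifulSoup

snippet = '<table><tr><td><a href="#">2009-10</a><span>&#x2605;</span></td></tr></table>'
cell = BeautifulSoup(snippet).find('td')
print cell.string    # None -- the cell has two children (<a> and <span>)
print cell.a.string  # 2009-10 -- the <a> has a single text child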

Making the above changes:

for row in table.findAll('tr')[1:-1]: 
    col = row.findAll('td')
    season = col[0].a.string
    age = col[1].string
    team = col[2].string
    pos = col[3].string
    games_played = col[4].string
    record = (season, age, team, pos, games_played)
    print "|".join(record)

prints:

2006-07|20|BOS|NBA|PG
2007-08|21|BOS|NBA|PG
2008-09|22|BOS|NBA|PG
2009-10|23|BOS|NBA|PG
2010-11|24|BOS|NBA|PG
2011-12|25|BOS|NBA|PG
2012-13|26|BOS|NBA|PG
2013-14|27|BOS|NBA|PG
shaktimaan

You can improve the code by first reading all of the headers into a list, then reading the parameters row by row and using zip() to match each header with its value, building a dictionary per row:

headers = [item.text for item in table('th')]
for row in table('tr')[1:]:
    params = [item.text.strip() for item in row('td')]
    print dict(zip(headers, params))

Prints:

{u'Lg': u'NBA', u'FT': u'1.5', u'3P': u'0.1', u'TOV': u'1.8', u'2PA': u'5.4', u'Tm': u'BOS', u'FG': u'2.4', u'3PA': u'0.4', u'DRB': u'2.8', u'2P': u'2.3', u'AST': u'3.8', u'Season': u'2006-07', u'FT%': u'.647', u'PF': u'2.3', u'PTS': u'6.4', u'FGA': u'5.8', u'GS': u'25', u'G': u'78', u'STL': u'1.6', u'Age': u'20', u'TRB': u'3.7', u'FTA': u'2.4', u'BLK': u'0.1', u'FG%': u'.418', u'Pos': u'PG', u'2P%': u'.432', u'MP': u'23.5', u'ORB': u'0.9', u'3P%': u'.207'}
{u'Lg': u'NBA', u'FT': u'1.4', u'3P': u'0.1', u'TOV': u'1.9', u'2PA': u'9.0', u'Tm': u'BOS', u'FG': u'4.6', u'3PA': u'0.2', u'DRB': u'3.2', u'2P': u'4.5', u'AST': u'5.1', u'Season': u'2007-08', u'FT%': u'.611', u'PF': u'2.4', u'PTS': u'10.6', u'FGA': u'9.3', u'GS': u'77', u'G': u'77', u'STL': u'1.7', u'Age': u'21', u'TRB': u'4.2', u'FTA': u'2.3', u'BLK': u'0.2', u'FG%': u'.492', u'Pos': u'PG', u'2P%': u'.499', u'MP': u'29.9', u'ORB': u'1.0', u'3P%': u'.263'}
{u'Lg': u'NBA', u'FT': u'2.2', u'3P': u'0.2', u'TOV': u'2.6', u'2PA': u'8.9', u'Tm': u'BOS', u'FG': u'4.8', u'3PA': u'0.6', u'DRB': u'4.0', u'2P': u'4.6', u'AST': u'8.2', u'Season': u'2008-09', u'FT%': u'.642', u'PF': u'2.4', u'PTS': u'11.9', u'FGA': u'9.5', u'GS': u'80', u'G': u'80', u'STL': u'1.9', u'Age': u'22', u'TRB': u'5.2', u'FTA': u'3.4', u'BLK': u'0.1', u'FG%': u'.505', u'Pos': u'PG', u'2P%': u'.518', u'MP': u'33.0', u'ORB': u'1.3', u'3P%': u'.313'}
{u'Lg': u'NBA', u'FT': u'2.2', u'3P': u'0.2', u'TOV': u'3.0', u'2PA': u'10.2', u'Tm': u'BOS', u'FG': u'5.7', u'3PA': u'1.0', u'DRB': u'3.2', u'2P': u'5.5', u'AST': u'9.8', u'Season': u'2009-10\xa0\u2605', u'FT%': u'.621', u'PF': u'2.4', u'PTS': u'13.7', u'FGA': u'11.2', u'GS': u'81', u'G': u'81', u'STL': u'2.3', u'Age': u'23', u'TRB': u'4.4', u'FTA': u'3.5', u'BLK': u'0.1', u'FG%': u'.508', u'Pos': u'PG', u'2P%': u'.536', u'MP': u'36.6', u'ORB': u'1.2', u'3P%': u'.213'}
{u'Lg': u'NBA', u'FT': u'1.1', u'3P': u'0.1', u'TOV': u'3.4', u'2PA': u'9.2', u'Tm': u'BOS', u'FG': u'4.7', u'3PA': u'0.6', u'DRB': u'3.1', u'2P': u'4.5', u'AST': u'11.2', u'Season': u'2010-11\xa0\u2605', u'FT%': u'.568', u'PF': u'1.8', u'PTS': u'10.6', u'FGA': u'9.9', u'GS': u'68', u'G': u'68', u'STL': u'2.3', u'Age': u'24', u'TRB': u'4.4', u'FTA': u'1.9', u'BLK': u'0.2', u'FG%': u'.475', u'Pos': u'PG', u'2P%': u'.491', u'MP': u'37.2', u'ORB': u'1.3', u'3P%': u'.233'}
...

If you want to strip the characters that are not ASCII-printable (the non-breaking space and the star) from the parameter values, you can rely on string.printable:

import string

params = [filter(lambda x: x in string.printable, item.text) 
          for item in row.find_all('td')] 
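For example, applied to the All-Star season value seen in the dict output above, the filter drops both the non-breaking space and the star:

import string

season = u'2009-10\xa0\u2605'  # value taken from the output above
print filter(lambda x: x in string.printable, season)  # prints: 2009-10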

See also: Stripping non printable characters from a string in python


Complete code that outputs to csv (with player name):

import csv
import string
from bs4 import BeautifulSoup
from mechanize import Browser

mech = Browser()
url = "http://www.basketball-reference.com/players/r/rondora01.html"
RR = mech.open(url)

html = RR.read()
soup = BeautifulSoup(html)
table = soup.find(id="per_game")
player_name = soup.select('div#info_box h1')[0].text.strip()

with open('result.csv', 'w') as f:
    writer = csv.writer(f)

    writer.writerow(['Name'] + [item.text for item in table('th')])

    for row in table('tr')[1:]:
        writer.writerow([player_name] + [filter(lambda x: x in string.printable, item.text)
                                         for item in row('td')])
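
And since the ultimate goal (per the comments below) is to do this for many players stored in a single csv, here is a minimal sketch of the same code reorganised into a loop; the second URL is just the one commented out in the question, swap in whichever players you need. It assumes every per_game table shares the same column headers, so the header row is written only once:

import csv
import string
from bs4 import BeautifulSoup
from mechanize import Browser

# example player pages -- swap in whichever players you need
urls = [
    "http://www.basketball-reference.com/players/r/rondora01.html",
    "http://www.basketball-reference.com/players/r/rosede01.html",
]

mech = Browser()

with open('result.csv', 'w') as f:
    writer = csv.writer(f)
    header_written = False

    for url in urls:
        soup = BeautifulSoup(mech.open(url).read())
        table = soup.find(id="per_game")
        player_name = soup.select('div#info_box h1')[0].text.strip()

        # write the header row only once, taken from the first player's table
        if not header_written:
            writer.writerow(['Name'] + [item.text for item in table('th')])
            header_written = True

        for row in table('tr')[1:]:
            writer.writerow([player_name] + [filter(lambda x: x in string.printable, item.text)
                                             for item in row('td')])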
alecxe
  • Thanks @alecxe. Perfect answer. One additional add-on -- is it then possible to, instead of printing, save them to a CSV file? Or, better yet, add them to a preexisting CSV that has the same header titles? – Craig Aug 27 '14 at 21:44
  • @Craig sure, see the updated answer. Hope that helps. – alecxe Aug 27 '14 at 21:53
  • Great, that's incredibly helpful. Is there an easy way to manually add a column to the csv before it prints out? Say I want to add "Player Name" as a heading with "Rajon Rondo" for all rows that were scraped from this table. The ultimate goal is to do this for many players, stored in a single csv. Thanks again, Alec – Craig Aug 27 '14 at 21:56
  • @Craig sure, then get the player name out of the page too - I've improved the complete code at the end of the answer, check if it helps. Thanks. – alecxe Aug 28 '14 at 13:58