
I'm having issues with scraping pro-football-reference.com. I'm trying to access the "Team Offense" table but can't seem to target the div/table. The best I can do is:

soup.find('div', {'id':'all_team_stats'})

which returns neither the table nor its immediate div wrapper. The following attempts both return None:

soup.find('div', {'id':'div_team_stats'})
soup.find('table', {'id':'team_stats'})

I've already scraped different pages simply by:

soup.find('table', {'id':'table_id'})

but I can't figure out why it's not working on this page. Below is the code I've been working with. Any help is much appreciated!

from bs4 import BeautifulSoup
import urllib2

def make_soup(url):
    page = urllib2.urlopen(url)
    soupdata = BeautifulSoup(page, 'lxml')
    return soupdata

def get_player_totals():
    soup = make_soup("http://www.pro-football-reference.com/years/2015/")

    tableStats = soup.find('table', {'id':'team_stats'})

    return tableStats

print get_player_totals()

EDIT:

Thanks for all the help everyone. Both of the provided solutions below have been successful. Much appreciated!

James Lim

2 Answers


Just remove the comment markers with re.sub before you pass the HTML to BeautifulSoup:

from bs4 import BeautifulSoup
import urllib2
import re

# Strip the HTML comment markers so the commented-out tables become part
# of the parsed document.
comm = re.compile("<!--|-->")

def make_soup(url):
    page = urllib2.urlopen(url)
    soupdata = BeautifulSoup(comm.sub("", page.read()), 'lxml')
    return soupdata

def get_player_totals():
    soup = make_soup("http://www.pro-football-reference.com/years/2015/")

    tableStats = soup.find('table', {'id':'team_stats'})

    return tableStats

print get_player_totals()

You will see the table when you run the code.
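If you then want the individual rows rather than the whole tag, one possible follow-up (just a sketch) is to walk the rows of the table returned by get_player_totals():

# Sketch only: iterate over the recovered table's rows and print each cell's text.
table = get_player_totals()
for row in table.find_all('tr'):
    cells = [cell.get_text() for cell in row.find_all(['th', 'td'])]
    print cells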

Padraic Cunningham

A couple of thoughts here: first, as mentioned in the comments, the actual table is commented out and is not, strictly speaking, part of the DOM, so the parser cannot reach it directly.
In this situation you can loop over the comments found and try to extract the information via regular expressions (though parsing HTML with regex is heavily discussed and mostly disliked on Stack Overflow; see here for more information). Last but not least, I would recommend requests rather than urllib2.

That being said, here is a working code example:

from bs4 import BeautifulSoup, Comment
import requests, re

def make_soup(url):
    r = requests.get(url)
    soupdata = BeautifulSoup(r.text, 'lxml')
    return soupdata

soup = make_soup("http://www.pro-football-reference.com/years/2015/")

# get the comments
comments = soup.findAll(text=lambda text:isinstance(text, Comment))

# look for the table with the id "team_stats" inside each comment
rx = re.compile(r'<table.+?id="team_stats".+?>[\s\S]+?</table>')
for comment in comments:
    match = rx.search(comment)
    if match:
        table = match.group(0)
        print(table)
        # break the loop once the table is found
        break

Always favour a parser-based solution over this one; the regex part in particular is very error-prone.
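For completeness, a parser-based variant of the same comment loop might look like the sketch below: re-parse each comment with BeautifulSoup and let find do the matching instead of a regular expression.

# Sketch: parse each comment's HTML and look for the table by its id,
# avoiding the regular expression entirely.
for comment in comments:
    comment_soup = BeautifulSoup(comment, 'lxml')
    table = comment_soup.find('table', {'id': 'team_stats'})
    if table is not None:
        print(table)
        break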

Jan