
I'm having issues with scraping pro-football-reference.com. I'm trying to access the "Team Offense" table but can't seem to target the div/table. The best I can do is:

soup.find('div', {'id':'all_team_stats'})

which returns neither the table nor its immediate div wrapper. The following attempts both return None:

soup.find('div', {'id':'div_team_stats'})
soup.find('table', {'id':'team_stats'})

I've already scraped different pages simply by:

soup.find('table', {'id':'table_id'})

but I can't figure out why it's not working on this page. Below is the code I've been working with. Any help is much appreciated!

from bs4 import BeautifulSoup
import urllib2

def make_soup(url):
    page = urllib2.urlopen(url)
    soupdata = BeautifulSoup(page, 'lxml')
    return soupdata

def get_player_totals():
    soup = make_soup("http://www.pro-football-reference.com/years/2015/")

    tableStats = soup.find('table', {'id':'team_stats'})

    return tableStats

print get_player_totals()

EDIT:

Thanks for all the help everyone. Both of the provided solutions below have been successful. Much appreciated!

James Lim

2 Answers


Just remove the comment markers with re.sub before you pass the HTML to BeautifulSoup:

from bs4 import BeautifulSoup
import urllib2
import re

# Strip the HTML comment markers so the commented-out tables become part
# of the parsed document.
comm = re.compile("<!--|-->")

def make_soup(url):
    page = urllib2.urlopen(url)
    soupdata = BeautifulSoup(comm.sub("", page.read()), 'lxml')
    return soupdata

def get_player_totals():
    soup = make_soup("http://www.pro-football-reference.com/years/2015/")

    tableStats = soup.find('table', {'id':'team_stats'})

    return tableStats

print get_player_totals()

You will see the table when you run the code.
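If you then want the individual rows rather than the whole tag, one possible follow-up (just a sketch) is to walk the rows of the table returned by get_player_totals():

# Sketch only: iterate over the recovered table's rows and print each cell's text.
table = get_player_totals()
for row in table.find_all('tr'):
    cells = [cell.get_text() for cell in row.find_all(['th', 'td'])]
    print cells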

Padraic Cunningham

A couple of thoughts here: first, as mentioned in the comments, the actual table is commented out and is not, strictly speaking, part of the DOM, so the parser cannot reach it directly.
In this situation you can loop over the comments found and try to extract the information via regular expressions (though parsing HTML with regex is heavily discussed and mostly disliked on Stack Overflow; see here for more information). Last but not least, I would recommend requests rather than urllib2.

That being said, here is a working code example:

from bs4 import BeautifulSoup, Comment
import requests, re

def make_soup(url):
    r = requests.get(url)
    soupdata = BeautifulSoup(r.text, 'lxml')
    return soupdata

soup = make_soup("http://www.pro-football-reference.com/years/2015/")

# get the comments
comments = soup.findAll(text=lambda text:isinstance(text, Comment))

# look for the table with the id "team_stats" inside each comment
rx = re.compile(r'<table.+?id="team_stats".+?>[\s\S]+?</table>')
for comment in comments:
    match = rx.search(comment)
    if match:
        table = match.group(0)
        print(table)
        # break the loop once the table is found
        break

Always favour a parser-based solution over this one; the regex part in particular is very error-prone.
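For completeness, a parser-based variant of the same comment loop might look like the sketch below: re-parse each comment with BeautifulSoup and let find do the matching instead of a regular expression.

# Sketch: parse each comment's HTML and look for the table by its id,
# avoiding the regular expression entirely.
for comment in comments:
    comment_soup = BeautifulSoup(comment, 'lxml')
    table = comment_soup.find('table', {'id': 'team_stats'})
    if table is not None:
        print(table)
        break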

Jan