Python and Beautiful Soup Web Scraping

Question

I am trying to scrape the stats from the table on this webpage: http://stats.nba.com/teams/traditional/ but I am unable to find the HTML for the table. I am using Python 2.7.10.

from bs4 import BeautifulSoup
import json
import urllib

html = urllib.urlopen('http://stats.nba.com/teams/traditional/').read()

soup = BeautifulSoup(html, "html.parser")


for table in soup.find_all('tr'):
    print(table)

This is the code I have now, but it prints nothing. If I try this with different elements on the page, it works fine.

the table values are rendered via JavaScript so you are going to need a JavaScript parser to obtain the values, as opposed to BeautifulSoup — smoggers, Dec 16 '16 at 19:14
You don't really have to use a javascript parser if you know where the data comes from, in this case it's http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision= — Shane, Dec 16 '16 at 19:25
@Shane gives us the JSON format. Then use this to get python code http://stackoverflow.com/questions/2835559/parsing-values-from-a-json-file — Anurag Joshi, Dec 16 '16 at 19:26
How would you get the statistics out of the json file? I looked at the linked question and have been trying for a while, but I can't get anywhere. — johnbowman, Dec 16 '16 at 20:50

score 0 · Answer 1 · answered Dec 16 '16 at 19:09

The table is loaded dynamically, so when you grab the html, there are no tr tags in it to be found.

answered Dec 16 '16 at 19:09

Scott Hunter


score 0 · Answer 2 · edited May 23 '17 at 12:30

The table you're looking for is NOT in that specific page/URL.

The stats you're trying to scrape come from this url:

http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=

When you open a URL in a modern browser, the browser makes additional requests "behind the scenes", beyond the original URL, in order to fully render the page.

I know this sounds counter-intuitive; you can check out this answer for a more detailed explanation.
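To get the statistics out of that JSON, something along these lines may work. It is only a sketch: it assumes the response has a top-level "resultSets" list whose entries carry a "headers" list and a "rowSet" list of rows, so check the actual response before relying on it. The function names, the User-Agent header, and the sample data (including the TEAM_NAME, W and L columns) are made up for illustration. It is written for Python 3; on Python 2.7 the request classes live in urllib2 instead of urllib.request.

```python
import json
from urllib.request import Request, urlopen


def fetch_payload(url, timeout=30):
    """Download the endpoint and decode its JSON body."""
    # Assumption: a browser-like User-Agent is sent in case the server
    # rejects the default one used by urllib.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def rows_from_payload(payload, result_set=0):
    """Turn one result set into a list of dicts keyed by column header."""
    # Assumed layout: payload["resultSets"][i] has "headers" and "rowSet".
    table = payload["resultSets"][result_set]
    return [dict(zip(table["headers"], row)) for row in table["rowSet"]]


# Made-up payload in the assumed layout, so the parsing step runs offline.
sample = {
    "resultSets": [
        {
            "headers": ["TEAM_NAME", "W", "L"],
            "rowSet": [["Team A", 10, 5], ["Team B", 7, 8]],
        }
    ]
}

for team in rows_from_payload(sample):
    print(team["TEAM_NAME"], team["W"], team["L"])
```

With the real data you would call fetch_payload with the URL above and pass the result to rows_from_payload; each row then becomes a dict you can index by column name.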

score 0 · Answer 3 · answered Dec 16 '16 at 19:22

Try this code; it returns the page's HTML for me. I am using requests to fetch the page.

    import requests
    from bs4 import BeautifulSoup

    url = "http://stats.nba.com/teams/traditional/"
    data = requests.get(url)

    # Only parse the response if the server did not return an error status.
    if data.status_code < 400:
        print("AUTHENTICATED:STATUS_CODE" + " " + str(data.status_code))
        sample = data.content
        soup = BeautifulSoup(sample, 'html.parser')
        print(soup)

score 0 · Answer 4 · answered Dec 18 '16 at 19:32

You can use selenium and PhantomJS (or chromedriver, Firefox, etc.) to load the page, which also runs all the JavaScript. All you need is to download selenium and the PhantomJS webdriver, then place a sleep timer after the get(url) to ensure that the page loads (actually, using an explicit wait such as WebDriverWait would be much better than sleep, but you can look more into that if you need it). Now your soup content will look exactly like what you see when looking at the site through your browser.

from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep

url = 'http://stats.nba.com/teams/traditional/'
browser = webdriver.PhantomJS('*path to PhantomJS driver')
browser.get(url)

sleep(10)

soup = BeautifulSoup(browser.page_source, "html.parser")
for table in soup.find_all('tr'):
    print(table)
