
I am very new to web scraping and am having some trouble scraping NBA player data from nba.com. I first tried to scrape the page with bs4 but ran into an issue which, after some research, I believe is because the stats are loaded via XHR requests rather than being present in the page HTML. I was able to find a web address for the JSON-formatted data, but my Python program seems to bog down and never loads the data. Again, I am very new at web scraping, but I thought I'd see if I was way off track here... Any suggestions? Thanks! (Code below)

import requests
import json

url = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="

resp = requests.get(url=url)
data = json.loads(resp.text)
print(data)

2 Answers


Give this a shot. It will print the fields I've defined for every player row on that page. By the way, you didn't get a response in the first place with your initial try because the webpage expects a User-Agent header in the request, to make sure the request is coming from a real browser rather than from a bot. I faked one and got the response.

import requests

url = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="

# Pretend to be a browser; without a User-Agent the server never sends a response
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()

# The player rows are nested under resultSets -> rowSet
storage = data['resultSets']
for elem in storage:
    all_list = elem['rowSet']

    for item in all_list:
        # First four columns of each row: player id, name, team id, team abbreviation
        Player_Id = item[0]
        Player_name = item[1]
        Team_Id = item[2]
        Team_abbr = item[3]
        print("Player_Id: {} Player_name: {} Team_Id: {} Team_abbr: {}".format(
            Player_Id, Player_name, Team_Id, Team_abbr))
  • I tried your method with this url: "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=\xe2\x9c\x93&grupo_id=Plantas&region_id=&parent_id=&pagina=&nombre=" and I'm always getting a 500, even using headers. Any ideas on how I can adapt it? – Elio Diaz Mar 10 '18 at 07:26
  • Mmm, I tried setting a region_id and I get results back, but since these come in pages (pagina=) I'm only getting the first 10, and there should be > 500 pages; I saw the basketball example has all the data on the same page. Any hints? – Elio Diaz Mar 10 '18 at 15:33

Just realized it is because the User-Agent headers are different... once those are added, it works.
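
Here's a minimal sketch of the original snippet with that header added (the 'Mozilla/5.0' value is just a placeholder; any realistic browser string should work):

import requests
import json

url = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="

# Without a browser-like User-Agent the server never responds, which is why the
# original script appeared to hang. The header value here is only a placeholder.
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)

data = json.loads(resp.text)  # resp.json() is an equivalent shortcut
print(data['resultSets'][0]['rowSet'][:3])  # first few player rows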

  • you can also use r.json() directly, as shown [here](http://docs.python-requests.org/en/master/) – Thecave3 Oct 20 '17 at 21:21