
EDIT: I'm making a lot of headway. I am now trying to parse out single columns from my JSON file, not the entire row, but I get an error whenever I try to manipulate my DataFrame to get the results I want.

The error is: line 52, in df = pd.DataFrame.from_dict(mlbJson['stats_sortable_player']['queryResults']['name_display_first_last']) KeyError: 'name_display_first_last'

It only happens when I try to add another parameter; for instance, I took out ['row'] and added ['name_display_first_last'] to get the first and last name of each player. If I leave in ['row'] it runs, but it gives me all the data, and I only want certain snippets.

Any help would be greatly appreciated! Thanks.
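For context, a minimal sketch of why the two lookups behave differently (the nesting follows the answer below; the sample values and the 'hr' key are made up purely for illustration):

import pandas as pd

# Tiny stand-in with the same nesting as the MLB response.
mlbJson = {'stats_sortable_player': {'queryResults': {'row': [
    {'name_display_first_last': 'Player One', 'hr': '10'},
    {'name_display_first_last': 'Player Two', 'hr': '25'},
]}}}

# 'row' is a key directly under 'queryResults', so this lookup works:
df = pd.DataFrame.from_dict(mlbJson['stats_sortable_player']['queryResults']['row'])

# 'name_display_first_last' is a field of each row dict, not a key under
# 'queryResults', which is why indexing it there raises the KeyError.
# Select it as a column of the DataFrame instead:
names = df['name_display_first_last']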

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Scraping data from MLB.com

target = [MLB JSON][1]
mlbResponse = requests.get(target)
mlbJson = mlbResponse.json()
# Parses the response body as JSON


# Collecting data and giving it names in pandas
# (team and line are built by scraping code not shown in this snippet)
data = {'Team': team, 'Line': line}
# Frames the data as a table
table = pd.DataFrame(data)
# Creates an Excel file named Scrape.xlsx
writer = pd.ExcelWriter('Scrape.xlsx')
# Writes the table to that file, on a sheet named 'Lines 1'
table.to_excel(writer, 'Lines 1')


#stats = {'Name': name, 'Games': games, 'AtBats': ab, 'Runs': runs, 'Hits': hits, 'Doubles': doubles, 'Triples': triples, 'HR': hr, 'RBI': rbi, 'Walks': walks, 'SB': sb}
df = pd.DataFrame.from_dict(mlbJson['stats_sortable_player']['queryResults']['row'])
df.to_excel(writer, 'Batting 2')

# Saves File

writer.save()

1 Answer


It looks like the website loads the data asynchronously through another request to a different URL. The response you're getting has an empty <datagrid></datagrid> tag, and soup2.select_one("#datagrid").find_next("table") returns None.

You can use the developer tools in your browser, under the Network tab, to find the URL that actually loads the data; it looks like:

http://mlb.mlb.com/pubajax/wf/flow/stats.splayer?season=2016&sort_order=%27desc%27&sort_column=%27avg%27&stat_type=hitting&page_type=SortablePlayer&game_type=%27R%27&player_pool=ALL&season_type=ANY&sport_code=%27mlb%27&results=1000&recSP=1&recPP=50

You can modify your code to make a request to this URL, which returns JSON:

url = 'http://mlb.mlb.com/pubajax/wf/flow/stats.splayer?season=2016&sort_order=%27desc%27&sort_column=%27avg%27&stat_type=hitting&page_type=SortablePlayer&game_type=%27R%27&player_pool=ALL&season_type=ANY&sport_code=%27mlb%27&results=1000&recSP=1&recPP=50'
mlbResponse = requests.get(url)
mlbJson = mlbResponse.json()  # Python 3; otherwise use json.loads(mlbResponse.content)
df = pd.DataFrame(mlbJson['stats_sortable_player']['queryResults']['row'])

The DataFrame has 54 columns, so I can't display it here, but you should be able to pick and rename the columns you need.
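For example, a minimal sketch of that selection and renaming, reusing the df from the snippet above and the writer from the question; only name_display_first_last is confirmed here, and the other column names (g, ab, r, h, hr, rbi) are assumptions about the JSON's stat keys:

# Pick the wanted columns (names after the first are assumed, not confirmed)
# and rename them to the labels used in the question's commented-out stats dict.
cols = ['name_display_first_last', 'g', 'ab', 'r', 'h', 'hr', 'rbi']
batting = df[cols].rename(columns={'name_display_first_last': 'Name',
                                   'g': 'Games', 'ab': 'AtBats', 'r': 'Runs',
                                   'h': 'Hits', 'hr': 'HR', 'rbi': 'RBI'})
batting.to_excel(writer, 'Batting 2')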

user666
  • Thanks for the response. I'm still tweaking it but can't quite seem to get it; I'm getting a –  Jul 28 '16 at 07:32
  • So I create soup from that link instead? And when I make the DataFrame, is that where I put the specific columns that I want to move into my Excel file? @user666 –  Jul 28 '16 at 18:09
  • Did you try to run the code in the answer and look at the DataFrame? You can select the columns in the resulting DataFrame and save it to an Excel file. You don't need BeautifulSoup here because the URL returns JSON. – user666 Jul 29 '16 at 02:59
  • Yes, it partially worked! I figured the rest out by using df.loc[:, ['labels of the columns I wanted']] (see the sketch below). Thanks for the help @user666! –  Jul 29 '16 at 03:00
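For completeness, a minimal sketch of the df.loc approach mentioned in that last comment, reusing df and writer from above; the labels are placeholders (name_display_first_last comes from the question, avg from the request URL, anything else would need to be checked against the 54 columns):

# Select all rows and only the wanted columns by label, then write them out.
wanted = df.loc[:, ['name_display_first_last', 'avg']]
wanted.to_excel(writer, 'Batting 2')
writer.save()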