
I am trying to build a database of play-by-play data for several seasons of NBA games, for my MSc in Economics dissertation. Currently I am extracting games from the NBA's API (see example) and splitting each game into a separate .json file using this routine (duly adapted for play-by-play purposes), which yields .json files like this (first play shown):

{"headers": ["GAME_ID", "EVENTNUM", "EVENTMSGTYPE", "EVENTMSGACTIONTYPE", "PERIOD", "WCTIMESTRING", "PCTIMESTRING", "HOMEDESCRIPTION", "NEUTRALDESCRIPTION", "VISITORDESCRIPTION", "SCORE", "SCOREMARGIN"], "rowSet": [["0041400406", 0, 12, 0, 1, "9:11 PM", "12:00", null, null, null, null, null], ["0041400406", 1, 10, 0, 1, "9:11 PM", "12:00", "Jump Ball Mozgov vs. Green: Tip to Barnes", null, null, null, null], ...]}

I plan on creating a loop to convert all of the generated .json files to .csv, so that I can proceed to econometric analysis in Stata. At the moment I am stuck at the first step of this procedure: creating the JSON-to-CSV conversion process (I will design the loop afterwards). The code I am trying is:

import json
import csv

f = open('pbp_0041400406.json')
data = json.load(f)
f.close()

with open("pbp_0041400406.csv", "w") as file:
    csv_file = csv.writer(file)

    for rowSet in data:
        csv_file.writerow(rowSet)

However, the yielded CSV files show awkward results: one line reading h,e,a,d,e,r,s and another reading r,o,w,S,e,t, thus capturing neither the headers nor the rowSet (the plays themselves).

I have tried to solve this problem taking into account the contributions in this thread, but I have not been able to. Can anybody please provide me some insight into solving this problem?

[EDIT] Replacing `rowSet` with `data` in the original code also yielded the same results.

Thanks in advance!

  • Hey @SNygard, thanks for the hint. I printed rowSet and it yielded name not defined. After printing data, the original .json file did show up on the shell. However, after replacing data for rowset in the original code, the returned .csv yielded the same result as before. – André Dias Feb 16 '16 at 23:03
    Given the sample JSON, `data` is going to be a [dict](https://docs.python.org/2/tutorial/datastructures.html#dictionaries) with two keys, `headers` and `rowSet`. So, looping over the rows would require `for row in data['rowSet']`. – Kurt Raschke Feb 16 '16 at 23:08
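Kurt's point can be illustrated with a small, self-contained sketch (using an inline dict standing in for the parsed file): iterating over the dict yields its keys, and `writerow` then treats each key string as a sequence of characters, which is exactly what produces the `h,e,a,d,e,r,s` lines:

```python
import csv
import io

# Minimal stand-in for the parsed play-by-play JSON
data = {"headers": ["GAME_ID", "EVENTNUM"], "rowSet": [["0041400406", 0]]}

buf = io.StringIO()
writer = csv.writer(buf)

for rowSet in data:          # iterates over the dict's KEYS: "headers", "rowSet"
    writer.writerow(rowSet)  # a string is an iterable of single characters

print(buf.getvalue())
# h,e,a,d,e,r,s
# r,o,w,S,e,t
```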

3 Answers


try this:

import json
import csv

with open('pbp_0041400406.json') as f:
    data = json.load(f)


with open("pbp_0041400406.csv", "w", newline="") as fout:  # newline="" avoids blank lines on Python 3
    csv_file = csv.writer(fout, quotechar='"')

    csv_file.writerow(data['headers'])

    for rowSet in data['rowSet']:
        csv_file.writerow(rowSet)

Resulting CSV:

GAME_ID,EVENTNUM,EVENTMSGTYPE,EVENTMSGACTIONTYPE,PERIOD,WCTIMESTRING,PCTIMESTRING,HOMEDESCRIPTION,NEUTRALDESCRIPTION,VISITORDESCRIPTION,SCORE,SCOREMARGIN
0041400406,0,12,0,1,9:11 PM,12:00,,,,,
0041400406,1,10,0,1,9:11 PM,12:00,Jump Ball Mozgov vs. Green: Tip to Barnes,,,,
MaxU - stand with Ukraine
  • Hi @MaxU, thanks for your reply! I tried to repeat it and I was successful in producing the .CSV file I was looking for. However, now I am struggling with another problem, since I need to loop this code over a number of files in the same directory. The idea is to run the same exact code on a number of .JSON files in the same path, irrespective of their name/order/etc. I have tried some workarounds with 'while' and 'for' code with created objects, but so far unsuccessfully. Do you have any suggestion to this problem? – André Dias Feb 17 '16 at 14:21
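One sketch of such a loop, assuming every file follows the `pbp_*.json` naming pattern and the flat `headers`/`rowSet` layout shown in the question, uses `glob` so the files are picked up irrespective of their name or order:

```python
import csv
import glob
import json
import os

# Convert every pbp_*.json file in the current directory, whatever its game ID
for json_path in glob.glob('pbp_*.json'):
    with open(json_path) as f:
        data = json.load(f)

    # e.g. pbp_0041400406.json -> pbp_0041400406.csv
    csv_path = os.path.splitext(json_path)[0] + '.csv'

    with open(csv_path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        writer.writerow(data['headers'])
        writer.writerows(data['rowSet'])
```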

I think you might have made a mistake regarding the structure of the JSON input. There are three keys at the top level; `resultSets` is a list whose first element is a dictionary with the key `'rowSet'`. That is what you want to iterate over.

import json
import csv

with open('playbyplay', 'r') as f:
    data = json.load(f)
print(data.keys())
rows = data['resultSets'][0]['rowSet']

with open("pbp_0041400406.csv", "w") as file:
    csv_file = csv.writer(file)
    for rowSet in rows:
        csv_file.writerow(rowSet)

Output data:

0041300402,0,12,0,1,8:13 PM,12:00,,,,,,0,0,,,,,,0,0,,,,,,0,0,,,,,
0041300402,1,10,0,1,8:13 PM,12:00,Jump Ball Duncan vs. Bosh: Tip to Wade,,,,,4,1495,Tim Duncan,1610612759,San Antonio,Spurs,SAS,5,2547,Chris Bosh,1610612748,Miami,Heat,MIA,5,2548,Dwyane Wade,1610612748,Miami,Heat,MIA
0041300402,2,5,2,1,8:13 PM,11:45,Green STEAL (1 STL),,James Lost Ball Turnover (P1.T1),,,5,2544,LeBron James,1610612748,Miami,Heat,MIA,4,201980,Danny Green,1610612759,San Antonio,Spurs,SAS,0,0,,,,,
0041300402,3,1,1,1,8:14 PM,11:26,Green 18' Jump Shot (2 PTS) (Splitter 1 AST),,,0 - 2,2,4,201980,Danny Green,1610612759,San Antonio,Spurs,SAS,4,201168,Tiago Splitter,1610612759,San Antonio,Spurs,SAS,0,0,,,,,
0041300402,4,6,2,1,8:14 PM,11:03,Green S.FOUL (P1.T1),,,,,4,201980,Danny Green,1610612759,San Antonio,Spurs,SAS,5,1740,Rashard Lewis,1610612748,Miami,Heat,MIA,1,0,,,,,
0041300402,5,3,11,1,8:14 PM,11:03,,,Lewis Free Throw 1 of 2 (1 PTS),1 - 2,1,5,1740,Rashard Lewis,1610612748,Miami,Heat,MIA,0,0,,,,,,0,0,,,,,
0041300402,7,3,12,1,8:14 PM,11:03,,,MISS Lewis Free Throw 2 of 2,,,5,1740,Rashard Lewis,1610612748,Miami,Heat,MIA,0,0,,,,,,0,0,,,,,
0041300402,9,4,0,1,8:15 PM,11:01,Splitter REBOUND (Off:0 Def:1),,,,,4,201168,Tiago Splitter,1610612759,San Antonio,Spurs,SAS,0,0,,,,,,0,0,,,,,
0041300402,10,1,5,1,8:15 PM,10:52,Duncan 1' Layup (2 PTS) (Parker 1 AST),,,1 - 4,3,4,1495,Tim Duncan,1610612759,San Antonio,Spurs,SAS,4,2225,Tony Parker,1610612759,San Antonio,Spurs,SAS,0,0,,,,,
Garrett R
  • Hi @MaxU, thanks for your reply! I tried to repeat it and I was successful in producing the .CSV file I was looking for, but only when I used the original .json feed. However, now I am struggling with another problem, since I need to loop this code over a number of files in the same directory. The idea is to run the same exact code on a number of .JSON files in the same path, irrespective of their name/order/etc. I have tried some workarounds with 'while' and 'for' code with created objects, but so far unsuccessfully. Do you have any suggestion to this problem? – André Dias Feb 17 '16 at 21:29

Issue solved! Using @MaxU's code and a previously constructed .csv containing all game IDs, every NBA game since the 2001-02 season can be scraped directly via JSON and converted to CSV using the following code (credits to @MaxU):

from __future__ import print_function

import json
import csv
import requests

u_a = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36"
url_pattern = "http://stats.nba.com/stats/playbyplayv2?GameID=%(GameID)s&StartPeriod=%(StartPeriod)s&EndPeriod=%(EndPeriod)s&tabView=%(tabView)s"


def write_csv(game_id, resultSet):
    fn = resultSet['name'] + '_' + str(game_id) + '.csv'
    # ignore unimportant resultSets ...
    if resultSet['name'] not in ['PlayByPlay', 'PlayBlahBlah']:
        return
    with open(fn, 'w') as fout:
        csv_file = csv.writer(fout, quotechar='"')

        csv_file.writerow(resultSet['headers'])

        for rowSet in resultSet['rowSet']:
            csv_file.writerow(rowSet)


def process_game_id(game_id, tabView='playbyplay',
                    start_period='0', end_period='0'):
    url_parms = {
        'GameID': game_id,
        'StartPeriod': start_period,
        'EndPeriod': end_period,
        'tabView': tabView,
    }

    r = requests.get((url_pattern % url_parms), headers={"USER-AGENT": u_a})

    if r.status_code == requests.codes.ok:
        data = json.loads(r.text)

        for rset in data['resultSets']:
            write_csv(url_parms['GameID'], rset)
    else:
        r.raise_for_status()


if __name__ == '__main__':
    #
    # assuming that the 'games.csv' file contains all Game_IDs ...
    #
    with open('games.csv', 'r') as f:
        csv_reader = csv.reader(f, delimiter=',')
        for row in csv_reader:
            process_game_id(row[<column_num_containing_Game_ID>])

Any further questions on this data, please do PM me. Happy coding, everyone!