
I am trying to extract JSON keys and their associated data, nested in an API response, into a pandas DataFrame, with each key and its associated data as a separate column.

I have tried the solution HERE: Parsing a JSON string which was loaded from a CSV using Pandas

But there are 2 problems. First, I have to convert the API request response into a CSV, and then back from a CSV to a DataFrame, which seems like a wasted step, but I was willing to do it if this would otherwise work.

Second, even when I do that, I get "JSONDecodeError: Expecting property name enclosed in double quotes."

I have also tried and failed with the solution described in the tutorial at https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas

import requests
from pandas import DataFrame
import pandas as pd
import json

teamgamedata_url = 'https://api.collegefootballdata.com/games/teams?year=2019&week=1&seasonType=regular'

teamgamedataresp = requests.get(teamgamedata_url)

dftg = DataFrame(teamgamedataresp.json())

This works, but it produces a column 'teams' full of nested data, so to follow the CSV advice I convert the DataFrame to a CSV:

dftg.to_csv(r'/path/teamgameinfoapi.csv')

def CustomParser(data):
    j1 = json.loads(data)
    return j1

csvtodf = pd.read_csv('/path/teamgameinfoapi.csv', 
                      converters={'teams':CustomParser}, header=0)

csvtodf[sorted(csvtodf['teams'][0].keys())] = csvtodf['teams'].apply(pd.Series)

I expected the CustomParser to extract the JSON data into separate columns, but got:

JSONDecodeError: Expecting property name enclosed in double quotes

I expected the last line of code to append the columns to the dataframe, but instead got:

KeyError: 'teams'
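From what I can tell, the decode error probably comes from the CSV round trip itself: to_csv writes the nested dicts with Python's repr (single quotes), which json.loads refuses to parse. A rough sketch of what I think is happening (the cell contents are abbreviated), where ast.literal_eval can read the repr that json.loads cannot:

import ast
import json

# roughly how a 'teams' cell looks after the to_csv/read_csv round trip (abbreviated)
cell = "[{'school': 'Vanderbilt', 'points': 6}]"

# json.loads(cell)               # raises JSONDecodeError: Expecting property name enclosed in double quotes
teams = ast.literal_eval(cell)   # parses the Python-style repr that to_csv wrote out
print(teams[0]['school'])        # Vanderbilt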

janalytics
3 Answers


An efficient, more pandas-native way of doing this:

On pandas >= 0.25.1

teamgamedataresp = requests.get(teamgamedata_url)

d = teamgamedataresp.json()

# errors='ignore' because some records may not have an 'id'; without it json_normalize would raise on those records
teams_df = pd.io.json.json_normalize(d, ['teams'], ['id'], errors='ignore')
print(teams_df)

teams_df = teams_df.explode('stats')
print(teams_df)

stats_df = pd.io.json.json_normalize(teams_df['stats'])
print(stats_df)

teams_df.drop(columns=['stats'], inplace=True)
data = pd.concat([teams_df.reset_index(drop=True), stats_df.reset_index(drop=True)], axis=1)
print(data)

TL;DR (Data displayed for understanding purposes)

Normalizing the first level of records, i.e. teams:

        school         conference homeAway  points                                              stats         id
0       Vanderbilt                SEC     home       6  [{'category': 'rushingTDs', 'stat': '0'}, {'ca...  401110732
1          Georgia                SEC     away      30  [{'category': 'rushingTDs', 'stat': '2'}, {'ca...  401110732
2            Miami                ACC     home      20  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401110723
3          Florida                SEC     away      24  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401110723
4    Georgia State           Sun Belt     away      38  [{'category': 'rushingTDs', 'stat': '3'}, {'ca...  401110730
..             ...                ...      ...     ...                                                ...        ...
163           Navy  American Athletic     home      45  [{'category': 'rushingTDs', 'stat': '6'}, {'ca...  401117857
164   Gardner-Webb               None     away      28  [{'category': 'rushingTDs', 'stat': '3'}, {'ca...  401135910
165      Charlotte     Conference USA     home      49  [{'category': 'rushingTDs', 'stat': '4'}, {'ca...  401135910
166  Alabama State               None     away      19  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401114237
167            UAB     Conference USA     home      24  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401114237

[168 rows x 6 columns]

Exploding the lists in the stats column into rows:

         school      conference homeAway  points                                           stats         id
0    Vanderbilt             SEC     home       6         {'category': 'rushingTDs', 'stat': '0'}  401110732
0    Vanderbilt             SEC     home       6         {'category': 'passingTDs', 'stat': '0'}  401110732
0    Vanderbilt             SEC     home       6   {'category': 'kickReturnYards', 'stat': '35'}  401110732
0    Vanderbilt             SEC     home       6      {'category': 'kickReturnTDs', 'stat': '0'}  401110732
0    Vanderbilt             SEC     home       6        {'category': 'kickReturns', 'stat': '2'}  401110732
..          ...             ...      ...     ...                                             ...        ...
167         UAB  Conference USA     home      24  {'category': 'netPassingYards', 'stat': '114'}  401114237
167         UAB  Conference USA     home      24       {'category': 'totalYards', 'stat': '290'}  401114237
167         UAB  Conference USA     home      24    {'category': 'fourthDownEff', 'stat': '0-1'}  401114237
167         UAB  Conference USA     home      24    {'category': 'thirdDownEff', 'stat': '1-13'}  401114237
167         UAB  Conference USA     home      24        {'category': 'firstDowns', 'stat': '16'}  401114237

[3927 rows x 6 columns]

Normalizing the stats column into its own dataframe:

             category  stat
0          rushingTDs     0
1          passingTDs     0
2     kickReturnYards    35
3       kickReturnTDs     0
4         kickReturns     2
...               ...   ...
3922  netPassingYards   114
3923       totalYards   290
3924    fourthDownEff   0-1
3925     thirdDownEff  1-13
3926       firstDowns    16

[3927 rows x 2 columns]

Finally, merge both dataframes:

          school      conference homeAway  points         id         category  stat
0     Vanderbilt             SEC     home       6  401110732       rushingTDs     0
1     Vanderbilt             SEC     home       6  401110732       passingTDs     0
2     Vanderbilt             SEC     home       6  401110732  kickReturnYards    35
3     Vanderbilt             SEC     home       6  401110732    kickReturnTDs     0
4     Vanderbilt             SEC     home       6  401110732      kickReturns     2
...          ...             ...      ...     ...        ...              ...   ...
3922         UAB  Conference USA     home      24  401114237  netPassingYards   114
3923         UAB  Conference USA     home      24  401114237       totalYards   290
3924         UAB  Conference USA     home      24  401114237    fourthDownEff   0-1
3925         UAB  Conference USA     home      24  401114237     thirdDownEff  1-13
3926         UAB  Conference USA     home      24  401114237       firstDowns    16

[3927 rows x 7 columns]
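As a side note, pd.io.json.json_normalize was promoted to the top-level pd.json_normalize in pandas 1.0, and the old path is deprecated there. On a recent pandas, the whole flattening can be sketched in a single call, assuming the same response structure (the team-level meta columns come out prefixed with `teams.` by default):

import pandas as pd
import requests

teamgamedata_url = 'https://api.collegefootballdata.com/games/teams?year=2019&week=1&seasonType=regular'
d = requests.get(teamgamedata_url).json()

# Records are the innermost 'stats' entries; the game id and team-level fields ride along as meta
data = pd.json_normalize(
    d,
    record_path=['teams', 'stats'],
    meta=['id', ['teams', 'school'], ['teams', 'conference'],
          ['teams', 'homeAway'], ['teams', 'points']],
    errors='ignore',
)
print(data)

This replaces the separate explode and concat steps above with one pass over the nested records.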
Vishnudev Krishnadas
  • Thank you. I am going to work through this after I work through the original response because I want to clearly understand what's going on in each so I develop my own skillset. – janalytics Sep 02 '19 at 17:22
  • I'll help with that. I'll update my answer. – Vishnudev Krishnadas Sep 02 '19 at 17:23
  • Extraordinarily helpful, @Vishnudev. – janalytics Sep 02 '19 at 18:03
  • If you find it helpful, encourage by upvoting/accepting as answer. @janalytics – Vishnudev Krishnadas Sep 02 '19 at 18:05
  • In this particular case, the true first level of record is probably the game, defined by 'id'. But I see the value in being able to define by other record levels as well. Depending upon the use case, this can be far easier to read visually. But I guess that's where learning pandas well is important -- developing the ability to structure dataframes differently depending upon the goal. – janalytics Sep 02 '19 at 18:08
  • Did both upvote and answer . . . I'm brand new so I'll need to develop a few more reputation points before they show up. – janalytics Sep 02 '19 at 18:09
  • Not a problem @janalytics. Enjoy exploring SO :) – Vishnudev Krishnadas Sep 02 '19 at 18:10
  • Good Question, updating answer. – Vishnudev Krishnadas Sep 02 '19 at 18:13
  • OK, there's something I'm not following. Attempting this on a different dataset: `games_df = pd.io.json.json_normalize(d, ['id'], ['homeTeam'], ['homeScore'], ['awayTeam'], ['awayScore'], errors='ignore')`. This yields "TypeError: 'int' object is not iterable." I assume that int = integer, but if that were the problem it should have broken the example above, which is using the same id number. – janalytics Sep 02 '19 at 20:47
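Regarding the TypeError in the last comment: record_path has to name a key whose value is a list of records, so pointing it at 'id' (an integer) is what makes pandas complain that an 'int' object is not iterable, and the extra per-game fields should all be passed together in the single meta argument rather than as separate positional arguments. A minimal sketch, assuming a flat /games-style response (the URL here is a guess at the endpoint that comment refers to):

import pandas as pd
import requests

# Hypothetical URL for the /games dataset mentioned in the comment
games_url = 'https://api.collegefootballdata.com/games?year=2019&week=1&seasonType=regular'
d = requests.get(games_url).json()

# If each game record is already flat (no list-valued key), no record_path is needed at all
games_df = pd.io.json.json_normalize(d)

# If a record did contain a list-valued key, the scalar fields would all go into a single meta list:
# pd.io.json.json_normalize(d, record_path=['someListKey'],
#                           meta=['id', 'homeTeam', 'homeScore', 'awayTeam', 'awayScore'],
#                           errors='ignore')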

I would rather drill down manually, as the structure is complex:

a = teamgamedataresp.json()
buf = []
for game in a:
    for team in game['teams']:
        game_dict = dict(id=game['id'])
        for cat in ('school', 'conference', 'homeAway', 'points'):
            game_dict[cat] = team[cat]
        for stat in team['stats']:
            game_dict[stat['category']] = stat['stat']
        buf.append(game_dict)

>>> df = pd.DataFrame(buf)
>>> df
            id         school         conference homeAway  points  ... puntReturnTDs puntReturns interceptionYards interceptionTDs passesIntercepted
0    401110732     Vanderbilt                SEC     home       6  ...           NaN         NaN               NaN             NaN               NaN
1    401110732        Georgia                SEC     away      30  ...             0           4               NaN             NaN               NaN
2    401110723          Miami                ACC     home      20  ...             0           1                41               0                 2
3    401110723        Florida                SEC     away      24  ...             0           3               NaN             NaN               NaN
4    401110730  Georgia State           Sun Belt     away      38  ...           NaN         NaN                 0               0                 1
..         ...            ...                ...      ...     ...  ...           ...         ...               ...             ...               ...
163  401117857           Navy  American Athletic     home      45  ...             0           1               NaN             NaN               NaN
164  401135910   Gardner-Webb               None     away      28  ...           NaN         NaN                45               0                 3
165  401135910      Charlotte     Conference USA     home      49  ...             1           3               NaN             NaN               NaN
166  401114237  Alabama State               None     away      19  ...             0           2               NaN             NaN               NaN
167  401114237            UAB     Conference USA     home      24  ...             0           2                 0               0                 1

[168 rows x 31 columns]
crayxt
  • Hmm. Did you take a step to reorder the columns? Mine come out in a different order. Also, if you'll forgive a newbie question, how do you get your code to display that way (as opposed to the raw form in which mine appeared)? – janalytics Sep 02 '19 at 16:10
  • Nope, I did not. Could be the pandas version? Mine is `0.25.1` in a Linux Anaconda Python 3.7 environment. For code formatting, you have to select the text and mark it as code in the editing box. By pandas version I actually mean the Python version: since Python 3.6, all dictionaries are ordered, which means they preserve the order of key insertion. – crayxt Sep 02 '19 at 16:14
  • Mine is 0.24.2 in a Mac Anaconda Python 3.7 environment. It seems to want to put the columns in alphabetical order, which is a bit annoying (a sketch for pinning the column order follows these comments). Your solution worked great, but since I'm learning, I want to confirm my understanding of how you attacked this. Will try to post commented code back with the select-and-mark method. Thanks for your patience. – janalytics Sep 02 '19 at 17:09
  • Having to do this by answering my own question because of space limitations. – janalytics Sep 02 '19 at 18:02
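If the column order matters regardless of the pandas/Python version, it can be pinned explicitly after building the frame. A small sketch on top of the buf list from the answer above:

df = pd.DataFrame(buf)   # buf as built in the loop above

# Put the known columns first; whatever stat categories appeared follow in their original order
lead_cols = ['id', 'school', 'conference', 'homeAway', 'points']
df = df[lead_cols + [c for c in df.columns if c not in lead_cols]]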

I was unable to fit this into a comment; I apologize for breaking protocol. Hope I have followed you, @crayxt. Please let me know if I've misunderstood.

a = teamgamedataresp.json()
# a is the parsed body of our GET request: a list of dicts, one per game
buf = []
# buf starts as an empty list: a bucket that will collect one dict (one future row) per team
for game in a:
    # loop over every game in the response
    for team in game['teams']:
        # within each game, loop over the team records stored under the 'teams' key
        game_dict = dict(id=game['id'])
        # start a fresh dict for this team, seeded with the game ID so the row
        # can always be tied back to its game
        for cat in ('school', 'conference', 'homeAway', 'points'):
            # these categories have no further nesting to explore,
            # so they can be copied straight across from the team record
            game_dict[cat] = team[cat]
        for stat in team['stats']:
            game_dict[stat['category']] = stat['stat']
            # now dive into the further nesting in the stats list: each entry's
            # JSON 'category' key becomes a column name, and its 'stat' value
            # becomes that column's value
        buf.append(game_dict)
        # adds the finished dict (one row per team per game) to the bucket
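The one step not repeated in this walkthrough is the last line of crayxt's answer, which turns the list of per-team dicts into the DataFrame:

df = pd.DataFrame(buf)
# one row per team per game, with a column for every stat category that appeared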
janalytics