0

I have these data taken from soccer stat site:

[{'id': '18209', 'isResult': True, 'side': 'h', 'h': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'a': {'id': '220', 'title': 'Brighton', 'short_title': 'BRI'}, 'goals': {'h': '1', 'a': '2'}, 'xG': {'h': '1.42103', 'a': '1.7289'}, 'datetime': '2022-08-07 13:00:00'}, {'id': '18218', 'isResult': True, 'side': 'a', 'h': {'id': '244', 'title': 'Brentford', 'short_title': 'BRE'}, 'a': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'goals': {'h': '4', 'a': '0'}, 'xG': {'h': '1.38785', 'a': '0.896038'}, 'datetime': '2022-08-13 16:30:00'}, {'id': '18231', 'isResult': True, 'side': 'h', 'h': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'a': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'}, 'goals': {'h': '2', 'a': '1'}, 'xG': {'h': '2.01764', 'a': '1.52301'}, 'datetime': '2022-08-22 19:00:00'}, {'id': '18232', 'isResult': True, 'side': 'a', 'h': {'id': '74', 'title': 'Southampton', 'short_title': 'SOU'}, 'a': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'goals': {'h': '0', 'a': '1'}, 'xG': {'h': '1.35887', 'a': '1.34359'}, 'datetime': '2022-08-27 11:30:00'}]

If I put them into a dataframe and then into CSV, I obtain this:

      id  isResult  ...                                 xG             datetime
0  18209      True  ...    {'h': '1.42103', 'a': '1.7289'}  2022-08-07 13:00:00
1  18218      True  ...  {'h': '1.38785', 'a': '0.896038'}  2022-08-13 16:30:00
2  18231      True  ...   {'h': '2.01764', 'a': '1.52301'}  2022-08-22 19:00:00
3  18232      True  ...   {'h': '1.35887', 'a': '1.34359'}  2022-08-27 11:30:00

The part in braces is not split. Is there a way to get also this part split into pandas dataframe columns?

This is the code:

import pandas as pd

ta = [{'id': '18209', 'isResult': True, 'side': 'h', 'h': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'a': {'id': '220', 'title': 'Brighton', 'short_title': 'BRI'}, 'goals': {'h': '1', 'a': '2'}, 'xG': {'h': '1.42103', 'a': '1.7289'}, 'datetime': '2022-08-07 13:00:00'}, {'id': '18218', 'isResult': True, 'side': 'a', 'h': {'id': '244', 'title': 'Brentford', 'short_title': 'BRE'}, 'a': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'goals': {'h': '4', 'a': '0'}, 'xG': {'h': '1.38785', 'a': '0.896038'}, 'datetime': '2022-08-13 16:30:00'}, {'id': '18231', 'isResult': True,'side': 'h', 'h': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'a': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'}, 'goals': {'h': '2', 'a': '1'}, 'xG': {'h': '2.01764', 'a': '1.52301'}, 'datetime': '2022-08-22 19:00:00'}, {'id': '18232', 'isResult': True, 'side': 'a', 'h': {'id': '74', 'title': 'Southampton', 'short_title': 'SOU'}, 'a': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'goals': {'h': '0', 'a': '1'}, 'xG': {'h': '1.35887', 'a': '1.34359'}, 'datetime': '2022-08-27 11:30:00'}]

df = pd.DataFrame(ta)

df.to_csv("G:\\stat.csv", header=True)

print(df)
Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
eestlane
  • 95
  • 6
  • 1
    Does this answer your question? [Extract data from array - Python](https://stackoverflow.com/questions/73425151/extract-data-from-array-python) , your question is same, only the attributes are different. – ThePyGuy Aug 28 '22 at 09:29
  • It is not the same even if very similar: for example i need the column 'h' to become tre different columns 'id, 'title', short_title' and same for other data into braces; adapting the code of this answer got me only id, title, short_title (and not their value) splitted into 3 rows not columns – eestlane Aug 28 '22 at 10:28
  • 1
    Once you have the columns of your interest, you need to merge it back to the original dataframe as well, I believe the questions marked duplicated to solves your problem. – ThePyGuy Aug 28 '22 at 10:36
  • 1
    Solution in links above df = pd.json_normalize(ta) – Сергей Кох Aug 28 '22 at 11:10

2 Answers2

0

I have been checking the in comments and in duplicate hints provided links expecting a simple pandas command to achieve the goal, but haven't run into this simple solution which gives the same result as the code I came up with. Thanks to @Сергей Кох for providing THIS:

import pandas as pd
ta = [{'id': '18209', 'isResult': True, 'side': 'h', 'h': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'a': {'id': '220', 'title': 'Brighton', 'short_title': 'BRI'}, 'goals': {'h': '1', 'a': '2'}, 'xG': {'h': '1.42103', 'a': '1.7289'}, 'datetime': '2022-08-07 13:00:00'}, {'id': '18218', 'isResult': True, 'side': 'a', 'h': {'id': '244', 'title': 'Brentford', 'short_title': 'BRE'}, 'a': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'goals': {'h': '4', 'a': '0'}, 'xG': {'h': '1.38785', 'a': '0.896038'}, 'datetime': '2022-08-13 16:30:00'}, {'id': '18231', 'isResult': True, 'side': 'h', 'h': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'a': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'}, 'goals': {'h': '2', 'a': '1'}, 'xG': {'h': '2.01764', 'a': '1.52301'}, 'datetime': '2022-08-22 19:00:00'}, {'id': '18232', 'isResult': True, 'side': 'a', 'h': {'id': '74', 'title': 'Southampton', 'short_title': 'SOU'}, 'a': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'goals': {'h': '0', 'a': '1'}, 'xG': {'h': '1.35887', 'a': '1.34359'}, 'datetime': '2022-08-27 11:30:00'}]
df = pd.json_normalize(ta)
df.to_csv("pandas_splitting.csv", header=True)

Anyway checking out the code below reveals what the json_normalize() function actually does. The only difference to my code appears to be usage of a '.' and not '_' for creating column names and ... sure it will be a way faster especially on large datasets.

import pandas as pd
ta = [ 
       {'id'      : '18209', 
        'isResult': True, 
        'side'    : 'h', 
        'h'       : {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 
        'a'       : {'id': '220', 'title': 'Brighton', 'short_title': 'BRI'}, 
        'goals'   : {'h': '1', 'a': '2'}, 
        'xG': {'h': '1.42103', 'a': '1.7289'}, 
        'datetime': '2022-08-07 13:00:00'
       }, 
       {'id'      : '18218', 
        'isResult': True, 
        'side'    : 'a', 
        'h': {'id': '244', 'title': 'Brentford', 'short_title': 'BRE'}, 
        'a': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 
        'goals'   : {'h': '4', 'a': '0'}, 
        'xG': {'h': '1.38785', 'a': '0.896038'}, 
        'datetime': '2022-08-13 16:30:00'
       }, {'id': '18231', 'isResult': True,'side': 'h', 'h': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'a': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'}, 'goals': {'h': '2', 'a': '1'}, 'xG': {'h': '2.01764', 'a': '1.52301'}, 'datetime': '2022-08-22 19:00:00'}, {'id': '18232', 'isResult': True, 'side': 'a', 'h': {'id': '74', 'title': 'Southampton', 'short_title': 'SOU'}, 'a': {'id': '89', 'title': 'Manchester United', 'short_title': 'MUN'}, 'goals': {'h': '0', 'a': '1'}, 'xG': {'h': '1.35887', 'a': '1.34359'}, 'datetime': '2022-08-27 11:30:00'}]
for dct in ta:
    h     = dct['h']
    a     = dct['a']
    goals = dct['goals']
    xG    = dct['xG']
    for key, value in h.items(): 
        dct['h_'+key] = value
    dct.pop('h')
    for key, value in a.items(): 
        dct['a_'+key] = value
    dct.pop('a')
    for key, value in goals.items(): 
        dct['goals_'+key] = value
    dct.pop('goals')
    for key, value in xG.items(): 
        dct['xG_'+key] = value
    dct.pop('xG')

df = pd.DataFrame(ta)
print(df)
df.to_csv("pandas_splitting.csv", header=True)
print_df = """
     id  isResult side             datetime  ... goals_h goals_a     xG_h      xG_a
0  18209      True    h  2022-08-07 13:00:00  ...       1       2  1.42103    1.7289
1  18218      True    a  2022-08-13 16:30:00  ...       4       0  1.38785  0.896038
2  18231      True    h  2022-08-22 19:00:00  ...       2       1  2.01764   1.52301
3  18232      True    a  2022-08-27 11:30:00  ...       0       1  1.35887   1.34359
"""

gives:

     id  isResult side             datetime  ... goals_h goals_a     xG_h      xG_a
0  18209      True    h  2022-08-07 13:00:00  ...       1       2  1.42103    1.7289
1  18218      True    a  2022-08-13 16:30:00  ...       4       0  1.38785  0.896038
2  18231      True    h  2022-08-22 19:00:00  ...       2       1  2.01764   1.52301
3  18232      True    a  2022-08-27 11:30:00  ...       0       1  1.35887   1.34359

with following CSV file content:

,id,isResult,side,datetime,h_id,h_title,h_short_title,a_id,a_title,a_short_title,goals_h,goals_a,xG_h,xG_a
0,18209,True,h,2022-08-07 13:00:00,89,Manchester United,MUN,220,Brighton,BRI,1,2,1.42103,1.7289
1,18218,True,a,2022-08-13 16:30:00,244,Brentford,BRE,89,Manchester United,MUN,4,0,1.38785,0.896038
2,18231,True,h,2022-08-22 19:00:00,89,Manchester United,MUN,87,Liverpool,LIV,2,1,2.01764,1.52301
3,18232,True,a,2022-08-27 11:30:00,74,Southampton,SOU,89,Manchester United,MUN,0,1,1.35887,1.34359
Claudio
  • 7,474
  • 3
  • 18
  • 48
-1

What you can do is to load them as JSON, then create a dictionary, where the keys are the columns' names and the values are a list of the values

the values can be filled by iterating over the examples you have

then used pd.DataFrame(data)