2

Hi I want to create a DataFrame from a list of dicts where the items are lists. When the items are scalars, see test below, the call to pd.DataFrame works as expected:

test = [{'points': 40, 'time': '5:00', 'year': 2010}, 
{'points': 25, 'time': '6:00', 'month': "february"}, 
{'points':90, 'time': '9:00', 'month': 'january'}, 
{'points_h1':20, 'month': 'june'}]

pd.DataFrame(test)

    month    points  points_h1  time    year
0   NaN      40.0    NaN        5:00    2010.0
1   february 25.0    NaN        6:00    NaN
2   january  90.0    NaN        9:00    NaN
3   june      NaN    20.0        NaN    NaN

However, if the items are lists themselves, I get what seems to be an unexpected result:

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]}, 
{'points': [25], 'time': ['6:00'], 'month': ["february"]}, 
{'points':[90], 'time': ['9:00'], 'month': ['january']}, 
{'points_h1': [20], 'month': ['june']}]

pd.DataFrame(test)

        month      points   points_h1          time            year
   0    NaN      [40, 50]   NaN         [5:00, 4:00]    [2010, 2011]
   1    february       25   NaN                 6:00             NaN
   2    january        90   NaN                 9:00             NaN
   3    june          NaN   20.0                 NaN             NaN

To solve this, I use: pd.concat([pd.DataFrame(z) for z in test]), but this is relatively slow because you have to create a new dataframe for each element in the list, which requires significant overhead. Am I missing something?

Ryan Erwin
  • 807
  • 1
  • 11
  • 30

2 Answers2

0

Although possible within pandas itself, it appears to be less difficult using Python, at least if you have the raw data.

import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]}, {'points': [25], 'time': ['6:00'], 'month': ["february"]}, {'points':[90], 'time': ['9:00'], 'month': ['january']}, {'points_h1': [20], 'month': ['june']}]

newtest = []
for t in test:
    newtest.extend([{k:v for (k,v) in zip(t.keys(),values)} for values in zip(*t.values())])

df = pd.DataFrame(newtest)
print (df)

Result:

      month  points  points_h1  time    year
0       NaN    40.0        NaN  5:00  2010.0
1       NaN    50.0        NaN  4:00  2011.0
2  february    25.0        NaN  6:00     NaN
3   january    90.0        NaN  9:00     NaN
4      june     NaN       20.0   NaN     NaN
Bill Bell
  • 21,021
  • 5
  • 43
  • 58
0

With pandas it is possible to use a combination of methods to get your data, but as you found out it can get quite heavy. My recommendation is pad your data before passing onto pandas:

import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
 {'month': ['february'], 'points': [25], 'time': ['6:00']},
 {'month': ['january'], 'points': [90], 'time': ['9:00']},
 {'month': ['june'], 'points_h1': [20]}]

def pad_data(data):

    # Set a dictionary with all the keys
    result = {k:[] for i in data for k in i.keys()}

    for i in data:

        # Determine the longest value as padding for NaNs
        pad = max([len(j) for j in i.values()])

        # Create padding dictionary and update current
        padded = {key: [pd.np.nan]*pad for key in result.keys() if key not in i.keys()}
        i.update(padded)

        # Finally extend to result dictionary
        for key, val in i.items():
            result[key].extend(val)

    return result

# Padded data looks like this:
#
# {'month': [nan, nan, 'february', 'january', 'june'],
#  'points': [40, 50, 25, 90, nan],
#  'points_h1': [nan, nan, nan, nan, 20],
#  'time': ['5:00', '4:00', '6:00', '9:00', nan],
#  'year': [2010, 2011, nan, nan, nan]}

df = pd.DataFrame(pad_data(test), dtype='O')
print(df)

#       month points points_h1  time  year
# 0       NaN     40       NaN  5:00  2010
# 1       NaN     50       NaN  4:00  2011
# 2  february     25       NaN  6:00   NaN
# 3   january     90       NaN  9:00   NaN
# 4      june    NaN        20   NaN   NaN
r.ook
  • 13,466
  • 2
  • 22
  • 39