0

I export a Postgres SQL query to create the following Pandas data frame df:

df= pd.DataFrame({
    'employee_id' : [123, 456, 789],
    'sales' : [{'foo':2, 'bar': 0, 'baz': 1},
               {'foo':3, 'bar': 1, 'baz': 2},
               {'foo':7, 'bar': 0, 'baz': 4}]
})

df
 
    employee_id sales
0   123 {'foo': 2, 'bar': 0, 'baz': 1}
1   456 {'foo': 3, 'bar': 1, 'baz': 2}
2   789 {'foo': 7, 'bar': 0, 'baz': 4}

The data types are:

df.dtypes

employee_id     int64
sales          object
dtype: object

I would like to split the sales column of dictionaries into separate columns by the keys, as follows:

    employee_id   foo    bar   baz
0   123           2      0     1
1   456           3      1     2
2   789           7      0     4

Following the advise in Split / Explode a column of dictionaries into separate columns with pandas, I do so using the following:

df2 = pd.json_normalize(df['sales'])

Which gives:

df2.head()

0
1
2
3
4

When I apply this approach to non-production data, it works.

How do I split the sales column into 3 separate columns?

Thanks!

zeroes_ones
  • 171
  • 7

1 Answers1

0

You could do the following thing:

Use the function:

def flatten_nested_json_df(df):
    df = df.reset_index()
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()
    
    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace

        for col in list_columns:
            print(f"exploding: {col}")
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
    return df

This function works on any nested df and explodes all columns:

flatten_nested_json_df(df)

which returns:

   index  employee_id  sales.foo  sales.bar  sales.baz
0      0          123          2          0          1
1      1          456          3          1          2
2      2          789          7          0          4