Flatten json to get multiple columns in Pandas

Question

I have a sample dataframe as

sample_df = pd.DataFrame({'id': [1, 2], 'fruits' :[
    [{'name': u'mango', 'cost': 100, 'color': u'yellow', 'size': 12}],
    [{'name': u'mango', 'cost': 150, 'color': u'yellow', 'size': 21},
     {'name': u'banana', 'cost': 200, 'color': u'green', 'size': 10} ]
]})

I would like to flatten the fruits column to get new columns like name, cost, color and size. One id can have more than 1 fruit entry. For example id 2 has information for 2 fruits mango and banana

print(sample_df)

                                              fruits  id
0  [{'name': 'mango', 'cost': 100, 'color': 'yell...   1
1  [{'name': 'mango', 'cost': 150, 'color': 'yell...   2

In the output I would like to have 3 records, 1 record with fruit information for id 1 and 2 records for fruit information for id 2

Is there a way to parse this structure using pandas ?

BENY · Accepted Answer · 2019-04-30T03:15:04.357

2

First unnesting your columns , then concat the values after called DataFrame

s=unnesting(sample_df,['fruits']).reset_index(drop=True)

df=pd.concat([s.drop('fruits',1),pd.DataFrame(s.fruits.tolist())],axis=1)
df
Out[149]: 
   id   color  cost    name  size
0   1  yellow   100   mango    12
1   2  yellow   150   mango    21
2   2   green   200  banana    10

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')

Method 2

sample_df.set_index('id').fruits.apply(pd.Series).stack().apply(pd.Series).reset_index(level=0)
Out[159]: 
   id   color  cost    name  size
0   1  yellow   100   mango    12
0   2  yellow   150   mango    21
1   2   green   200  banana    10

edited Apr 30 '19 at 03:15

answered Apr 30 '19 at 02:58

BENY

317,841
20
164
234

thank you for the answer. When i apply this to real data, which is messy, it returns me an error - `ValueError: zero-dimensional arrays cannot be concatenated` . I dont know how to best resolve this. – Regressor Apr 30 '19 at 03:10
@Regressor filter before running`unnesting` `sample_df =sample_df [sample_df.fruits.astype(bool)].copy()` – BENY Apr 30 '19 at 03:12
@Regressor ok , let us try the another way , I will update – BENY Apr 30 '19 at 03:13
okay let me try running that new method on the actual dataset. – Regressor Apr 30 '19 at 03:16
when i try the other method on actual dataset, it returns me the `id` and the json as is. – Regressor Apr 30 '19 at 03:20
@Regressor try both my method with the sample dataframe you provided to us , if that work , pleaqse find the difference between this and your real data (**Your question can not reproduced your problem**) – BENY Apr 30 '19 at 03:28
in Method 2, does `set_index()` require an `int` datatype? – Regressor Apr 30 '19 at 03:36
@Regressor nope – BENY Apr 30 '19 at 03:38

Flatten json to get multiple columns in Pandas

1 Answers1