1

I have a dataframe that has a column called actions with a list of dictionaries. The format is {source:int, action:string} and i need to parse it into new columns for each, but the number of records in each action cell is variable.

the data looks like this:

|Id  |action                                                     |
|1   |[{"E": 4, "action": "views"}, {"A": 58, "action": "views"}]|
|2   |[{"A": 74, "action": "clicks"}]                            |

and I would like it to look like this:

|Id|Source|Value|Action|
|1 |E     |4    |views |
|1 |A     |58   |views |
|2 |A     |74   |clicks|

The number of dictionaries in the action column can be up to 10

I've tried a few solutions like this one pandas DataFrame: normalize one JSON column and merge with other columns

but it tells me that DataFrame was not called properly for the first solution and that str has no attribute value for the second one. Even beyond that it's not quite the solution i need because i need to rename a column source and put the A/E/etc value in it.

DBA108642
  • 1,995
  • 1
  • 18
  • 55
  • Why do you have a df with dictionaries in it in the first place? Are you in control of that step? This problem is better-fixed upstream – roganjosh May 11 '19 at 15:00
  • I don't have control over it, the data was supplied as a csv with the columns like that. – DBA108642 May 11 '19 at 15:01
  • I'd use the `csv` module and try re-jig the data with that before creating a dataframe tbh – roganjosh May 11 '19 at 15:03

3 Answers3

2

I will using unnesting

yourdf=unnesting(df,['action']).reset_index(drop=True)
s=pd.DataFrame(yourdf.action.tolist()).reset_index().melt(['action','index']).dropna().set_index('index')

yourdf=pd.concat([yourdf,s],axis = 1)
yourdf
                          action  ID  action variable  value
0    {'E': 4, 'action': 'views'}   1   views        E    4.0
1   {'A': 58, 'action': 'views'}   1   views        A   58.0
2  {'A': 74, 'action': 'clicks'}   2  clicks        A   74.0

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')
BENY
  • 317,841
  • 20
  • 164
  • 234
  • This looks good, but im getting the following error: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe' – DBA108642 May 11 '19 at 16:01
1

I did mine the long way but it should give you the output you desire. I just unnest the dictionaries inside the column of the dataframe. This only works if the dictionaries only take the format you show in the example dataframe.

import pandas as pd

df = pd.DataFrame({'id': [1,2],
                   'action': [[{"E": 4, "action": "views"}, 
                               {"A": 58, "action": "views"}], 
                              [{"A": 74, "action": "clicks"}]]}
                 )

df_list = []
cols = ['id', 'source', 'value', 'action']
for x in df.index:
    for actions in df.loc[x, 'action']:
        row = []
        row.append(df.loc[x,'id'])
        for k,v in actions.items():
            if k == 'action':
                row.append(v)
            else:
                row.append(k)
                row.append(v)    
        df_list.append(row)
test = pd.DataFrame(df_list, columns=cols)

Which gives this output

   id source  value  action
0   1      E      4   views
1   1      A     58   views
2   2      A     74  clicks
Matthew Barlowe
  • 2,229
  • 1
  • 14
  • 24
  • This works with the sample data but when I apply it to my whole dataset i get the error "'str' has no attribute 'items'" on the line for k,v in actions.items(): – DBA108642 May 11 '19 at 16:22
  • It means there’s an action column roe that isn’t a list of dictionaries. The code assumes all the data will be in the format in the example – Matthew Barlowe May 11 '19 at 16:25
  • Right, but looking at my data everything is a list of dictionaries. – DBA108642 May 11 '19 at 16:44
1

Just rebuild your DataFrame data using list comprehension:

UPDATE: so the action column is a string field, not a Python data structure. You can pre-process it using json.loads which will be less error-prone in terms of parsing data structures from a string.

import json

# original dataframe
df = pd.DataFrame([
        (1, '[{"E": 4, "action": "views"}, {"A": 58, "action": "views"}]') 
      , (2, '[{"A": 74, "action": "clicks"}]') 
    ], 
    columns=['Id', 'action']
)

# new dataframe
df_new = pd.DataFrame(
    [ (id, k, v, i['action']) for id,act in zip(df.Id, df.action.apply(json.loads))
                              for i in act
                              for k,v in i.items() if k != 'action' 
    ],
    columns=['Id', 'Source', 'Value', 'Action']
)

print(new_df)
#   Id Source  Value  Action
#0   1      E      4   views
#1   1      A     58   views
#2   2      A     74  clicks
jxc
  • 13,553
  • 4
  • 16
  • 34
  • Im still getting the error str object has no attribute items which makes no sense because all of the data is in that format – DBA108642 May 11 '19 at 18:35
  • @DBA108642, did not notice that action is a string field, you can parse it using `json.loads` before processing the Python data structure. – jxc May 11 '19 at 20:31
  • well it's technically an object type. i've tried doing a json.loads on the column but it throws the error that you can't pass a series to that function – DBA108642 May 11 '19 at 21:03
  • @DBA108642, using `df.action.apply(json.loads)`, can you try my updated answer. – jxc May 11 '19 at 21:06
  • I literally love you so gd much – DBA108642 May 11 '19 at 21:15