0

I'm currently trying to pull the value out of several dicts appearing in a series of columns, there are two issues:

Since there are 4 columns in question they were unpacked from a previous dict-in-column value via this line of code:

df = pd.concat([df.drop(['ids'], axis = 1), df['ids'].apply(pd.Series)], axis = 1)

What this dict was unpack a dict in a column of the form:

d = {'a': {'id': 12}, 'b': {'id': 13}, 'c': {'id': 14}, 'd': {'id': 15}}

The dict d being of length between 0-4.

Before unpacking the dataframe the column I unpacked looked like this:

       ids
406    {'a': {'id': '12'}}
408    None
409    {'a': {'id': '21'}, 'b': {'id': '23'}}
417    {'a': {'id': '53'}, 'b': {'id': '98'}, 'c': {'id': '45'}}
419    None 

After Unpacking it now has the form

        a                        b                         c
408     None                     {'id': '12'}       None
409     {'id': '32'}             {'id': '45'}       {'id': '36'}
417     {'id': '09'}             {'id': '31'}       None

While that initially solved my first problem, I'm now trying to pull the values out of columns that have the dictionaries in them, and I'm kind of at a loss for this.

Potential solutions I've tried are just running the snippet above for each column (a,b,c), however that is both ugly and inefficient. At most I know an easy fix would be to pd.json_normalize the initial dataframe when I first start my program, however that would require a significant fix and refactor for something that seems that it could be solved trivially. For reference the ideal output would be this:

        a                        b                  c
408     None                     12                 None
409     32                       45                 36
417     09                       31                 None

And the whole dataframe is several hundred thousand rows, with 20 columns that are in flux.

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Sebastian Goslin
  • 477
  • 1
  • 3
  • 22

2 Answers2

1

One option is to apply a customer function to each column

def my_func(val):
    if isinstance(val,dict):
        return val['id']
    else:
        return val

for col in df.columns:
    df[col]=df[col].apply(my_func)

    a       b   c
0   None    12  None
1   32      45  36
2   09      31  None
G. Anderson
  • 5,815
  • 2
  • 14
  • 21
1
import pandas as pd

# setup dataframe
data = {'ids': [{'a': {'id': '12'}}, None, {'a': {'id': '21'}, 'b': {'id': '23'}}, {'a': {'id': '53'}, 'b': {'id': '98'}, 'c': {'id': '45'}}, None]}
df = pd.DataFrame(data)

# display(df)
                                                         ids
0                                        {'a': {'id': '12'}}
1                                                       None
2                     {'a': {'id': '21'}, 'b': {'id': '23'}}
3  {'a': {'id': '53'}, 'b': {'id': '98'}, 'c': {'id': '45'}}
4                                                       None

# fill None with {}
df.ids = df.ids.fillna({i: {} for i in df.index})

# normalize the column
df = pd.json_normalize(df.ids).dropna(how='all')

# display(df)
  a.id b.id c.id
0   12  NaN  NaN
2   21   23  NaN
3   53   98   45
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158