Extract dictionary value from column in data frame

Question

I'm looking for a way to optimize my code.

I have input data in this form:

import pandas as pn

a=[{'Feature1': 'aa1','Feature2': 'bb1','Feature3': 'cc2' },
 {'Feature1': 'aa2','Feature2': 'bb2' },
 {'Feature1': 'aa1','Feature2': 'cc1' }
 ]
b=['num1','num2','num3']


df= pn.DataFrame({'num':b, 'dic':a })

I would like to extract the element 'Feature3' from the dictionaries in column 'dic' (if it exists) in the data frame above. So far I have been able to solve it, but I don't know if this is the fastest way; it seems a little overcomplicated.

Feature3=[]
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for idx, row in df['dic'].items():
    l=row.keys()

    if 'Feature3' in l:
        Feature3.append(row['Feature3'])
    else:
        Feature3.append(None)

df['Feature3']=Feature3
print(df)

Is there a better/faster/simpler way to extract this Feature3 into a separate column in the dataframe?

Thank you in advance for help.

There is no vectorised method to check for this, as you're storing non-scalar values in your df. This is ill-advised, as it makes filtering and lookups difficult, as you've found – EdChum, Feb 29 '16 at 22:35

Alexander · Accepted Answer · 2018-10-19T17:43:08.490

35

You can use a list comprehension to extract feature 3 from each row in your dataframe, returning a list.

feature3 = [d.get('Feature3') for d in df.dic]

If 'Feature3' is not a key in a given dictionary, d.get returns None by default.

You don't even need pandas, as you can again use a list comprehension to extract the feature from your original list of dictionaries, a.

feature3 = [d.get('Feature3') for d in a]
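To get the column the question asks for, the resulting list can be assigned straight back to the data frame. The following is a minimal sketch that rebuilds the question's data; the pd alias and the 'missing' fallback value are assumptions made for illustration.

```python
import pandas as pd

a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]
b = ['num1', 'num2', 'num3']
df = pd.DataFrame({'num': b, 'dic': a})

# dict.get returns None when the key is absent, so every row gets a value
df['Feature3'] = [d.get('Feature3') for d in df['dic']]

# dict.get also accepts a fallback; 'missing' is a hypothetical placeholder
df['Feature3_filled'] = [d.get('Feature3', 'missing') for d in df['dic']]
```

The list has one entry per row in row order, so the assignment lines up with the frame regardless of its index.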

edited Oct 19 '18 at 17:43

answered Feb 29 '16 at 22:57


this is certainly a very "pythonic" way to do it ... and outperforms the pandas solutions by an order of magnitude – maxymoo Feb 29 '16 at 23:00

score 19 · Answer 2 · answered Mar 01 '16 at 01:34


df['Feature3'] = df['dic'].apply(lambda x: x.get('Feature3'))

Agree with maxymoo. Consider changing the format of your dataframe.

(Sidenote: pandas is generally imported as pd)
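One way the apply call can fail is when some rows hold something other than a dictionary (for example None or NaN), because those values have no .get method. The sketch below guards against that case; the sample data and the get_feature helper are hypothetical, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: the second row holds None instead of a dictionary
df = pd.DataFrame({'num': ['num1', 'num2', 'num3'],
                   'dic': [{'Feature1': 'aa1', 'Feature3': 'cc2'},
                           None,
                           {'Feature1': 'aa1'}]})


def get_feature(value, key='Feature3'):
    """Return value[key] for dictionaries, None for anything else."""
    return value.get(key) if isinstance(value, dict) else None


df['Feature3'] = df['dic'].apply(get_feature)
```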


as133


Doesn't work for me; I used ['key_name'] to get the value – M. Mariscal Feb 26 '20 at 10:04

score 16 · Answer 3 · answered Feb 29 '16 at 22:42


If you apply a Series, you get a quite nice DataFrame:

>>> df.dic.apply(pn.Series)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

From this point, you can just use regular pandas operations.
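As an illustration of such follow-up operations, here is a small sketch using the question's data (with the pd alias, which is an assumption): it filters on the expanded Feature3 column and joins the expanded columns back onto the original frame.

```python
import pandas as pd

a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]
df = pd.DataFrame({'num': ['num1', 'num2', 'num3'], 'dic': a})

# One column per dictionary key; missing keys become NaN
expanded = df['dic'].apply(pd.Series)

# Keep only the rows that actually have a Feature3 value
with_feature3 = expanded[expanded['Feature3'].notna()]

# Attach the expanded columns to the original frame; join aligns on the index
combined = df.join(expanded)
```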


Ami Tavory


score 6 · Answer 4 · answered Apr 07 '22 at 12:18


There is now a vectorized method. You can use the str accessor:

df['dic'].str['Feature3']

Or with str.get:

df['dic'].str.get('Feature3')

output:

0     cc2
1    None
2    None
Name: dic, dtype: object


mozway


This is a simple solution that works great. Additional .str operations can be added in to access additional levels of a dictionary if needed, e.g. df['dic'].str['Feature3'].str['Feature_Within_Feature3'] – KBurchfiel Aug 22 '23 at 16:36
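The chained access described in this comment can be sketched as follows. The nested data and the key name 'inner' are hypothetical; rows where a level is absent end up with a missing value (None or NaN) rather than raising an error.

```python
import pandas as pd

# Hypothetical nested data: 'Feature3' itself maps to a dictionary
s = pd.Series([{'Feature3': {'inner': 'x1'}},
               {'Feature1': 'aa2'},
               {'Feature3': {'other': 'y3'}}])

level1 = s.str['Feature3']    # inner dictionaries, or a missing value
level2 = level1.str['inner']  # values one level deeper
```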

jezrael · Answer 5 · 2016-02-29T22:46:17.127

I think you can first create a new DataFrame with a list comprehension and then create the new column like this:

df1 = pd.DataFrame([x for x in df['dic']])
print(df1)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

df['Feature3'] = df1['Feature3']
print(df)
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

Or one line:

df['Feature3'] = pd.DataFrame([x for x in df['dic']])['Feature3']
print(df)
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

Timings:

len(df) = 3:

In [24]: %timeit pd.DataFrame([x for x in df['dic']])
The slowest run took 4.63 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 596 µs per loop

In [25]: %timeit df.dic.apply(pd.Series)
1000 loops, best of 3: 1.43 ms per loop

len(df) = 3000:

In [27]: %timeit pd.DataFrame([x for x in df['dic']])
100 loops, best of 3: 3.16 ms per loop

In [28]: %timeit df.dic.apply(pd.Series)
1 loops, best of 3: 748 ms per loop
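To rerun this comparison outside IPython, something like the following sketch with the standard timeit module could be used. The row count (300) and the number of repetitions are arbitrary assumptions, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]

# 300 rows of repeated sample data; change the multiplier to vary the size
df = pd.DataFrame({'dic': a * 100})


def via_dataframe():
    """Build the expanded frame from a list of dictionaries."""
    return pd.DataFrame([x for x in df['dic']])


def via_apply():
    """Build the expanded frame by converting each row to a Series."""
    return df['dic'].apply(pd.Series)


# Total seconds for 10 calls of each approach
t_dataframe = timeit.timeit(via_dataframe, number=10)
t_apply = timeit.timeit(via_apply, number=10)
print(f'DataFrame constructor: {t_dataframe:.4f} s, apply(pd.Series): {t_apply:.4f} s')
```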

score 1 · Answer 6 · answered Feb 29 '16 at 22:54

I think you're thinking about the data structures slightly wrong. It's better to create the data frame with the features as columns from the start; pandas is actually smart enough to do this by default:

In [240]: pd.DataFrame(a)
Out[240]:
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

You would then add on your "num" column in a separate step, since the data is in a different orientation, either with

df['num'] = b

or

df = df.assign(num = b)

(I prefer the second option since it's got a more functional flavour).
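Put together, the two steps could look like the sketch below, using the a and b lists from the question (the pd alias is an assumption, since the question imports pandas as pn).

```python
import pandas as pd

a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]
b = ['num1', 'num2', 'num3']

# Each dictionary becomes a row; keys become columns, missing keys become NaN.
# assign returns a new frame with the extra 'num' column.
df = pd.DataFrame(a).assign(num=b)
```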

score 1 · Answer 7 · answered Jan 12 '21 at 10:54


df = pd.concat([df, pd.DataFrame(list(df['dic']))], axis=1)

Then do whatever you want with the result. If a key is missing in one of the dictionaries, you will get NaN in that position.
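One caveat: pd.concat with axis=1 aligns rows on the index, and a frame built from a plain list gets a fresh 0..n-1 index. If df has any other index, passing index=df.index keeps the rows lined up. A minimal sketch with a hypothetical non-default index:

```python
import pandas as pd

a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]

# Hypothetical non-default index, to show why alignment matters
df = pd.DataFrame({'num': ['num1', 'num2', 'num3'], 'dic': a},
                  index=[10, 20, 30])

# Reuse df.index so that concat lines the rows up
expanded = pd.DataFrame(list(df['dic']), index=df.index)
result = pd.concat([df, expanded], axis=1)
```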


hk_03

