27

I'm looking for a way to optimize my code.

I have entry data in this form:

import pandas as pn

a=[{'Feature1': 'aa1','Feature2': 'bb1','Feature3': 'cc2' },
 {'Feature1': 'aa2','Feature2': 'bb2' },
 {'Feature1': 'aa1','Feature2': 'cc1' }
 ]
b=['num1','num2','num3']


df= pn.DataFrame({'num':b, 'dic':a })

I would like to extract the element 'Feature3' from the dictionaries in column 'dic' (if it exists) in the data frame above. So far I have been able to solve it, but I don't know if this is the fastest way; it seems a little overcomplicated.

Feature3=[]
for idx, row in df['dic'].items():
    l=row.keys()

    if 'Feature3' in l:
        Feature3.append(row['Feature3'])
    else:
        Feature3.append(None)

df['Feature3']=Feature3
print(df)

Is there a better/faster/simpler way to extract this Feature3 into a separate column in the dataframe?

Thank you in advance for help.

michalk
  • 2
    There is no vectorised method to check for this, as you're storing non-scalar values in your df. This is ill-advised, as it makes filtering and lookups difficult, as you've found – EdChum Feb 29 '16 at 22:35

7 Answers

35

You can use a list comprehension to extract feature 3 from each row in your dataframe, returning a list.

feature3 = [d.get('Feature3') for d in df.dic]

If 'Feature3' is not a key in a given dictionary, dict.get returns None for that row by default.

You don't even need pandas, as you can again use a list comprehension to extract the feature from your original dictionary a.

feature3 = [d.get('Feature3') for d in a]
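
To make the comprehension concrete, here is a minimal sketch that rebuilds the question's data, assigns the extracted values to a new column, and shows the optional default argument of dict.get. The 'missing' placeholder and the name feature3_filled are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']
df = pd.DataFrame({'num': b, 'dic': a})

# dict.get returns None for a missing key, so no explicit membership test is needed.
df['Feature3'] = [d.get('Feature3') for d in df['dic']]

# A second argument to dict.get supplies a different placeholder
# ('missing' is an arbitrary choice for illustration).
feature3_filled = [d.get('Feature3', 'missing') for d in df['dic']]
```

The comprehension produces one value per row in row order, so the list can be assigned to the column directly.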
Alexander
  • this is certainly a very "pythonic" way to do it ... and outperforms the pandas solutions by an order of magnitude – maxymoo Feb 29 '16 at 23:00
19
df['Feature3'] = df['dic'].apply(lambda x: x.get('Feature3'))

Agree with maxymoo. Consider changing the format of your dataframe.

(Sidenote: pandas is generally imported as pd)

as133
16

If you apply a Series, you get a quite nice DataFrame:

>>> df.dic.apply(pn.Series)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

From this point, you can just use regular pandas operations.
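
As one possible follow-up, the expanded columns can be joined back onto the original frame. This is only a sketch: the names expanded and flat are hypothetical, and it uses the conventional pd alias rather than the question's pn.

```python
import pandas as pd

a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
df = pd.DataFrame({'num': ['num1', 'num2', 'num3'], 'dic': a})

# Expand each dictionary into its own columns; missing keys become NaN.
expanded = df['dic'].apply(pd.Series)

# Join the new columns back on the shared index and drop the dict column.
flat = df.drop(columns='dic').join(expanded)
```

Because apply preserves the index of df, the join lines the rows up without any extra key.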

Ami Tavory
6

There is now a vectorized-style method: you can use the str accessor.

df['dic'].str['Feature3']

Or with str.get:

df['dic'].str.get('Feature3')

output:

0     cc2
1    None
2    None
Name: dic, dtype: object
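
A small self-contained sketch of the accessor on dictionary values, including one nested lookup. The nested data and the key name 'inner' are made up for illustration; the original question has no nested dictionaries.

```python
import pandas as pd

# Hypothetical data: the first row nests a dictionary under 'Feature3',
# the second row has no 'Feature3' key at all.
s = pd.Series([
    {'Feature1': 'aa1', 'Feature3': {'inner': 'x'}},
    {'Feature1': 'aa2'},
])

# .str.get and .str[...] are equivalent ways to look up a key per row.
outer = s.str.get('Feature3')
same_outer = s.str['Feature3']

# Chaining the accessor reaches one level deeper; rows where the outer
# lookup found nothing stay missing.
inner = outer.str.get('inner')
```

The missing entries show up as None or NaN depending on the step, so pd.isna is the safer check than comparing against None.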
mozway
  • This is a simple solution that works great. Additional .str operations can be added in to access additional levels of a dictionary if needed, e.g. df['dic'].str['Feature3'].str['Feature_Within_Feature3'] – KBurchfiel Aug 22 '23 at 16:36
4

I think you can first create a new DataFrame with a list comprehension and then create the new column like this:

df1 = pd.DataFrame([x for x in df['dic']])
print df1
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

df['Feature3'] = df1['Feature3']
print df
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

Or one line:

df['Feature3'] = pd.DataFrame([x for x in df['dic']])['Feature3']
print df
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

Timings:

len(df) = 3:

In [24]: %timeit pd.DataFrame([x for x in df['dic']])
The slowest run took 4.63 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 596 µs per loop

In [25]: %timeit df.dic.apply(pn.Series)
1000 loops, best of 3: 1.43 ms per loop

len(df) = 3000:

In [27]: %timeit pd.DataFrame([x for x in df['dic']])
100 loops, best of 3: 3.16 ms per loop

In [28]: %timeit df.dic.apply(pn.Series)
1 loops, best of 3: 748 ms per loop
jezrael
1

I think you're thinking about the data structures slightly wrong. It's better to create the data frame with the features as columns from the start; pandas is actually smart enough to do this by default:

In [240]: pd.DataFrame(a)
Out[240]:
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

You would then add on your "num" column in a separate step, since the data is in a different orientation, either with

df['num'] = b

or

df = df.assign(num = b)

(I prefer the second option since it's got a more functional flavour).
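
Put together, the approach above might look like the following sketch. The filtering step at the end and the name has_feature3 are illustrative additions, shown only to indicate what becomes easy once the features are real columns.

```python
import pandas as pd

a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']

# Build the frame with one column per feature, then attach 'num'.
df = pd.DataFrame(a).assign(num=b)

# With features as columns, ordinary boolean indexing works:
# keep only the rows that actually have a 'Feature3' value.
has_feature3 = df[df['Feature3'].notna()]
```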

maxymoo
1

df = pd.concat([df, pd.DataFrame(list(df['dic']))], axis=1)

Then do whatever you want with the result. Wherever a key was missing from a dictionary, you will get NaN in that position.
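
One caveat worth sketching: pd.concat with axis=1 aligns rows by index label, and a DataFrame built from a plain list gets a fresh 0..n-1 index. If df has any other index, passing index=df.index keeps the rows aligned. The index values 10, 20, 30 below are an illustrative assumption, not from the question.

```python
import pandas as pd

a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']

# A non-default index (assumed here) to show why alignment matters.
df = pd.DataFrame({'num': b, 'dic': a}, index=[10, 20, 30])

# Reusing df.index keeps each expanded row next to its source row.
expanded = pd.DataFrame(list(df['dic']), index=df.index)
result = pd.concat([df, expanded], axis=1)
```

With the default RangeIndex from the question, the one-liner above already lines up and the extra argument is not needed.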

hk_03