1

I have a df that looks like this

            COL1   COL2    COL3                             
0         ABC      N       [{u'PERF_PCT': 0.2,u'PERF_DATE': 12},{u'PERF_PCT': 0.3,u'PERF_DATE': 13}]
1         XYZ      N       [{u'PERF_PCT': 0.6,u'PERF_DATE': 12},{u'PERF_PCT': 0.2,u'PERF_DATE': 13},{u'PERF_PCT': 0.7,u'PERF_DATE': 14}] 

I need to unpack column 3 such that the dataframe has additional rows for each dictionary in the list in COL3 and additional columns for each key in the dict. The key-value pairs in the dict are fixed.

          COL1    COL2     PERF_PCT       PERF_DATE              
0         ABC      N       0.2            12
1         ABC      N       0.3            13
2         XYZ      N       0.6            12 
3         XYZ      N       0.2            13
4         XYZ      N       0.7            14 

I have been able to accomplish the same using a for loop but I need something extremely performant. The resultant df can have around 170k records and currently using the for loop it takes about 20+ seconds which is unacceptable. I am hoping using pandas specific apply or other functions can make this faster but have not been able to do so. If there's an extremely fast way to do this I'll be really grateful. Thanks.

Fizi
  • 1,749
  • 4
  • 29
  • 55

1 Answers1

5

Try this :)

Idx=df.set_index(['COL1','COL2']).COL3.apply(pd.Series).stack().index

pd.DataFrame(df.set_index(['COL1','COL2']).COL3.apply(pd.Series).stack().values.tolist(),index=Idx).reset_index().drop('level_2',1)

Out[318]: 
  COL1 COL2  PERF_DATE  PERF_PCT
0  ABC    N         12       0.2
1  ABC    N         13       0.3
2  XYZ    N         12       0.6
3  XYZ    N         13       0.2
4  XYZ    N         14       0.7

Data Input

df = pd.DataFrame({'COL1':['ABC','XYZ'],
              'COL2': ['N','N'],
               'COL3' :[[{u'PERF_PCT': 0.2,u'PERF_DATE': 12},{u'PERF_PCT': 0.3,u'PERF_DATE': 13}],[{u'PERF_PCT': 0.6,u'PERF_DATE': 12},{u'PERF_PCT': 0.2,u'PERF_DATE': 13},{u'PERF_PCT': 0.7,u'PERF_DATE': 14}]]   })
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Hi. Thanks for the help. Is this dependent on the number of columns in df. Iam getting ValueError: Length mismatch: Expected axis has 2 elements, new values have 5 elements. I have more columns than just COL1 and COL2. And there are 5 key-value pairs in the dictionary – Fizi Sep 14 '17 at 15:19
  • Nevermind! It works great. Only thing that would chnage is the column level_* Is it possible to go a little in depth into what you are doing. Its really cool and I would love to understand it – Fizi Sep 14 '17 at 15:24
  • @Fizi I called it unlistify or unnest question , you can check here,https://stackoverflow.com/questions/45885143/explode-lists-with-different-lengths-in-pandas/45885337#comment78730894_45885337 – BENY Sep 14 '17 at 16:45
  • what if any of the columns that I don't want to expand are a list? Then I get `TypeError: unhashable type: 'list'` – Nickpick Jul 24 '19 at 10:00