1

i have this dataframe that have row of each key*id , i want to explode it to id,key1,key2 and remove duplicate rows and keep data_field , i am working with python2.7 but i would glad to a solution that will work both for python2.7 and python3.7

dataframe i have:

import  pandas as pd
d = {'id': [111, 222, 222, 333, 333], 'key': ['key1', 'key2','key1','key2','key1'], 'value':[1,1,2,3,3],'data_field':['dummy1','dummy1','dummy2','dummy3','dummy2']}
df = pd.DataFrame(data=d)
print df[['id','key','value','data_field']].to_string(index=False)

        id   key  value data_field
         111  key1      1     dummy1
         222  key2      1     dummy1
         222  key1      2     dummy2
         333  key2      3     dummy3
         333  key1      3     dummy2

dataframe i want it to be transformed to:

d = {'id': [111, 222, 333], 'key1': [1, 2, 3],'key2':[pd.np.nan,1,3] , 'data_field': ['dummy1', 'dummy2', 'dummy3']}
df = pd.DataFrame(data=d)
print df[['id', 'key1', 'key2', 'data_field']].to_string(index=False)
          
    id  key1  key2 data_field
         111     1   NaN     dummy1
         222     2   1.0     dummy2
         333     3   3.0     dummy3

tried as suggested here

df.pivot(index='id', columns='key', values='value').join(df.drop_duplicates('id')['data_field'])
and got  :
            key1  key2 data_field
id                        
111   1.0   NaN        NaN
222   2.0   1.0        NaN
333   3.0   3.0        NaN

data_field was not kept and id is now index and not column

Omer Anisfeld
  • 1,236
  • 12
  • 28

1 Answers1

0

Use DataFrame.pivot with DataFrame.join only first duplicated rows in data_field per id:

df = (df.pivot(index='id', columns='key', values='value')
       .join(df.set_index('id')['data_field'].drop_duplicates())
       .reset_index())

print (df)
    id  key1  key2 data_field
0  111   1.0   NaN     dummy1
1  222   2.0   1.0     dummy2
2  333   3.0   3.0     dummy3

Another idea for first data from data_field to new columns per id and key:

df = df.pivot_table(index='id',columns='key',values=['value','data_field'],aggfunc='first')
df.columns = df.columns.map('_'.join)
df = df.reset_index()

print (df)
    id data_field_key1 data_field_key2  value_key1  value_key2
0  111          dummy1             NaN         1.0         NaN
1  222          dummy2          dummy1         2.0         1.0
2  333          dummy2          dummy3         3.0         3.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252