2

I'm working with Python 3.6.9.

I'm stuck on a dataframe like that :

import pandas as pd
import numpy as np

dict_ = {'col1': [3.14, 28, -0.618, 1.159], 'col2': ['a_002_u', 'a_003_u', 'a_001_u', 'a_003_u'], 'a_001_u': [np.nan] * 4, 'a_002_u': [np.nan] * 4, 'a_003_u': [np.nan] * 4}

df = pd.DataFrame(dict_)

     col1     col2  a_001_u  a_002_u  a_003_u
0   3.140  a_002_u      NaN      NaN      NaN
1  28.000  a_003_u      NaN      NaN      NaN
2  -0.618  a_001_u      NaN      NaN      NaN
3   1.159  a_003_u      NaN      NaN      NaN

And I would like to get this result :

     col1     col2  a_001_u  a_002_u  a_003_u
0   3.140  a_002_u      NaN     3.14      NaN
1  28.000  a_003_u      NaN      NaN   28.000
2  -0.618  a_001_u   -0.618      NaN      NaN
3   1.159  a_003_u      NaN      NaN    1.159

In other words, I would like to fill columns 'a_001_u', 'a_002_u', and 'a_003_u' with 'col1' values based on column header in 'col2'.

It is quite easy to explain, but I have the impression that it is less obvious to set up. Does anyone have an idea to help me?

Ben.T
  • 29,160
  • 6
  • 32
  • 54
David
  • 450
  • 3
  • 14

3 Answers3

2

you can use fillna after change the shape of the first 2 column with set_index and unstack like:

df = df.fillna(df.set_index('col2', append=True)['col1'].unstack())
print (df)
     col1     col2  a_001_u  a_002_u  a_003_u
0   3.140  a_002_u      NaN     3.14      NaN
1  28.000  a_003_u      NaN      NaN   28.000
2  -0.618  a_001_u   -0.618      NaN      NaN
3   1.159  a_003_u      NaN      NaN    1.159

because actually doing the set_index and unstack does create the other columns you want, and the fillna will do a (row,column) filling of missing value

print(df.set_index('col2', append=True)['col1'].unstack())
col2  a_001_u  a_002_u  a_003_u
0         NaN     3.14      NaN
1         NaN      NaN   28.000
2      -0.618      NaN      NaN
3         NaN      NaN    1.159

NB: Same result could be obtain with pivot like df.pivot(columns='col2', values='col1')

Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • 1
    This solution works perfectly. And that's exactly what I was looking for. I did not know the association "set_index" and "unstack". Thanks a lot ! @Ben.T – David May 03 '20 at 18:49
  • 2
    @David it is equivalent to a pivot, you could get the same result with `df.pivot(columns='col2', values='col1')`, which could be faster actually ;) – Ben.T May 03 '20 at 18:54
  • 2
    Totally agree with @Ben.T , pivot is a good tool to use in this case given you have a a large number of rows. –  May 03 '20 at 19:01
0

You can write it by iterating through the rows.

for index, row in df.iterrows():
    row[row['col2']]=row['col1']
  • I have tens of thousands of data frames, and each one contains several thousand rows. So I'm looking for (if possible) a solution by vector calculation to be as fast as possible. I'm afraid that with a loop for it would take too long... Thanks for your help @overflow95 – David May 03 '20 at 18:43
  • 1
    I get an error 'TypeError: cannot do label indexing on with these indexers [3.14] of ' – Sowmya May 03 '20 at 18:51
  • Hi @Sowmya, thanks for noticing. It should be `row['col1']` instead of `row[row['col1']]` –  May 03 '20 at 19:00
  • @overflow95 You are welcome. I was wondering why the loop should take long? – Sowmya May 03 '20 at 19:13
  • 1
    @Sowmya for more information about `iterrows` vs `vectorization`, read this [answer](https://stackoverflow.com/a/55557758/9274732), it is really good :) – Ben.T May 03 '20 at 19:31
0
import pandas as pd
import numpy as np

dict_ = {'col1': [3.14, 28, -0.618, 1.159], 'col2': ['a_002_u', 'a_003_u', 'a_001_u', 'a_003_u'], 'a_001_u': [np.nan] * 4, 'a_002_u': [np.nan] * 4, 'a_003_u': [np.nan] * 4}

df = pd.DataFrame(dict_)
count = 0
for key in df['col2']:
    df[key][count] = df['col1'][count]
    count += 1

df
Sowmya
  • 91
  • 1
  • 9