0

Each column col1, col2 and col3 need to have either a forward or a backward fill conditional on column col4.

Say I have a dataframe like so:

df = pd.DataFrame({'col1':[1,np.nan,3, np.nan,5,np.nan], 
                     'col2':[7, np.nan, 9, np.nan, 11, np.nan], 
                     'col3':[13, 14, 15, 16, np.nan,18], 
                     'col4':[2015, 2015, 2015, 2016, 2016, 2018]}) 

   col1  col2  col3  col4
0   1.0   7.0  13.0  2015
1   NaN   NaN  14.0  2015
2   3.0   9.0  15.0  2015
3   NaN   NaN  16.0  2016
4   5.0  11.0   NaN  2016
5   NaN   NaN  18.0  2018

I am trying to implement this through groupby.

grouped = df.groupby('col4')

Then I loop over groups and if a group name meets a condition I do a fill (backward or forward) and update the dataframe.

for name, group in grouped:
       if name == 2015:
              df[df.col4==name][['col1', 'col2']] = grouped.get_group(name)[['col1', 'col2']].ffill(axis=0)
       elif name == 2016: 
              df[df.col4==name]['col1'] = grouped.get_group(name)['col1'].ffill(axis=0)
              df[df.col4==name][['col2', 'col3']] = grouped.get_group(name)[['col1', 'col2']].bfill(axis=0)
       else: 
              df[df.col4==name]['col1', 'col2', 'col3'] = grouped.get_group(name)['col1'].bfill(axis=0)



However this does not work and looks very lengthy.
This post looks similar.

I would appreciate any suggestions.

olyashevska
  • 427
  • 4
  • 18

2 Answers2

1

Following all advices, I solved this problem as following:

for name, group in grouped:
       if name == 2015:
              df.loc[df.col4==name, ['col1', 'col2']] = grouped.get_group(name)[['col1', 'col2']].ffill(axis=0)
       elif name == 2016: 
              df.loc[df.col4==name, ['col1']] = grouped.get_group(name)['col1'].ffill(axis=0)
              df.loc[df.col4==name, ['col2', 'col3']] = grouped.get_group(name)[['col1', 'col2']].bfill(axis=0)
       else: 
              df[df.col4==name, ['col1', 'col2', 'col3']] = grouped.get_group(name)['col1'].bfill(axis=0)
olyashevska
  • 427
  • 4
  • 18
0

Your question isnt clear. For instance you didnt account for 2018 in your loop. What do we do with it. If only 2015 and 2016 is what you need. Please Try

m=df.col4 ==2015#Boolean select
df.loc[m,'col1':'col3']=df.loc[m,'col1':'col3'].fillna(method='ffill')#Forward Fill
df.loc[~m,'col1':'col3']=df.loc[~m,'col1':'col3'].fillna(method='bfill').fillna(method='ffill')



col1  col2  col3  col4
0   1.0   7.0  13.0  2015
1   1.0   7.0  14.0  2015
2   3.0   9.0  15.0  2015
3   5.0  11.0  16.0  2016
4   5.0  11.0  18.0  2016
5   5.0  11.0  18.0  2018
wwnde
  • 26,119
  • 6
  • 18
  • 32
  • Thanks! You are right, I added an additional statement for 2018. In fact I have many values in col4, this is just an example. This is a survey, and observers were filling only starting or ending values in some years. This should explain background of the problem. Your answer is almost correct, if your make changes to account for different levels of grouped variable (a loop?), I will accept it as a correct one. Thanks for the help again! – olyashevska Nov 04 '20 at 07:49
  • Please Try `df[['col1', 'col2']]=df.groupby((df.col1.notna()|df.col2.notna()).cumsum())[['col1', 'col2']].fillna(method='ffill')` This is as you explained. What about the case in `col3`that is `NaN`. That seems not captured in your explanation. – wwnde Nov 04 '20 at 08:48