
I have the following dataframe, and I want to replace NaN with a certain value, say 0.0001, but only where there is a non-missing value somewhere to the right of the NaN.

ID 2021_12        2021_09     2021_06    2021_03      2020_12     2020_09
A 0.020637713       nan         nan         nan         nan         nan
B 0.020637717   0.020637717 0.020637717 0.020637717     nan     0.007053725
C 0.020637723   0.020637723 0.020637723 0.020637723 0.020637723 0.020637723
D 0.020637729       nan         nan         nan         nan         nan
E 0.020637733   0.02504525  0.02504525      nan         nan         nan     
F 0.020637735   0.020637735 0.020637735     nan         nan         nan 
G 0.020637735       nan         nan     0.017617495     nan         nan

desired output

ID 2021_12        2021_09     2021_06    2021_03      2020_12     2020_09
A 0.020637713       nan         nan         nan         nan         nan
B 0.020637717   0.020637717 0.020637717 0.020637717    0.0001   0.007053725
C 0.020637723   0.020637723 0.020637723 0.020637723 0.020637723 0.020637723
D 0.020637729       nan         nan         nan         nan         nan
E 0.020637733   0.02504525  0.02504525      nan         nan         nan     
F 0.020637735   0.020637735 0.020637735     nan         nan         nan 
G 0.020637735      0.0001      0.0001   0.017617495     nan         nan
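For anyone trying to reproduce this, the frame can be built like so (a minimal sketch; values and column names are taken from the table above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': list('ABCDEFG'),
    '2021_12': [0.020637713, 0.020637717, 0.020637723, 0.020637729,
                0.020637733, 0.020637735, 0.020637735],
    '2021_09': [np.nan, 0.020637717, 0.020637723, np.nan,
                0.02504525, 0.020637735, np.nan],
    '2021_06': [np.nan, 0.020637717, 0.020637723, np.nan,
                0.02504525, 0.020637735, np.nan],
    '2021_03': [np.nan, 0.020637717, 0.020637723, np.nan,
                np.nan, np.nan, 0.017617495],
    '2020_12': [np.nan, np.nan, 0.020637723, np.nan,
                np.nan, np.nan, np.nan],
    '2020_09': [np.nan, 0.007053725, 0.020637723, np.nan,
                np.nan, np.nan, np.nan],
})
print(df)
```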
Sunderam Dubey
Olive
  • Have you tried [`df.isna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html#pandas.DataFrame.isna)? How about [`df.fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)? Maybe a solution here will help: https://stackoverflow.com/questions/29530232/how-to-check-if-any-value-is-nan-in-a-pandas-dataframe. – Raisin Apr 28 '22 at 13:43

1 Answer


Use `DataFrame.mask`: back-fill the missing values along the rows and test for non-missing values (which is True wherever a value exists at or to the right of a cell), then chain that with a test for missing values in the original frame:

import numpy as np

#if missing values are the string 'nan' rather than real NaN
df = df.replace('nan', np.nan)

df = df.mask(df.bfill(axis=1).notna() & df.isna(), 0.0001)
print(df)
  ID   2021_12   2021_09   2021_06   2021_03   2020_12   2020_09
0  A  0.020638       NaN       NaN       NaN       NaN       NaN
1  B  0.020638  0.020638  0.020638  0.020638  0.000100  0.007054
2  C  0.020638  0.020638  0.020638  0.020638  0.020638  0.020638
3  D  0.020638       NaN       NaN       NaN       NaN       NaN
4  E  0.020638  0.025045  0.025045       NaN       NaN       NaN
5  F  0.020638  0.020638  0.020638       NaN       NaN       NaN
6  G  0.020638  0.000100  0.000100  0.017617       NaN       NaN
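To see why the combined mask works, the two boolean tests can be inspected on a single example row (a quick sketch with assumed column names matching the question):

```python
import numpy as np
import pandas as pd

row = pd.DataFrame(
    [[0.02, np.nan, np.nan, 0.017, np.nan, np.nan]],
    columns=['2021_12', '2021_09', '2021_06', '2021_03', '2020_12', '2020_09'])

has_value_to_right = row.bfill(axis=1).notna()  # True where a value exists here or to the right
is_missing = row.isna()                         # True where the cell itself is NaN
fill_here = has_value_to_right & is_missing     # only NaNs that have a value to their right

print(fill_here.to_numpy())
# the two NaNs before 0.017 are flagged; the trailing NaNs are not
```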

A faster alternative: test for non-missing values, reverse the column order, take a cumulative sum and check whether it is greater than 0 (i.e. whether any value exists at or to the right of each cell), then reverse back to the original order in numpy:

import numpy as np

#if missing values are the string 'nan' rather than real NaN
df = df.replace('nan', np.nan)

m = df.notna()
#reverse columns, cumulative-sum the True values, then reverse back:
#a cell is flagged if any non-missing value exists at or to the right of it
m1 = (np.cumsum(m.to_numpy()[:, ::-1], axis=1) > 0)[:, ::-1]
df = df.mask(m1 & ~m, 0.0001)
print(df)

  ID   2021_12   2021_09   2021_06   2021_03   2020_12   2020_09
0  A  0.020638       NaN       NaN       NaN       NaN       NaN
1  B  0.020638  0.020638  0.020638  0.020638  0.000100  0.007054
2  C  0.020638  0.020638  0.020638  0.020638  0.020638  0.020638
3  D  0.020638       NaN       NaN       NaN       NaN       NaN
4  E  0.020638  0.025045  0.025045       NaN       NaN       NaN
5  F  0.020638  0.020638  0.020638       NaN       NaN       NaN
6  G  0.020638  0.000100  0.000100  0.017617       NaN       NaN
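As a sanity check, the cumulative-sum mask flags exactly the same cells as the bfill-based one (sketched here on a small purely numeric frame):

```python
import numpy as np
import pandas as pd

vals = pd.DataFrame([[1.0, np.nan, np.nan, 2.0, np.nan],
                     [np.nan, 3.0, np.nan, np.nan, np.nan]])

m = vals.notna()
m1 = (np.cumsum(m.to_numpy()[:, ::-1], axis=1) > 0)[:, ::-1]

# both masks mark "missing, but with a value somewhere to the right"
assert (pd.DataFrame(m1, index=m.index, columns=m.columns) & ~m).equals(
    vals.bfill(axis=1).notna() & vals.isna())

out = vals.mask(m1 & ~m, 0.0001)
print(out)
```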
jezrael
  • do you know why `bfill` takes a long time on a big dataset? My dataset is 15m x 26, and it is taking forever. – Olive Apr 28 '22 at 13:10