6

I have a dataframe df with NaN values and I want to dynamically replace them with the average values of previous and next non-missing values.

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

For example, A[3] is NaN so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] then should be (-0.454142-0.788073)/2 = -0.621108.

Therefore, the result dataframe should look like:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325  1.533582
4 -0.621108 -1.319834  0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858  1.033826 -2.551718
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

Is this a good way to deal with the missing values? I can't simply replace them by the average values of each column because my data is time-series and tends to increase over time. (The initial value may be $0 and final value might be $100000, so the average is $50000 which can be much bigger/smaller than the NaN values).

Jun Seong Jang
    Are you really attached to the formula you gave to fill `NaN`, or do you just want values close to the ones before and after? Try `df.interpolate()`: it will fill each `NaN` with a value close to the one you are looking for, but not the exact value your formula gives. – Ben.T Jul 27 '18 at 14:03
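
For comparison, here is a minimal sketch of what `df.interpolate()` does on column A from the question. With the default linear method the missing values are spaced evenly between the neighbouring valid values, so the results differ from the repeated-average formula in the question (row 3 does not come out as -0.454142). Only column A is reproduced here, and the variable name `filled` is just for illustration.

```python
import numpy as np
import pandas as pd

# Column A from the question, with the two missing values at rows 3 and 4
df = pd.DataFrame({"A": [-0.166919, -0.297953, -0.120211, np.nan, np.nan,
                         -0.788073, -0.916080, -0.887858, 1.948430, 0.019698]})

# Default method="linear": missing values are spaced evenly between
# the nearest valid neighbours (-0.120211 and -0.788073 here)
filled = df.interpolate()
print(filled.loc[2:5, "A"])
```

Whether even spacing or the repeated average is preferable depends on the data; for a series that trends steadily upward, linear interpolation is often the simpler choice.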

3 Answers

2

The logic behind your repeated averaging is a geometric progression, so the fill values can be computed in closed form:

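import pandas as pd
s = df.isnull().cumsum()  # running count of NaNs per column (assumes one NaN run per column)
# x: last valid value before the NaN run in each column
t1 = df[(s == 1).shift(-1, fill_value=False)].stack().dropna().reset_index(level=0, drop=True)
# y: first valid value after the NaN run (df.lookup was removed in pandas 2.0)
t2 = pd.Series({col: df.at[idx + 1, col] for col, idx in s.idxmax().items()})
df.fillna(t1 / (2 ** s) + t2 * (1 - 0.5 ** s))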
Out[212]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325  1.533582
4 -0.621107 -1.319834  0.461821
5 -0.788073 -0.966089 -1.260201
6 -0.916080 -0.612343 -2.121213
7 -0.887858  1.033826 -2.551718
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

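Explanation (x is the last valid value before the gap, y is the first valid value after it):

1st NaN: x/2 + y/2 = 1st

2nd NaN: 1st/2 + y/2 = 2nd

3rd NaN: 2nd/2 + y/2 = 3rd

So the n-th NaN is x/(2**n) + (y/2)*(1-(1/2)**n)/(1-1/2) = x/(2**n) + y*(1-(1/2)**n). This is the key.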
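
As a quick sanity check, the sketch below compares the step-by-step averaging with the closed form x/(2**n) + y*(1-(1/2)**n), using the values around the gap in column A. The function names are hypothetical and only serve this illustration.

```python
def fill_by_recurrence(x, y, n):
    """Apply a_k = (a_{k-1} + y) / 2 n times, starting from a_0 = x."""
    value = x
    for _ in range(n):
        value = (value + y) / 2
    return value


def fill_closed_form(x, y, n):
    """Closed form of the same recurrence: x / 2**n + y * (1 - 0.5**n)."""
    return x / 2 ** n + y * (1 - 0.5 ** n)


x, y = -0.120211, -0.788073  # values just before and after the gap in column A
for n in (1, 2, 3):
    assert abs(fill_by_recurrence(x, y, n) - fill_closed_form(x, y, n)) < 1e-12

print(round(fill_closed_form(x, y, 1), 6))  # -0.454142
print(round(fill_closed_form(x, y, 2), 6))
```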

BENY
  • Would this work if you had another group of `NaN` in the same column, for example if rows 7 and 8 in column A were also `NaN`? I know it's not in the question, just wondering :) – Ben.T Jul 27 '18 at 14:25
0

I had a similar problem. The following code worked for me.

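import numpy as np

def fill_nan_with_mean_from_prev_and_next(df):
    # Positions of the rows that contain at least one NaN
    nan_rows = np.flatnonzero(df.isnull().any(axis=1))
    null_df = df.isnull()
    for row in nan_rows:
        for column in range(df.shape[1]):
            if null_df.iloc[row, column]:
                # Mean of the rows directly above and below. Assumes the NaN is not in
                # the first or last row; stays NaN if the row below is also NaN.
                df.iloc[row, column] = (df.iloc[row - 1, column] + df.iloc[row + 1, column]) / 2

    return df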

Maybe it helps someone else too.

Björn P
0

As Ben.T mentioned above, if you have another group of NaN in the same column, you can consider this lazy solution :)

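import numpy as np

for column in df:
    previous = np.nan  # stays NaN until the first valid value is seen
    for ind, value in df[column].items():
        if not np.isnan(value):
            previous = value
        else:
            # Scan forward to the next non-missing value (assumes one exists
            # and that the index is the default 0..n-1 range)
            indx = ind + 1
            while np.isnan(df.loc[indx, column]):
                indx += 1
            next_value = df.loc[indx, column]
            # Fill in place and carry the filled value forward as the new "previous"
            previous = df.loc[ind, column] = (previous + next_value) / 2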
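
The same idea can be wrapped in a function that avoids the inner `while` scan. This is only a sketch under assumptions: the name `fill_with_running_mean` and the `demo` data are made up for illustration, all columns are assumed to be numeric, and a `NaN` with no valid value before or after it is left as `NaN`. It uses `Series.bfill()` to look up the next non-missing value for every position.

```python
import numpy as np
import pandas as pd


def fill_with_running_mean(df):
    """Fill each NaN with the mean of the value above it (possibly just
    filled) and the next non-missing value below it, column by column."""
    out = df.copy()
    for column in out:
        values = out[column].to_numpy(dtype=float, copy=True)
        # For every position, the next non-missing value at or after it
        next_valid = out[column].bfill().to_numpy(dtype=float)
        previous = np.nan
        for i in range(len(values)):
            if np.isnan(values[i]):
                # NaN propagates if there is no earlier or later valid value
                values[i] = (previous + next_valid[i]) / 2
            previous = values[i]
        out[column] = values
    return out


# Hypothetical input with two separate gaps in one column
demo = pd.DataFrame({"A": [1.0, np.nan, np.nan, 5.0, np.nan, 9.0]})
result = fill_with_running_mean(demo)
print(result["A"].tolist())  # [1.0, 3.0, 4.0, 5.0, 7.0, 9.0]
```

Because the function works on a copy, the original dataframe keeps its `NaN` values, which makes it easy to compare the result with `df.interpolate()`.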