1

I have a dataframe where the data progresses in one-hour intervals, but one of the columns has a few NaN values. When I encounter a NaN, I would like the code to average the four hours prior to the NaN and then replace the NaN with that average.

I tried modifying the answer to this question: "pandas DataFrame: replace nan values with average of columns", but that example averages the whole column rather than part of it.

My Dataframe:

0     1        2    ...     6      7     8  
3   12:53    2.778  ...   -12.2  -16.7  69%         
4   11:53    3.611  ...    NaN   -17.2  73%        
5   10:53    2.778  ...   -13.7  -17.2  73%       
6   09:53    3.611  ...   -13.3  -17.2  73%        
7   08:53    2.778  ...   -12.8  -16.7  76%        
… 
74  13:53    0.278  ...    -15   -17.2  83%

Here is my modified attempt at averaging the last 4 values in the column:

for i in df.index:
    df[6].fillna(df[6].iloc[0:3].mean(), inplace=True)

This is averaging the first few rows of column 6, which makes sense given the slice, but I do not know how to make it average the 4 rows after the NaN instead.

Kyle C
  • What if there are no four prior rows for a NaN? What if there are NaNs in one of the four prior rows? Have you thought this through? – timgeb Jan 08 '19 at 22:41
  • That is a great question. I hadn't encountered having a NaN in the last few rows yet so I haven't gotten there. I think that I would then take an average of the four hours after the NaN if it occurs in one of the last rows. But then if there are multiple NaNs back to back I'm in trouble again. – Kyle C Jan 08 '19 at 22:45

3 Answers

3

You can use rolling with a window of n to get the rolling average over the current row and the previous n - 1 rows, and then fill your NaNs with those values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1':[1,2,3,4,5,6,np.nan,8,9,10]})
df['rollmean5'] = df['col1'].rolling(5,center=False,min_periods=1).mean()
df['col1'] = df['col1'].fillna(df['rollmean5'])

A rolling window of 5 covers the current row and the previous 4. Because mean() skips NaN, the value computed at a NaN row is the average of the 4 rows before it.

Output:

    col1    rollmean5
0   1.0     1.00
1   2.0     1.50
2   3.0     2.00
3   4.0     2.50
4   5.0     3.00
5   6.0     4.00
6   4.5     4.50
7   8.0     5.75
8   9.0     7.00
9   10.0    8.25

And obviously, you could drop the newly created rollmean5 when you are done.
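
As a variation on the approach above, here is a minimal sketch that excludes the current row explicitly by shifting before rolling, so each window is the four rows above a given position. The name prior4 is only illustrative, and min_periods=1 is an assumption about how to treat windows with fewer than four usable values (it averages whatever is available).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, np.nan, 8, 9, 10]})

# shift(1) moves every value down one row, so a window of 4 ending at row i
# covers rows i-4 .. i-1 and never includes row i itself.
prior4 = df['col1'].shift(1).rolling(4, min_periods=1).mean()

# Only the NaN positions are replaced; no helper column is added to df.
df['col1'] = df['col1'].fillna(prior4)
```

With this sample data the NaN at index 6 becomes 4.5, the same value the rollmean5 column gives, and there is no extra column to drop afterwards.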

Joe Patten
1

This should work? Not sure if there is something already built into pandas. Had to ignore the cases where there are not 4 preceding rows.

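for i in range(4, len(df)):  # rows without 4 preceding rows are skipped
    # fill only this row, and only if it is NaN, with the mean of the 4 rows above it
    if pd.isna(df.iloc[i, 5]):
        df.iloc[i, 5] = df.iloc[i-4:i, 5].mean()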

Note that your column '6' appears to be at positional index 5, which is why iloc uses 5 for the column.

capohugo
0

Using the code provided by Joe Patten above worked great until I had a NaN in the top row of the column. To solve this, I check whether the first value is NaN. If it is, I invert the dataframe, run the rolling average from what was the bottom up, and then reorient it back to the correct position. If the first value is not NaN, I run the rolling average code as before.

if pd.isna(df.loc[3, 6]):  # first row (index label 3) of column 6 is NaN
    # roll over the reversed column so each window looks at the rows below
    df['rollmean5'] = df[6].loc[::-1].rolling(5,center=False,min_periods=1).mean()
    df['rollmean5'] = df['rollmean5'].loc[::-1]
    df[6] = df[6].fillna(round(df['rollmean5'],2))
else:
    df['rollmean5'] = df[6].rolling(5,center=False,min_periods=1).mean()
    df[6] = df[6].fillna(round(df['rollmean5'],2))
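
The same idea can be sketched without the if/else by computing a window in each direction and using one as a fallback for the other. This is only a sketch on made-up sample values. The names above, below and filled are hypothetical, and falling back to the rows below is an assumption about what should happen when no rows above are available.

```python
import numpy as np
import pandas as pd

# Sample column with a NaN in the first row and another further down.
s = pd.Series([np.nan, 2.0, 4.0, 6.0, 8.0, np.nan, 12.0])

# Mean of up to 4 rows above each position (current row excluded).
above = s.shift(1).rolling(4, min_periods=1).mean()

# Mean of up to 4 rows below: reverse, apply the same steps, reverse back.
below = s[::-1].shift(1).rolling(4, min_periods=1).mean()[::-1]

# Prefer the rows above; fall back to the rows below where that is still NaN.
filled = s.fillna(above).fillna(below)
```

For a frame sorted newest-first, like the one in the question, the rows below are the earlier hours, so swapping the order of the two fillna calls would make the prior hours the primary source.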
Kyle C