I am working with data like the following. The dataframe is sorted by the date:

category  value  Date
0         1      24/5/2019     
1         NaN    24/5/2019         
1         1      26/5/2019       
2         2      1/6/2019      
1         2      23/7/2019       
2         NaN    18/8/2019         
2         3      20/8/2019       
7         3      1/9/2019 
1         NaN    12/9/2019       
2         NaN    13/9/2019
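For anyone who wants to reproduce this, the sample above can be rebuilt roughly as follows. This is a minimal sketch: the explicit day/month/year format is an assumption based on how the dates are written in the table.

```python
import numpy as np
import pandas as pd

# Rows copied from the table above, already in date order.
df = pd.DataFrame({
    "category": [0, 1, 1, 2, 1, 2, 2, 7, 1, 2],
    "value": [1, np.nan, 1, 2, 2, np.nan, 3, 3, np.nan, np.nan],
    "Date": ["24/5/2019", "24/5/2019", "26/5/2019", "1/6/2019", "23/7/2019",
             "18/8/2019", "20/8/2019", "1/9/2019", "12/9/2019", "13/9/2019"],
})

# Assumption: dates are day/month/year, so "1/6/2019" means 1 June 2019.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
```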

I would like to replace the "NaN" values with the previous mean for that specific category.

What is the best way to do this in pandas?

Some approaches I considered:

1) This little riff:

   df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())

source

This gets me the correct means, but in another column, and it does not replace the NaNs.

2) This riff replaces the NaNs with the average of the columns:

df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

Source 2

Neither of these gives exactly what I want. If someone could guide me on this, it would be much appreciated!

Convex Leopard
  • What exactly do you mean by the "previous mean for that specific category"? The categories repeat. – Erfan Sep 15 '19 at 11:23
  • By this I mean: for every NaN value, look at the corresponding category, find the mean of that category across all previous dates, then replace the NaN with this calculated mean value. – Convex Leopard Sep 15 '19 at 11:25
  • `df['value']=df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))`? – ansev Sep 15 '19 at 11:42

2 Answers

You can fill the missing values from a new Series built with shift + expanding + mean. The first value of group 1 is not replaced, because no previous non-NaN values exist for that group:

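import pandas as pd

# Dates are written day/month/year, so parse them with dayfirst=True
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# Mean of all earlier values within each category (shift excludes the current row);
# group_keys=False keeps the original index so the result aligns with df
s = df.groupby('category', group_keys=False)['value'].apply(lambda x: x.shift().expanding().mean())
# Fill only the missing values; existing values are kept
df['value'] = df['value'].fillna(s)
print(df)
   category  value       Date
0         0    1.0 2019-05-24
1         1    NaN 2019-05-24
2         1    1.0 2019-05-26
3         2    2.0 2019-06-01
4         1    2.0 2019-07-23
5         2    2.0 2019-08-18
6         2    3.0 2019-08-20
7         7    3.0 2019-09-01
8         1    1.5 2019-09-12
9         2    2.5 2019-09-13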
M_S_N
jezrael
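To see why shift + expanding + mean yields the "previous mean", it may help to trace a single group by hand. The sketch below uses only the category 1 values from the question; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Category 1 values from the question, in date order.
values = pd.Series([np.nan, 1.0, 2.0, np.nan])

# shift() moves every value down one row, so each row sees only earlier values.
shifted = values.shift()                  # NaN, NaN, 1.0, 2.0

# expanding().mean() averages everything up to the current row, skipping NaNs.
prior_mean = shifted.expanding().mean()   # NaN, NaN, 1.0, 1.5

# fillna() touches only the missing rows; observed values stay as they are.
filled = values.fillna(prior_mean)
print(filled.tolist())                    # [nan, 1.0, 2.0, 1.5]
```

The first row stays NaN because there is nothing earlier to average, which matches row 1 in the output above.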

You can use pandas.Series.fillna to replace NaN values:

df['value'] = df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)

   category  value       Date
0         0    1.0  24/5/2019
1         1    NaN  24/5/2019
2         1    1.0  26/5/2019
3         2    2.0   1/6/2019
4         1    2.0  23/7/2019
5         2    2.0  18/8/2019
6         2    3.0  20/8/2019
7         7    3.0   1/9/2019
8         1    1.5  12/9/2019
9         2    2.5  13/9/2019
ansev
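One way to sanity-check the vectorised fill is to compare it with a plain loop that follows the definition from the comments: for each NaN, average the earlier observed values of the same category. This is only a sketch; `fill_with_prior_mean` is a hypothetical helper written for this check, not part of pandas.

```python
import numpy as np
import pandas as pd

# Sample data from the question (Date omitted; rows are already in date order).
df = pd.DataFrame({
    "category": [0, 1, 1, 2, 1, 2, 2, 7, 1, 2],
    "value": [1, np.nan, 1, 2, 2, np.nan, 3, 3, np.nan, np.nan],
})

# Vectorised version, as in the answers above.
prior_mean = df.groupby("category")["value"].transform(
    lambda x: x.shift().expanding().mean()
)
vectorised = df["value"].fillna(prior_mean)


def fill_with_prior_mean(frame):
    """Hypothetical loop-based reference, used only as a cross-check."""
    seen = {}  # category -> observed (non-NaN) values so far
    out = []
    for cat, val in zip(frame["category"], frame["value"]):
        history = seen.setdefault(cat, [])
        if pd.isna(val):
            # No earlier observation for this category: leave the gap as NaN.
            out.append(sum(history) / len(history) if history else np.nan)
        else:
            out.append(val)
            history.append(val)
    return pd.Series(out, index=frame.index)


reference = fill_with_prior_mean(df)
# Raises AssertionError if the two results differ anywhere.
pd.testing.assert_series_equal(vectorised, reference, check_names=False)
```

Both versions average only the originally observed values, so a filled value is not fed back into later means.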