0

as part of some data cleansing, i want to add the mean of a variable back into a dataframe to use if the variable is missing for a particular observation. so i've calculated my averages as follows

avg=all_data2.groupby("portfolio")"[sales"].mean().reset_index(name="sales_mean")

now I wanted to add that back into my original dataframe using a left join, but it doesnt appear to be working. what format is my avg now? I thought it would be a dataframe but is it something else?

Denis Mclaughlin
  • 53
  • 1
  • 1
  • 8

2 Answers2

2

UPDATED

This is probably the most succinct way to do it:

all_data2.sales = all_data2.sales.fillna(all_data2.groupby('portfolio').sales.transform('mean'))

This is another way to do it:

all_data2['sales'] = all_data2[['portfolio', 'sales']].groupby('portfolio').transform(lambda x: x.fillna(x.mean()))

Output:

   portfolio  sales
0          1   10.0
1          1   20.0
2          2   30.0
3          2   40.0
4          3   50.0
5          3   60.0
6          3    NaN
   portfolio  sales
0          1   10.0
1          1   20.0
2          2   30.0
3          2   40.0
4          3   50.0
5          3   60.0
6          3   55.0

To answer your the part of your question that reads "what format is my avg now? I thought it would be a dataframe but is it something else?", avg is indeed a dataframe but using it may not be the most direct way to update missing data in the original dataframe. The dataframe avg looks like this for the sample input data above:

   portfolio  sales_mean
0          1        15.0
1          2        35.0
2          3        55.0

A related SO question that you may find helpful is here.

constantstranger
  • 9,176
  • 2
  • 5
  • 19
0

If you want to add a new column, you can use this code:

df['sales_mean']=df[['sales_1','sales_2']].mean(axis=1)
Mr.F.K
  • 21
  • 1
  • 3