1

I am trying to replace the missing values in a dataframe based on filtering of another column, "Country"

>>> data.head()
   Country  Advanced skiers, freeriders   Snow parks 
0       Greece                           NaN          NaN
1  Switzerland                           5.0          5.0
2          USA                           NaN          NaN
3       Norway                           NaN          NaN
4       Norway                           3.0          4.0

Obviously this is just a small snippet of the data, but I am looking to replace all the NaN values with the average value for each feature.

I have tried grouping the data by the country and then calculating the mean of each column. When I print out the resulting array, it comes up with the expected values. However, when I put it into the .fillna() method, the data appears unchanged

I've tried @DSM's solution from this similar post, but I am not sure how to apply it to multiple columns.

listOfRatings = ['Advanced skiers, freeriders', 'Snow parks']

print (data.groupby('Country')[listOfRatings].mean().fillna(0))
-> displays the expected results

data[listOfRatings] = data[listOfRatings].fillna(data.groupby('Country')[listOfRatings].mean().fillna(0))
-> appears to do nothing to the dataframe

Assuming this is the complete dataset, this is what I would expect the results to be.

   Country  Advanced skiers, freeriders   Snow parks 
0       Greece                           0.0          0.0
1  Switzerland                           5.0          5.0
2          USA                           0.0          0.0
3       Norway                           3.0          4.0
4       Norway                           3.0          4.0

Can anyone explain what I am doing wrong, and how to fix the code?

jpp
  • 159,742
  • 34
  • 281
  • 339
theotheraussie
  • 495
  • 1
  • 4
  • 14

1 Answers1

2

You can use transform for return new DataFrame with same size as original filled by aggregated values:

print (data.groupby('Country')[listOfRatings].transform('mean').fillna(0))
   Advanced skiers, freeriders  Snow parks
0                          0.0         0.0
1                          5.0         5.0
2                          0.0         0.0
3                          3.0         4.0
4                          3.0         4.0

#dynamic generate all columns names without Country
listOfRatings = data.columns.difference(['Country'])
df1 = data.groupby('Country')[listOfRatings].transform('mean').fillna(0)
data[listOfRatings] = data[listOfRatings].fillna(df1)
print (data)

print (data)

       Country  Advanced skiers, freeriders  Snow parks
0       Greece                          0.0         0.0
1  Switzerland                          5.0         5.0
2          USA                          0.0         0.0
3       Norway                          3.0         4.0
4       Norway                          3.0         4.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252