I have a pandas dataframe which I would like to split into groups, calculate the mean and standard deviation, and then replace all outliers with the mean of the group. Outliers are defined as such if they are more than 3 standard deviations away from the group mean.
df = pandas.DataFrame({'a': ['A','A','A','B','B','B','B'], 'b': [1.1,1.2,1.1,3.3,3.4,3.3,100.0]})
I thought that the following would work:
df.groupby('a')['b'].transform(lambda x: x[i] if np.abs(x[i]-x.mean())<=(3*x.std()) else x.mean() for i in range(0,len(x)))
but get the following error:
NameError: name 'x' is not defined
I have also tried defining a transform function separately:
def trans_func(x):
mean = x.mean()
std = x.std()
length = len(x)
for i in range(0,length):
if abs(x[i]-mean)<=(3*std):
return x
else:
return mean
and then calling it like so:
df.groupby('a')['b'].transform(lambda x: trans_func(x))
but I get a different error:
KeyError: 0
Finally, I resorted to creating a separate column altogether:
df['c'] = [df.groupby('a')['b'].transform(mean) if df.groupby('a')['b'].transform(lambda x: (x - x.mean()) / x.std()) > 3 else df['b']]
but this hasn't worked either:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any advice much appreciated.