For a pandas.Series, I know how to remove outliers. With something like this:
x = pd.Series(np.random.normal(size=1000))
iqr = x.quantile(.75) - x.quantile(.25)
y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
I would like to do thins over the different Series/columns of a DataFrame
import string
import random
df = pd.DataFrame([])
df['A'] = pd.Series(np.random.normal(size=1000))
df['B'] = pd.Series(np.random.normal(size=1000, loc=-5, scale=1))
df['C'] = pd.Series(np.random.normal(size=1000, loc=10, scale=2))
df['index'] = pd.Series([random.choice(string.ascii_uppercase) for i in range(1000)])
df.set_index('index')
I usually do stuff like
df = df.groupby('index').mean()
However, in this case, it would also average the outliers, which I would like to ignore from averaging.
Notice that the random data makes than the outliers are in different positions in each column. So an outlier should be ignored only in that column/Series
The result should be a DataFrame, with 26 lines (one for each letter of index
), and 3 columns, with the values averaged without outliers
I can loop over the columns of df
and do the first block of code. But is there a nicer way?
Suggestion are welcome. Any approach is accepted