I have a relatively large DataFrame object (about a million rows, hundreds of columns), and I'd like to clip outliers in each column by group. By "clip outliers for each column by group" I mean - compute the 5% and 95% quantiles for each column in a group and clip values outside this quantile range.
Here's the setup I'm currently using:
def winsorize_series(s):
q = s.quantile([0.05, 0.95])
if isinstance(q, pd.Series) and len(q) == 2:
s[s < q.iloc[0]] = q.iloc[0]
s[s > q.iloc[1]] = q.iloc[1]
return s
def winsorize_df(df):
return df.apply(winsorize_series, axis=0)
and then, with my DataFrame called features
and indexed by DATE
, I can do
grouped = features.groupby(level='DATE')
result = grouped.apply(winsorize_df)
This works, except that it's very slow, presumably due to the nested apply
calls: one on each group, and then one for each column in each group. I tried getting rid of the second apply
by computing quantiles for all columns at once, but got stuck trying to threshold each column by a different value. Is there a faster way to accomplish this procedure?