
I have a relatively large DataFrame (about a million rows, hundreds of columns), and I'd like to clip outliers in each column by group. By "clip outliers in each column by group" I mean: compute the 5% and 95% quantiles for each column within a group and clip values that fall outside that quantile range.

Here's the setup I'm currently using:

def winsorize_series(s):
    # Clip one column to its own 5%/95% quantile range
    q = s.quantile([0.05, 0.95])
    if isinstance(q, pd.Series) and len(q) == 2:
        s[s < q.iloc[0]] = q.iloc[0]
        s[s > q.iloc[1]] = q.iloc[1]
    return s

def winsorize_df(df):
    # Apply the clipping to each column of the group
    return df.apply(winsorize_series, axis=0)

and then, with my DataFrame called features and indexed by DATE, I can do

grouped = features.groupby(level='DATE')
result = grouped.apply(winsorize_df)

This works, except that it's very slow, presumably due to the nested apply calls: one on each group, and then one for each column in each group. I tried getting rid of the second apply by computing quantiles for all columns at once, but got stuck trying to threshold each column by a different value. Is there a faster way to accomplish this procedure?
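
For concreteness, the bulk quantile step I was attempting looks roughly like this (group here stands for one chunk from the groupby); it's the next step, applying a different pair of bounds to each column, that I couldn't work out:

# Bulk quantile step: one quantile call per group instead of one per column.
# q is a two-row DataFrame (rows 0.05 and 0.95, one column per feature).
q = group.quantile([0.05, 0.95])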

ytsaig
  • It seems like this question is addressing the tool of Winsorization (which I'm looking for right now) while the related question is about removing rows from the data frame. Different questions, imo, and linked, but one does not solve the other's problem. – Surya Narayanan Nov 16 '22 at 19:23

5 Answers


There is a winsorize function in scipy.stats.mstats which you might consider using. Note, however, that it returns slightly different values than winsorize_series: mstats.winsorize replaces clipped entries with the nearest value actually retained in the data, whereas s.quantile interpolates between data points:

In [126]: winsorize_series(pd.Series(range(20), dtype='float'))[0]
Out[126]: 0.95000000000000007

In [127]: mstats.winsorize(pd.Series(range(20), dtype='float'), limits=[0.05, 0.05])[0]
Out[127]: 1.0

Depending on N, M and P, using mstats.winsorize instead of winsorize_series is roughly 1.5x faster:

import numpy as np
import pandas as pd
from scipy.stats import mstats

def using_mstats_df(df):
    # Winsorize each column of the group
    return df.apply(using_mstats, axis=0)

def using_mstats(s):
    # Clip the bottom and top 5% to the nearest retained value
    return mstats.winsorize(s, limits=[0.05, 0.05])

N, M, P = 10**5, 10, 10**2   # total rows, columns, rows per date
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
grouped = df.groupby(level='DATE')

In [122]: %timeit result = grouped.apply(winsorize_df)
1 loops, best of 3: 17.8 s per loop

In [123]: %timeit mstats_result = grouped.apply(using_mstats_df)
1 loops, best of 3: 11.2 s per loop
unutbu
  • Thanks, that's a good pointer, I didn't realize scipy had a `winsorize` function. However, I presume a more substantial speed up would be achieved if there's a way to do the operation in bulk on the DataFrame without having to operate column by column, similar to how one could standardize or normalize in bulk, e.g., http://stackoverflow.com/questions/12525722/normalize-data-in-pandas – ytsaig Dec 14 '14 at 09:24
  • Are there the same number of dates in each group? – unutbu Dec 14 '14 at 11:54
  • the group by operation is by date, so each group only has one date. Do you mean to ask whether each group has the same number of rows? The answer to that is no, each date can (and typically does) have a different number of rows. – ytsaig Dec 14 '14 at 12:31
  • Yes, I meant to ask if there were the same number of *rows* in each group. If they were all the same, there might have been a way to reshape the underlying NumPy array so winsorize could be called **once** on the whole array. As it stands, the calculation is slow because winsorize has to be called many times -- once for each group and column. – unutbu Dec 14 '14 at 13:08
  • @YT As you alluded to in the OP, pandas now has a `.clip()` function that should work for you, especially when combined with `.quantile()` (see the sketch after these comments). – Zhang18 May 22 '17 at 22:02
  • this method does not handle missing values properly: it includes them, which skews the percentiles of the non-missing data. Using @Zhang18's suggestion is a better choice if you want to exclude missing data – rump roast Sep 28 '17 at 16:11
  • See this question I just posted, then answered, using clip() and quantile() as suggested by @Zhang18 to handle missing values: https://stackoverflow.com/questions/50612095/winsorizing-data-by-column-in-pandas-with-nan/50612631#50612631 – Jesse Blocher May 30 '18 at 20:10
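
A minimal sketch of the clip/quantile combination suggested in these comments, applied per group with transform (an illustration, not code from the thread; Series.quantile skips NaNs by default, which addresses the missing-value concern):

# Sketch: pure-pandas winsorization, per group and per column.
# groupby().transform passes each column of each group as a Series;
# the clip bounds come from that Series' own quantiles (NaNs are skipped).
result = features.groupby(level='DATE').transform(
    lambda s: s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
)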

Here is a solution without using scipy.stats.mstats:

def clip_series(s, lower, upper):
    # Clip a Series to its own [lower, upper] quantile range
    # (no axis argument is needed when clipping a Series)
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))

# List of features to be winsorized
feature_list = list(features.columns)

for f in feature_list:
    features[f] = clip_series(features[f], 0.05, 0.95)
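
The same loop can also be written as a single bulk operation, since DataFrame.clip accepts per-column bounds as Series when axis=1 is passed (a sketch of the equivalent form, not part of the original answer):

# Equivalent bulk form: clip every column to its own quantile range at once.
# DataFrame.quantile returns a Series indexed by column name, and axis=1
# tells clip to align those bounds against the columns.
features = features.clip(lower=features.quantile(0.05),
                         upper=features.quantile(0.95), axis=1)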
tnf

I found a rather straightforward way to get this to work, using the transform method in pandas.

from scipy.stats import mstats

lower_lim, upper_lim = 0.05, 0.05   # fraction to clip at each tail

def winsorize_series(group):
    # Clip one column of one group at the given tail limits
    return mstats.winsorize(group, limits=[lower_lim, upper_lim])

grouped = features.groupby(level='DATE')
result = grouped.transform(winsorize_series)
mwolverine

A good way to approach this is with vectorization, and for that I love to use np.where.

import pandas as pd
import numpy as np
from scipy.stats import mstats

data = pd.Series(range(20), dtype='float')

def WinsorizeCustom(data):
    # Compute both quantile bounds in one call
    quantiles = data.quantile([0.05, 0.95])
    q_05 = quantiles.loc[0.05]
    q_95 = quantiles.loc[0.95]

    # Nested np.where: values at or below q_05 -> q_05, at or above q_95 -> q_95
    out = np.where(data <= q_05, q_05,
                   np.where(data >= q_95, q_95, data))
    return out
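
Note that np.where returns a plain ndarray, so the pandas index is lost; if the index matters downstream, the result can be wrapped back into a Series (a small usage sketch, not from the original answer):

# np.where drops the pandas index; restore it if it is needed downstream
clipped = pd.Series(WinsorizeCustom(data), index=data.index)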

For comparison, I wrapped the scipy version in a function as well:

def WinsorizeStats(data):
    out = mstats.winsorize(data, limits=[0.05, 0.05])
    return out

But as you can see, even though my function is pretty fast, it's still far behind the Scipy implementation:

%timeit WinsorizeCustom(data)
#1000 loops, best of 3: 842 µs per loop

%timeit WinsorizeStats(data)
#1000 loops, best of 3: 212 µs per loop

If you are interested in reading more about speeding up pandas code, I would suggest Optimizing Pandas for speed and From Python to Numpy.

HonzaB

I have a 2D array with rows as observations and columns as features, and the requirement is to omit entire rows that contain any abnormal feature values.

data = np.array([[1, 8, 13, 113, 401],
                 [2, 8, 15, 119, 402],
                 [1, 9, 14, 117, 399],
                 [100, 7, 12, 110, 409],
                 [4, 70, 11, 111, 404]
                 ])

Is there an existing API or function to do this?
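
A sketch of one way to do it, assuming "abnormal" means outside each column's 5%/95% quantile range (that threshold choice is an assumption, not given above):

import pandas as pd

# Keep only the rows where every feature lies within its column's
# 5%/95% quantile range; rows with any out-of-range value are dropped.
df = pd.DataFrame(data)
mask = df.apply(lambda col: col.between(col.quantile(0.05), col.quantile(0.95)))
filtered = df[mask.all(axis=1)]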

S G