I am trying to clip outliers in a DataFrame based on per-column quantiles. Say I have the following (with numpy imported as np and pandas as pd):

df = pd.DataFrame(np.random.randn(10, 2))

          0         1
0  0.734355  0.594992
1 -0.745949  0.597601
2  0.295606  0.972196
3  0.474539  1.462364
4  0.238838  0.684790
5 -0.659094  0.451718
6  0.675360 -1.286660
7  0.713914  0.135179
8 -0.435309 -0.344975
9  1.200617 -0.392945

I currently use

df_clipped = df.apply(lambda col: col.clip(*col.quantile([0.05,0.95]).values))

          0         1
0  0.734355  0.594992
1 -0.706865  0.597601
2  0.295606  0.972196
3  0.474539  1.241788
4  0.238838  0.684790
5 -0.659094  0.451718
6  0.675360 -0.884488
7  0.713914  0.135179
8 -0.435309 -0.344975
9  0.990799 -0.392945

This works, but I am wondering whether there is a more elegant pandas/NumPy-based approach.

hilberts_drinking_problem

1 Answer

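You can call clip directly on the DataFrame. Pass the per-column quantiles as the lower and upper bounds and use axis=1 so that each bound is aligned with its column: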

df.clip(df.quantile(0.05), df.quantile(0.95), axis=1)
Out: 
          0         1
0  0.734355  0.594992
1 -0.706864  0.597601
2  0.295606  0.972196
3  0.474539  1.241788
4  0.238838  0.684790
5 -0.659094  0.451718
6  0.675360 -0.884488
7  0.713914  0.135179
8 -0.435309 -0.344975
9  0.990799 -0.392945
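
As a rough check that the two approaches agree, here is a minimal sketch. The seeded generator and the variable names (`lower`, `upper`, `clipped_apply`, `clipped_direct`) are assumptions added for illustration and are not part of the original post, so the random values will differ from the tables shown here.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)))

# Per-column 5% and 95% quantiles; each is a Series indexed by column label.
lower = df.quantile(0.05)
upper = df.quantile(0.95)

# Column-by-column version from the question.
clipped_apply = df.apply(lambda col: col.clip(*col.quantile([0.05, 0.95]).values))

# Vectorized version: axis=1 aligns the bound Series with the columns.
clipped_direct = df.clip(lower, upper, axis=1)

# Both versions should agree up to floating-point tolerance.
pd.testing.assert_frame_equal(clipped_apply, clipped_direct)
```

Computing the quantiles once and passing them as Series avoids the per-column lambda, and the same `lower` and `upper` can be reused to clip other data with matching column labels.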
ayhan