Eliminating all data over a given percentile

Question

I have a pandas DataFrame called data with a column called ms. I want to eliminate all the rows where data.ms is above the 95% percentile. For now, I'm doing this:

limit = data.ms.describe(90)['95%']
valid_data = data[data['ms'] < limit]

which works, but I want to generalize that to any percentile. What's the best way to do that?

score 119 · Accepted Answer · edited Oct 11 '18 at 12:20

119

Use the Series.quantile() method:

In [48]: cols = list('abc')

In [49]: df = DataFrame(randn(10, len(cols)), columns=cols)

In [50]: df.a.quantile(0.95)
Out[50]: 1.5776961953820687

To filter out rows of df where df.a is greater than or equal to the 95th percentile do:

In [72]: df[df.a < df.a.quantile(.95)]
Out[72]:
       a      b      c
0 -1.044 -0.247 -1.149
2  0.395  0.591  0.764
3 -0.564 -2.059  0.232
4 -0.707 -0.736 -1.345
5  0.978 -0.099  0.521
6 -0.974  0.272 -0.649
7  1.228  0.619 -0.849
8 -0.170  0.458 -0.515
9  1.465  1.019  0.966

edited Oct 11 '18 at 12:20

Community

1
1

answered Sep 02 '13 at 20:40

Phillip Cloud

24,919
11
68
88

using pandas,If I want compare different col with specific quantile, is there quick method similar numpy broadcasting? – Pengju Zhao Jul 28 '17 at 04:34
1

does it also work when removing over all columns, i.e. `df[df < df.quantile(.95)]`? I expect all the values being filtered out if not in the range and replace by `NaN` if needed. – Mattia Paterna Nov 23 '17 at 14:19
Principally the same but more concise: `df.query('a < a.quantile(.95)')`. If the column name is lengthy that can improve readability: `col = 'some_verbose_metric_name'; df.query(f'{col} < {col}.quantile(.95)')` – ribitskiyb Oct 16 '19 at 23:04

score 51 · Answer 2 · answered Sep 01 '16 at 16:02

51

numpy is much faster than Pandas for this kind of things :

numpy.percentile(df.a,95) # attention : the percentile is given in percent (5 = 5%)

is equivalent but 3 times faster than :

df.a.quantile(.95)  # as you already noticed here it is ".95" not "95"

so for your code, it gives :

df[df.a < np.percentile(df.a,95)]

answered Sep 01 '16 at 16:02

2diabolos.com

992
9
15

3

Can confirm that numpy's implementation is waaay faster if you can afford the column extract cost – deepelement Oct 17 '18 at 14:11
@2diabolos.com is there a way to implement percentile filter on multiple pandas column. – Furqan Hashim Mar 04 '19 at 10:48
Something like df[numpy.logical_and(df.a < np.percentile(df.a,95),df.b < np.percentile(df.b,95))] ? Or you may create a new question for that... – 2diabolos.com Mar 11 '19 at 13:13
1

@deepelement what is the tradeoff with column extract cost? – Fred Zimmerman Dec 06 '20 at 23:00

score 7 · Answer 3 · answered Oct 09 '21 at 14:46

7

You can use query for a more concise option:

df.query('ms < ms.quantile(.95)')

answered Oct 09 '21 at 14:46

hellpanderr

5,581
3
33
43

Eliminating all data over a given percentile

3 Answers3

Linked