65

The quantile function gives us the quantile of a given pandas Series s,

E.g.

s.quantile(0.9) is 4.2

Is there an inverse function (i.e. the cumulative distribution function) which finds the value x such that

s.quantile(x) = 4

Thanks

Mannaggia

12 Answers

85

Use scipy.stats.percentileofscore:

# libs required
from scipy import stats
import pandas as pd
import numpy as np

# generate random data with a fixed seed (to be reproducible)
np.random.seed(seed=1)
df = pd.DataFrame(np.random.uniform(0, 1, (10)), columns=['a'])

# quantile function
x = df['a'].quantile(0.5)

# inverse of quantile
stats.percentileofscore(df['a'], x)
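
As a side note, percentileofscore returns a percentage on the 0-100 scale, while quantile works on the 0-1 scale, and its kind argument controls how values equal to x are counted; kind='weak' corresponds to the CDF definition Pr[s <= x]. A minimal sketch of converting back to the quantile scale, reusing df and x from above:

# 'weak' counts values <= x; divide by 100 to get back to quantile units
p = stats.percentileofscore(df['a'], x, kind='weak') / 100.0
p  # 0.5 here, since x was taken as the median
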
fernandosjp
  • It is worth noting that if you have NaN values in your series, the quantile and percentileofscore functions do not seem to treat them the same way, i.e. the functions are not exact inverses of each other. – agftrading Dec 16 '18 at 16:56
  • Note that pandas interpolation results in an inconsistent answer when the quantile does not line up precisely with a value; try `quantile(0.51)`, for instance, and the inverse will not be the same. – Brian Bien Feb 10 '19 at 17:01
  • Just do y = stats.percentileofscore(df['a'].dropna(), x) to get the inverse that matches up with df['a'].quantile(y) == x – jma Sep 10 '19 at 19:09
24

Sorting can be expensive, so if you are looking for a single value I'd guess you'd be better off computing it with:

s = pd.Series(np.random.uniform(size=1000))
( s < 0.7 ).astype(int).mean() # =0.7ish

There's probably a way to avoid the int(bool) shenanigan.
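
Indeed, the cast isn't strictly needed: pandas averages booleans as 0/1, so a slightly shorter version of the same idea is

( s < 0.7 ).mean() # booleans are treated as 0/1 when averaged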

ILoveCoding
13

Mathematically speaking, you're trying to find the CDF, i.e. the probability that s is smaller than or equal to a given value (or quantile) q:

F(q) = Pr[s <= q]

One can use numpy.mean for a one-line solution:

np.mean(s.to_numpy() <= q)
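
For example, with a small made-up series (the numbers below are only for illustration):

import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.0, 4.0, 4.2, 5.1])  # arbitrary example data
q = 4.0
np.mean(s.to_numpy() <= q)  # 0.6, i.e. 60% of the values are <= 4.0
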
Cristian Ciupitu
  • Imo using mean(x) = sum(x)/len(x) is an elegant solution needing a bit of explanation: it counts (sum(x)) how many values are lower than or equal to q and makes it a relative frequency (/len(x)). This is the definition of the ECDF (empirical distribution function, https://en.wikipedia.org/wiki/Empirical_distribution_function). Technically `s.to_numpy()` is used to bring `s` into the right format (e.g. if `s` is a list like `[1, 2, 2, 3, 3]`), which imo is not needed if `s` is already a pandas Series (`s = pd.Series([1, 2, 2, 3, 3])`) as stated in the OP. – Qaswed Jan 21 '20 at 16:08
  • @Qaswed, I think it is not true that "mean(x) = sum(x)/len(x) is an elegant solution needing a bit of explanation: it counts (sum(x)) how many values are lower than or equal to q and makes it a relative frequency (/len(x))." sum(x) is just adding all the elements in x. The distribution function would involve the sum of an indicator function (the indicator of x being <= q). – JoAnn Alvarez Sep 28 '20 at 21:20
10

There's no 1-liner that I know of, but you can achieve this with scipy:

import pandas as pd
import numpy as np
from scipy.interpolate import interp1d

# set up a sample dataframe
df = pd.DataFrame(np.random.uniform(0,1,(11)), columns=['a'])
# sort it by the desired series and calculate the percentile
sdf = df.sort_values('a').reset_index()
sdf['b'] = sdf.index / float(len(sdf) - 1)
# setup the interpolator using the value as the index
interp = interp1d(sdf['a'], sdf['b'])

# a is the value, b is the percentile
>>> sdf
    index         a    b
0      10  0.030469  0.0
1       3  0.144445  0.1
2       4  0.304763  0.2
3       1  0.359589  0.3
4       7  0.385524  0.4
5       5  0.538959  0.5
6       8  0.642845  0.6
7       6  0.667710  0.7
8       9  0.733504  0.8
9       2  0.905646  0.9
10      0  0.961936  1.0

Now we can see that the two functions are inverses of each other.

>>> df['a'].quantile(0.57)
0.61167933268395969
>>> interp(0.61167933268395969)
array(0.57)
>>> interp(df['a'].quantile(0.43))
array(0.43)

interp can also take in a list, a NumPy array, or a pandas Series; any array-like input, really!
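
As a usage note, interp can map several values back at once; a quick sketch reusing the df and interp built above (under linear interpolation the round trip should give roughly [0.25, 0.5, 0.75]):

# quantile values of 'a' mapped back to their percentiles
interp(df['a'].quantile([0.25, 0.50, 0.75]))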

Mike
9

% of records in s that are less than x:

# Find the percentile of `x` in `s`
(s<x).mean()  # i.e., (s<x).sum()/len(s)

That's it.

You can also use pandas.Series.searchsorted when s is sorted:

s.searchsorted(x)/len(s)
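
One caveat worth spelling out: with the default side='left', searchsorted counts values strictly less than x, while side='right' counts values less than or equal to x, which matches the Pr[s <= x] definition used elsewhere on this page. A sketch, assuming s still needs sorting first:

s_sorted = s.sort_values().reset_index(drop=True)  # searchsorted requires sorted input
s_sorted.searchsorted(x, side='right') / len(s_sorted)  # fraction of values <= x
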
tozCSS
7

Just came across the same problem. Here's my two cents.

def inverse_percentile(arr, num):
    # fraction of the values in arr that are <= num
    arr = sorted(arr)
    i_arr = [i for i, x in enumerate(arr) if x > num]

    return i_arr[0] / len(arr) if len(i_arr) > 0 else 1
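
A quick check of the helper with made-up numbers:

inverse_percentile([1, 2, 3, 4, 5], 3)   # 0.6: three of the five values are <= 3
inverse_percentile([1, 2, 3, 4, 5], 10)  # 1: no value is larger, so the fallback applies
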
Calvin Ku
1

You can use the ECDF function from statsmodels. ECDF stands for empirical cumulative distribution function, "empirical" referring to the fact that the function it creates is based on what is observed in your data.

Suppose you have a series s:

import numpy as np
import pandas as pd

s = pd.Series(np.random.uniform(size=1000))

You can evaluate the CDF at 0.282:

(s <= 0.282).mean()

Or you can create the ECDF using the statsmodels function:

from statsmodels.distributions.empirical_distribution import ECDF

ecdf_s = ECDF(s)

ecdf_s

[ecdf_s(k) for k in [0.282, 0.544, 0.775]]

And check that it is the inverse of the quantiles:

s.quantile([0.25, 0.50, 0.75])
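
For what it's worth, ECDF uses the weak inequality, so its output should agree with the (s <= x).mean() approach shown above:

ecdf_s(0.282) == (s <= 0.282).mean()  # both count the fraction of observations <= 0.282
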
1

I use the np.searchsorted function to "find indices where elements should be inserted to maintain order":

import random

import numpy as np
import pandas as pd

np.random.seed(seed=1)

#we want to find the 5th 10-tile of a series of 20 elements
S = 20
N = 10
n = 5

df = pd.DataFrame(np.random.uniform(0,1,S), columns=['a'])

#quantile N function
q = df['a'].quantile(np.arange(0,N+1)/(N))

print(q)

#retrieve the ntile
x = q.iloc[n]

print('-'*30)
print(f"the {n}th {N}-tile of the series is: {x}")

#inverse
print('-'*30)
print(f"{x} is in the {np.searchsorted(q,x)}th {N}-tile of the series")

#and it works also with a value not present in the series
x=x+random.uniform(-.2,.2)
print('-'*30)
print(f"{x} is in the {np.searchsorted(q,x)}th {N}-tile of the series")

output:

0.0    0.000114
0.1    0.085843
0.2    0.145482
0.3    0.194549
0.4    0.263180
0.5    0.371164
0.6    0.417135
0.7    0.455081
0.8    0.581045
0.9    0.688730
1.0    0.878117
Name: a, dtype: float64
------------------------------
the 5th 10-tile of the series is: 0.37116410063685884
------------------------------
0.37116410063685884 is in the 5th 10-tile of the series
------------------------------
0.27693796519907005 is in the 5th 10-tile of the series
1

It's the proportion of values of s less than or equal to x:

s.le(x).mean()

as others have pointed out.
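
For instance, with a tiny made-up series just to illustrate:

import pandas as pd

s = pd.Series([1, 2, 3, 4])
s.le(3).mean()  # 0.75: three of the four values are <= 3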

0

I came across this question while looking for an answer to a similar problem. I couldn't find a solution, so I'll just post the answer here; hope it helps someone.

If you have an array arr and you are interested in the value at the 0.75 quantile, you can do:

np.quantile(arr, 0.75)

Now suppose the opposite. You have a value x and your question is: what is the percentile of this value?

You can do:

np.count_nonzero(arr <= x) / len(arr)

Quick demo:

import numpy as np

x = np.array(range(1, 101))
print(np.quantile(x, 0.75))
print(np.count_nonzero(x <= 20)/len(x))

This prints out:

75.25
0.2
waykiki
0

Over the years I kept coming back to this question every once in a while, and I've decided to answer it for my own sake! hahaha

At some point pandas added a function similar to PERCENTRANK in Excel, namely rank. Just make sure you set the flag pct=True to get the desired output.

s.rank(pct=True)

will return the percentile ranks for the entire series you gave; you then just look up the one you want. In your case you could just add the number 4 to the original series first, as sketched below.
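
A minimal sketch of that last step, assuming s is the series from the question and 4 is the value whose percentile we want (the concat trick is just one way to do it):

import pandas as pd

x = 4  # the value of interest, from the question
pd.concat([s, pd.Series([x])], ignore_index=True).rank(pct=True).iloc[-1]

Note that this ranks x against the series including x itself, so the result is slightly shifted compared to ranking against the original series alone; for long series the difference is negligible.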

Pedro Braz
0

Try rank with pct=True. stats.percentileofscore gives the correct answer, but at least for rolling computations I found rank with pct=True to be much faster.
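
For reference, a sketch of the comparison I believe is meant here (it assumes a pandas version where Rolling.rank is available, i.e. 1.4+; the window length is arbitrary):

import numpy as np
import pandas as pd
from scipy import stats

s = pd.Series(np.random.uniform(size=1000))
w = 100  # arbitrary window length

# percentile rank of each element within its trailing window, via rolling rank
fast = s.rolling(w).rank(pct=True)

# the same idea via percentileofscore applied window by window (much slower;
# the two ranking conventions differ slightly, so values are close but not identical)
slow = s.rolling(w).apply(lambda v: stats.percentileofscore(v, v[-1]) / 100, raw=True)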

e271p314