290

Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array?

I am looking for something similar to Excel's percentile function.

I looked in NumPy's statistics reference, and couldn't find this. All I could find is the median (50th percentile), but not something more specific.

kmario23
Uri
  • A related question on computation of percentiles from frequencies: https://stackoverflow.com/questions/25070086/percentiles-from-counts-of-values – newtover Oct 10 '19 at 07:33

12 Answers

381

NumPy has np.percentile().

import numpy as np
a = np.array([1,2,3,4,5])
p = np.percentile(a, 50)  # return 50th percentile, i.e. median.
print(p)  # 3.0

SciPy has scipy.stats.scoreatpercentile(), in addition to many other statistical goodies.
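np.percentile also accepts a sequence of percentiles, so several cut points can be computed in one call. A minimal sketch, reusing the array from above (the variable names are only illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Passing a list of percentiles returns one value per requested percentile.
quartiles = np.percentile(a, [25, 50, 75])
print(quartiles)

# np.quantile performs the same computation with q given in the range 0 to 1.
median = np.quantile(a, 0.5)
print(median)
```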

Mateen Ulhaq
Jon W
  • 2
    Thank you! So that's where it's been hiding. I was aware of scipy but I guess I assumed simple things like percentiles would be built into numpy. – Uri Mar 03 '10 at 20:51
  • 19
    By now, a percentile function exists in numpy: http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html – Anaphory Oct 29 '13 at 14:36
  • 1
    You can use it as an aggregation function as well, e.g. to compute the tenth percentile of each group of a value column by key, use `df.groupby('key')[['value']].agg(lambda g: np.percentile(g, 10))` – patricksurry Nov 26 '13 at 17:25
  • 1
    Note that SciPy recommends to use np.percentile for NumPy 1.9 and higher – Tim Diels Nov 26 '15 at 18:21
88

By the way, there is a pure-Python implementation of a percentile function, in case you don't want to depend on SciPy. The function is copied below:

## {{{ http://code.activestate.com/recipes/511478/ (r1)
import math
import functools

def percentile(N, percent, key=lambda x:x):
    """
    Find the percentile of a list of values.

    @parameter N - is a list of values. Note N MUST BE already sorted.
    @parameter percent - a float value from 0.0 to 1.0.
    @parameter key - optional key function to compute value from each element of N.

    @return - the percentile of the values
    """
    if not N:
        return None
    k = (len(N)-1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c-k)
    d1 = key(N[int(c)]) * (k-f)
    return d0+d1

# median is 50th percentile.
median = functools.partial(percentile, percent=0.5)
## end of http://code.activestate.com/recipes/511478/ }}}
Boris Gorelik
  • 67
    I am the author of the above recipe. A commenter in ASPN has pointed out the original code has a bug. The formula should be d0 = key(N[int(f)]) * (c-k); d1 = key(N[int(c)]) * (k-f). It has been corrected on ASPN. – Wai Yip Tung Apr 25 '11 at 03:43
  • 2
    How does `percentile` know what to use for `N`? It isn't specified in the function call. – Richard Oct 31 '13 at 09:54
  • 22
    for those who didn't even read the code, before using it, N must be sorted – kevin Mar 04 '14 at 02:55
  • I'm confused by the lambda expression. What does it do and how does it do it? I know what lambda expression are so I am not asking what lambda is. I am asking what does this specific lambda expression do and how is it doing it, step-by-step? Thanks! – dsanchez Oct 27 '18 at 06:09
  • The lambda function lets you transform the data in `N` before calculating a percentile. Say you actually have a list of tuples `N = [(1, 2), (3, 1), ..., (5, 1)]` and you want to get the percentile of the _first_ element of the tuples, then you choose `key=lambda x: x[0]`. You could also apply some (order-changing) transformation to the list elements before calculating a percentile. – Elias Strehle Nov 25 '19 at 11:55
  • @dsanchez In this case, the lambda can be used to map the data being passed in to values that can be evaluated as a percentile. In theory, you could pass the function a sorted dictionary of words and use the lambda to map the entries as word lengths, sums of ASCII values, etc. The lambda takes each entry x and maps it to a new value. The default parameter passed here just maps each entry as itself (x:x). – mdhansen Mar 03 '21 at 16:14
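To make the comments above concrete, here is a small usage sketch of the recipe with a key function. The function body is a condensed copy of the recipe; the sample data (a list of (score, label) tuples) is made up for illustration.

```python
import math

def percentile(N, percent, key=lambda x: x):
    """Condensed copy of the recipe above; N must already be sorted."""
    if not N:
        return None
    k = (len(N) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    # Linear interpolation between the two neighbouring values.
    return key(N[f]) * (c - k) + key(N[c]) * (k - f)

# Hypothetical data: (score, label) tuples, sorted by score before the call.
records = sorted([(30, "c"), (10, "a"), (40, "d"), (20, "b")])

# key picks the first element of each tuple as the value to interpolate.
median_score = percentile(records, 0.5, key=lambda x: x[0])
print(median_score)  # 25.0
```

Sorting happens outside the function, which is why the caller has to pass an already sorted list.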
37
import numpy as np
a = [154, 400, 1124, 82, 94, 108]
print(np.percentile(a,95)) # gives the 95th percentile
richie
34

Starting with Python 3.8, the standard library comes with the quantiles function as part of the statistics module:

from statistics import quantiles

quantiles([1, 2, 3, 4, 5], n=100)
# [0.06, 0.12, 0.18, 0.24, 0.3, 0.36, 0.42, 0.48, 0.54, 0.6, 0.66, 0.72, 0.78, 0.84, 0.9, 0.96, 1.02, 1.08, 1.14, 1.2, 1.26, 1.32, 1.38, 1.44, 1.5, 1.56, 1.62, 1.68, 1.74, 1.8, 1.86, 1.92, 1.98, 2.04, 2.1, 2.16, 2.22, 2.28, 2.34, 2.4, 2.46, 2.52, 2.58, 2.64, 2.7, 2.76, 2.82, 2.88, 2.94, 3.0, 3.06, 3.12, 3.18, 3.24, 3.3, 3.36, 3.42, 3.48, 3.54, 3.6, 3.66, 3.72, 3.78, 3.84, 3.9, 3.96, 4.02, 4.08, 4.14, 4.2, 4.26, 4.32, 4.38, 4.44, 4.5, 4.56, 4.62, 4.68, 4.74, 4.8, 4.86, 4.92, 4.98, 5.04, 5.1, 5.16, 5.22, 5.28, 5.34, 5.4, 5.46, 5.52, 5.58, 5.64, 5.7, 5.76, 5.82, 5.88, 5.94]
quantiles([1, 2, 3, 4, 5], n=100)[49] # 50th percentile (i.e. the median)
# 3.0

quantiles returns for a given distribution dist a list of n - 1 cut points separating the n quantile intervals (division of dist into n continuous intervals with equal probability):

statistics.quantiles(dist, *, n=4, method='exclusive')

where n, in our case (percentiles), is 100.

Xavier Guihot
  • 2
    Just a note. With method="exclusive" p99 can be larger than maximum value in original list. If it is not what you want, i.e. you want p100 = max, then use method="inclusive". – Amaimersion May 20 '22 at 08:53
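The difference between the two methods mentioned in the comment can be checked directly. A minimal sketch using the same five-element list as above (the variable names are illustrative):

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5]

# Default method: the 99th percentile is extrapolated beyond the largest value.
p99_exclusive = quantiles(data, n=100, method="exclusive")[98]

# Inclusive method: treats the data as the whole population,
# so every cut point stays between min(data) and max(data).
p99_inclusive = quantiles(data, n=100, method="inclusive")[98]

print(p99_exclusive, p99_inclusive)
```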
27

Here's how to do it without numpy, using only python to calculate the percentile.

import math

def percentile(data, perc: int):
    size = len(data)
    # Nearest-rank method; max() keeps perc=0 from wrapping around to index -1 (the maximum).
    return sorted(data)[max(math.ceil((size * perc) / 100) - 1, 0)]

percentile([10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0], 90)
# 9.0
percentile([142, 232, 290, 120, 274, 123, 146, 113, 272, 119, 124, 277, 207], 50)
# 146
Pavel Vlasov
Ashkan
13

The definition of percentile I usually see expects as a result the value from the supplied list below which P percent of values are found... which means the result must be from the set, not an interpolation between set elements. To get that, you can use a simpler function.

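def percentile(N, P):
    """
    Find the percentile of a list of values

    @parameter N - A list of values.  N must be sorted.
    @parameter P - A float value from 0.0 to 1.0

    @return - The percentile of the values.
    """
    # int(P * len(N)) + 1 rounds P * len(N) + 0.5 half up, as Python 2's round() did;
    # Python 3's round() rounds halves to even and would give 8 and 40 for P=0.8 below.
    # min() keeps P=1.0 from indexing past the end of the list.
    n = min(int(P * len(N)) + 1, len(N))
    return N[n-1]

# A = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# B = (15, 20, 35, 40, 50)
#
# print(percentile(A, P=0.3))
# 4
# print(percentile(A, P=0.8))
# 9
# print(percentile(B, P=0.3))
# 20
# print(percentile(B, P=0.8))
# 50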

If you would rather get the value from the supplied list at or below which P percent of values are found, then use this simple modification:

def percentile(N, P):
    n = int(P * len(N)) + 1  # P * len(N) + 0.5 rounded half up; Python 3's round() rounds halves to even
    if n > 1:
        return N[n-2]
    else:
        return N[0]

Or with the simplification suggested by @ijustlovemath:

def percentile(N, P):
    n = max(int(P * len(N)) + 1, 2)  # P * len(N) + 0.5 rounded half up; Python 3's round() rounds halves to even
    return N[n-2]
mpounsett
  • thanks, I also expect percentile/median to result actual values from the sets and not interpolations – hansaplast Nov 16 '11 at 15:44
  • 1
    Hi @mpounsett. Thank you for the upper code. Why does your percentile always return integer values? The percentile function should return the N-th percentile of a list of values, and this can be a float number too. For example, the Excel ```PERCENTILE``` function returns the following percentiles for your upper examples: ```3.7 = percentile(A, P=0.3)```,```0.82 = percentile(A, P=0.8)```, ```20 = percentile(B, P=0.3)```, ```42 = percentile(B, P=0.8)```. – marco Jun 07 '16 at 10:41
  • 1
    It's explained in the first sentence. The more common definition of percentile is that it is the number in a series below which P percent of values in the series are found. Since that is the index number of an item in a list, it cannot be a float. – mpounsett Aug 08 '16 at 18:59
  • 1
    This doesn't work for the 0'th percentile. It returns the maximum value. A quick fix would be to wrap the `n = int(...)` in a `max(int(...), 1)` function – ijustlovemath Dec 14 '16 at 22:07
  • To clarify, do you mean in the second example? I get 0 rather than the maximum value. The bug is actually in the else clause.. I printed the index number rather than the value I intended to. Wrapping the assignment of 'n' in a max() call would also fix it, but you'd want the second value to be 2, not 1. You could then eliminate the entire if/else structure and just print the result of N[n-2]. 0th percentile works fine in the first example, returning '1' and '15' respectively. – mpounsett Jan 10 '17 at 16:19
  • sorry for the downvote. Accidental. Made on my phone without noticing it. it is now locked!! – keepAlive Mar 06 '20 at 00:40
6

Check the scipy.stats module:

 scipy.stats.scoreatpercentile
karthikr
Evert
2

To calculate the percentile rank of each value in a series, run:

from scipy.stats import rankdata
import numpy as np

def calc_percentile(a, method='min'):
    if isinstance(a, list):
        a = np.asarray(a)
    return rankdata(a, method=method) / float(len(a))

For example:

a = range(20)
print({val: round(float(pct), 3) for val, pct in zip(a, calc_percentile(a))})
# {0: 0.05, 1: 0.1, 2: 0.15, 3: 0.2, 4: 0.25, 5: 0.3, 6: 0.35, 7: 0.4, 8: 0.45, 9: 0.5, 10: 0.55, 11: 0.6, 12: 0.65, 13: 0.7, 14: 0.75, 15: 0.8, 16: 0.85, 17: 0.9, 18: 0.95, 19: 1.0}
Roei Bahumi
2

A convenient way to calculate percentiles for a one-dimensional numpy sequence or matrix is by using numpy.percentile <https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html>. Example:

import numpy as np

a = np.array([0,1,2,3,4,5,6,7,8,9,10])
p50 = np.percentile(a, 50) # return 50th percentile, i.e. median.
p90 = np.percentile(a, 90) # return 90th percentile.
print('median = ',p50,' and p90 = ',p90) # median =  5.0  and p90 =  9.0

However, if there is any NaN value in your data, np.percentile returns nan. In that case, use the numpy.nanpercentile <https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanpercentile.html> function, which ignores NaN values:

import numpy as np

a_NaN = np.array([0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.])
a_NaN[0] = np.nan
print('a_NaN',a_NaN)
p50 = np.nanpercentile(a_NaN, 50) # return 50th percentile, i.e. median.
p90 = np.nanpercentile(a_NaN, 90) # return 90th percentile.
print('median = ',p50,' and p90 = ',p90) # median =  5.5  and p90 =  9.1

Both functions also let you choose the interpolation mode (the interpolation keyword was renamed to method in NumPy 1.22, and the old name is deprecated). The examples below show the effect of each mode.

import numpy as np

b = np.array([1,2,3,4,5,6,7,8,9,10])
print('percentiles using default interpolation')
p10 = np.percentile(b, 10) # return 10th percentile.
p50 = np.percentile(b, 50) # return 50th percentile, i.e. median.
p90 = np.percentile(b, 90) # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.9 , median =  5.5  and p90 =  9.1

print('percentiles using interpolation = ', "linear")
p10 = np.percentile(b, 10,interpolation='linear') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='linear') # return 50th percentile, i.e. median.
p90 = np.percentile(b, 90,interpolation='linear') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.9 , median =  5.5  and p90 =  9.1

print('percentiles using interpolation = ', "lower")
p10 = np.percentile(b, 10,interpolation='lower') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='lower') # return 50th percentile, i.e. median.
p90 = np.percentile(b, 90,interpolation='lower') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1 , median =  5  and p90 =  9

print('percentiles using interpolation = ', "higher")
p10 = np.percentile(b, 10,interpolation='higher') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='higher') # return 50th percentile, i.e. median.
p90 = np.percentile(b, 90,interpolation='higher') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  2 , median =  6  and p90 =  10

print('percentiles using interpolation = ', "midpoint")
p10 = np.percentile(b, 10,interpolation='midpoint') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='midpoint') # return 50th percentile, i.e. median.
p90 = np.percentile(b, 90,interpolation='midpoint') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.5 , median =  5.5  and p90 =  9.5

print('percentiles using interpolation = ', "nearest")
p10 = np.percentile(b, 10,interpolation='nearest') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='nearest') # return 50th percentile, i.e. median.
p90 = np.percentile(b, 90,interpolation='nearest') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  2 , median =  5  and p90 =  9

If your input array consists only of integer values, you might want the percentile as an integer as well. If so, choose an interpolation mode such as ‘lower’, ‘higher’, or ‘nearest’.
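On newer NumPy versions the same modes are selected through the method keyword. A minimal sketch, assuming NumPy 1.22 or later (the variable names are illustrative):

```python
import numpy as np

b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Assumes NumPy >= 1.22, where `method` replaces the older `interpolation` keyword.
p10_lower = np.percentile(b, 10, method="lower")
p10_nearest = np.percentile(b, 10, method="nearest")
p90_higher = np.percentile(b, 90, method="higher")

print(p10_lower, p10_nearest, p90_higher)

# Each result is an element of the input array, not an interpolated value.
all_members = all(p in b for p in (p10_lower, p10_nearest, p90_higher))
print(all_members)
```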

Italo Gervasio
  • 1
    Thanks For mentioning the `interpolation` option since without it the outputs were misleading – Cypher Jan 24 '21 at 13:05
1

In case you need the answer to be a member of the input numpy array:

Note that the percentile function in numpy by default calculates the output as a linearly weighted average of the two neighboring entries in the input vector. If you want the returned percentile to be an actual element of the vector, from v1.9.0 onwards you can use the "interpolation" option with either "lower", "higher" or "nearest".

import numpy as np
x=np.random.uniform(10,size=(1000))-5.0

np.percentile(x,70) # 70th percentile

2.075966046220879

np.percentile(x,70,interpolation="nearest")

2.0729677997904314

The latter is an actual entry in the vector, while the former is a linear interpolation of the two vector entries that border the percentile.

ClimateUnboxed
1

For a pandas Series, use the describe function.

Suppose you have a DataFrame df with the columns sales and id, and you want to calculate percentiles for sales. Then it works like this:

df['sales'].describe(percentiles = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])

0.0: minimum
1: maximum
0.1: 10th percentile, and so on
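If only the percentile values are needed, without the rest of the describe summary, Series.quantile returns them directly. A minimal sketch; the data frame contents are made up for illustration:

```python
import pandas as pd

# Hypothetical data frame with the two columns mentioned above.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "sales": [10, 20, 30, 40, 50]})

# Series.quantile takes fractions between 0 and 1 and returns only the requested values.
deciles = df["sales"].quantile([0.1, 0.5, 0.9])
print(deciles)
```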
Ropali Munshi
ashwini
0

I bootstrapped the data and then plotted the confidence interval for 10 samples. The confidence interval shows the range between the 5th and 95th percentiles of each sample.

 import pandas as pd
 import matplotlib.pyplot as plt
 import seaborn as sns
 import numpy as np
 import json
 import dc_stat_think as dcst

 data = [154, 400, 1124, 82, 94, 108]
 #print (np.percentile(data,[0.5,95])) # gives the 95th percentile

 bs_data = dcst.draw_bs_reps(data, np.mean, size=6*10)

 #print(np.reshape(bs_data,(24,6)))

 x= np.linspace(1,6,6)
 print(x)
 for (item1,item2,item3,item4,item5,item6) in bs_data.reshape((10,6)):
     line_data=[item1,item2,item3,item4,item5,item6]
     ci=np.percentile(line_data,[5,95])  # np.percentile expects percentiles on a 0-100 scale
     mean_avg=np.mean(line_data)
     fig, ax = plt.subplots()
     ax.plot(x,line_data)
     ax.fill_between(x, (line_data-ci[0]), (line_data+ci[1]), color='b', alpha=.1)
     ax.axhline(mean_avg,color='red')
     plt.show()
Golden Lion