from numpy import percentile
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# calculate quartiles
quartile_1 = percentile(data, 25)
quartile_3 = percentile(data, 75)
# calculate min/max

print(quartile_1)  # shows 3.25
print(quartile_3)  # shows 7.75

Can you explain how the values 3.25 and 7.75 are calculated? I expected them to be 3 and 8.

Fredrik Pihl
  • `percentile(..., interpolation='nearest')`, see [`numpy.percentile`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html). – HansHirse Nov 28 '19 at 10:54
  • See https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html – Fredrik Pihl Nov 28 '19 at 10:54
  • If you're expecting *exactly* `[3, 8]`: `numpy` does not use an iterative median method to determine quartiles; it *always* uses interpolation. – Daniel F Nov 28 '19 at 11:19

5 Answers

Step-by-step manual calculation of a NumPy percentile:

Step-1: Find length

x = [1,2,3,4,5,6,7,8,9,10]
l = len(x) 
# Output --> 10

Step-2: Subtract 1 to get the distance from the first to the last index in x

n = l - 1
# n = (10 - 1)
# Output --> 9

Step-3: Multiply n by the quantile, here 0.25 (the 25th percentile, or 1st quartile)

n * 0.25
# Therefore, (9 * 0.25)
# Output --> 2.25
# So the integer part of the index is 2 and the fractional part is 0.25
# m = 0.25

Step-4: Now get final answer

For Linear:

# i + (j - i) * m
# Here, i and j are the values at the two neighbouring indices
# x = [1,2,3,4,5,6,7,8,9,10]
#idx= [0,1,2,3,.........,9]
# So, for '2.25':
# the value at the index immediately before 2.25 (index=2) is i=3
# the value at the index immediately after 2.25 (index=3) is j=4
# and the fraction is m=0.25
3 + (4 - 3)*0.25
# Output --> 3.25
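
The same steps also explain the 7.75 from the question. The following is a minimal sketch of Steps 1 to 4 for the linear case; the function name `percentile_linear` and its variables are illustrative and are not part of NumPy.

```python
import math

import numpy as np


def percentile_linear(values, q):
    """Linearly interpolated percentile, following Steps 1-4 above."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0   # Steps 2-3: fractional index
    lo = math.floor(pos)                # index immediately before pos
    hi = min(lo + 1, len(data) - 1)     # index immediately after (clamped for q=100)
    m = pos - lo                        # fractional part
    i, j = data[lo], data[hi]
    return i + (j - i) * m              # Step 4, linear


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
q1 = percentile_linear(x, 25)  # index 2.25 -> 3 + (4 - 3) * 0.25
q3 = percentile_linear(x, 75)  # index 6.75 -> 7 + (8 - 7) * 0.75
```

For the 75th percentile the fractional index is 9 * 0.75 = 6.75, so the result lies three quarters of the way from 7 to 8.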

For Lower:

# Here, based on output from Step-3
# Because, it is '2.25', 
# Find the value at the nearest index below 2.25
# So, lower index is '2'
# x = [1,2,3,4,5,6,7,8,9,10]
#idx= [0,1,2,3,.........,9]
# So, at index=2 we have '3' 
# Output --> 3

For Higher:

# Here, based on output from Step-3
# Because, it is '2.25', 
# Find the value at the nearest index above 2.25
# So, higher index is '3'
# x = [1,2,3,4,5,6,7,8,9,10]
#idx= [0,1,2,3,.........,9]
# So, at index=3 we have '4' 
# Output --> 4

For Nearest:

# Here, based on output from Step-3
# Because, it is '2.25', 
# Find the value at the index nearest to 2.25
# So, nearest index is '2'
# x = [1,2,3,4,5,6,7,8,9,10]
#idx= [0,1,2,3,.........,9]
# So, at index=2 we have '3' 
# Output --> 3

For Midpoint:

# Here, based on output from Step-3
# (i + j)/2
# Here, i and j are the values at the two neighbouring indices
# x = [1,2,3,4,5,6,7,8,9,10]
#idx= [0,1,2,3,.........,9]
# So, for '2.25'
# the value at the index immediately before 2.25 (index=2) is i=3
# the value at the index immediately after 2.25 (index=3) is j=4
(3+4)/2
# Output --> 3.5
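
The five rules above differ only in how they turn the fractional index into a value. As a rough sketch, they can be collected into one helper; `percentile_by_method` is a hypothetical name, and the tie handling for 'nearest' (Python's `round`, which sends exact .5 indices to the even neighbour) is an assumption that was not checked against NumPy.

```python
import math


def percentile_by_method(values, q, method="linear"):
    """Percentile from a fractional index, using one of the five rules above."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0   # fractional index, e.g. 2.25
    lo, hi = math.floor(pos), math.ceil(pos)
    i, j = data[lo], data[hi]           # neighbouring values
    if method == "lower":
        return i
    if method == "higher":
        return j
    if method == "nearest":
        return data[round(pos)]         # assumption: ties handled by round()
    if method == "midpoint":
        return (i + j) / 2
    if method == "linear":
        return i + (j - i) * (pos - lo)
    raise ValueError(f"unknown method: {method!r}")


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
results_25 = {name: percentile_by_method(x, 25, name)
              for name in ("linear", "lower", "higher", "nearest", "midpoint")}
```

For the 25th percentile this reproduces the hand-calculated values: 3.25, 3, 4, 3 and 3.5.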

Code in Python:

x = np.array([1,2,3,4,5,6,7,8,9,10])
print("linear:", np.percentile(x, 25, interpolation='linear'))
print("lower:", np.percentile(x, 25, interpolation='lower'))
print("higher:", np.percentile(x, 25, interpolation='higher'))
print("nearest:", np.percentile(x, 25, interpolation='nearest'))
print("midpoint:", np.percentile(x, 25, interpolation='midpoint'))

Output:

linear: 3.25
lower: 3
higher: 4
nearest: 3
midpoint: 3.5
Nilesh Ingle

NumPy 1.9.0 and later have an optional 'interpolation' parameter, which defaults to 'linear'.

This optional parameter specifies the interpolation method to use when the desired percentile lies between two data points i < j:

‘linear’: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.

If you want to change that behavior, pass the argument explicitly to override the default, for example interpolation='nearest'.
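
One caveat for newer installations: from NumPy 1.22 onward this keyword is named `method`, and `interpolation` is deprecated. The sketch below tries the newer keyword first and falls back to the older one; the wrapper name `percentile_nearest` is made up for illustration.

```python
import numpy as np


def percentile_nearest(a, q):
    """Call np.percentile with the 'nearest' rule on old and new NumPy versions."""
    try:
        # NumPy >= 1.22 spells the keyword `method`.
        return np.percentile(a, q, method="nearest")
    except TypeError:
        # Older releases only know `interpolation`.
        return np.percentile(a, q, interpolation="nearest")


data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q1 = percentile_nearest(data, 25)
q3 = percentile_nearest(data, 75)
```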

mochsner

While this could be an interpolation issue, by some quartile methods (namely method 2) the answer should be exactly [3, 8].

As per my answer here and here, numpy uses method 3 instead.

Unfortunately, until the field of statistics comes up with a unified definition of what a quartile is, confusion will continue.
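
This answer does not spell out what "method 2" is. As an illustration only, the sketch below assumes it means a median-of-halves rule: split the sorted data at the median (including the median in both halves when the length is odd) and take the median of each half. The function name is hypothetical.

```python
import numpy as np


def quartiles_median_of_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    data = np.sort(np.asarray(values))
    n = len(data)
    half = n // 2
    lower = data[: half + (n % 2)]  # for odd n, the median belongs to both halves
    upper = data[half:]
    return float(np.median(lower)), float(np.median(upper))


q1, q3 = quartiles_median_of_halves([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

With ten values the halves are 1..5 and 6..10, whose medians are 3 and 8, the values the question expected.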

Daniel F

From the NumPy documentation:

Given a vector V of length N, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V. The values and distances of the two nearest neighbors as well as the interpolation parameter will determine the percentile if the normalized ranking does not match the location of q exactly. This function is the same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.

So the issue is how numpy behaves when no data point falls exactly on the requested quantile. If you use interpolation="nearest", you get the results you expected:

>>> from numpy import percentile
>>> import numpy as np
>>> data=np.array([1,2,3,4,5,6,7,8,9,10])
>>> # calculate quartiles
... quartile_1 = percentile(data, 25, interpolation="nearest")
>>> quartile_3 = percentile(data, 75, interpolation="nearest")
>>> print(quartile_1) 
3
>>> print(quartile_3) 
8
Christian W.

Several options are available, depending on which interpolation method you want the percentile to be calculated with.

a = np.arange(1, 11)
a  # array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

np.percentile(a, (25, 75), interpolation='midpoint') # array([3.5, 7.5])
np.percentile(a, (25, 75), interpolation='nearest')  # array([3, 8])
np.percentile(a, (25, 75), interpolation='linear')   # array([3.25, 7.75])
np.percentile(a, (25, 75), interpolation='lower')    # array([3, 7])
np.percentile(a, (25, 75), interpolation='higher')   # array([4, 8])

You will note that the cumulative relative frequency is what the percentiles need to be derived from:

c = np.cumsum(a)
c  # array([ 1,  3,  6, 10, 15, 21, 28, 36, 45, 55], dtype=int32)
c/c[-1] * 100
# array([  1.81818182,   5.45454545,  10.90909091,  18.18181818,
#         27.27272727,  38.18181818,  50.90909091,  65.45454545,
#         81.81818182, 100.        ])

The percentiles for 25 and 75 will therefore require an interpolation of some form.
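
Another way to see why interpolation is needed, sketched here for the default linear rule only: the sorted values can be thought of as sitting at evenly spaced percentile positions from 0 to 100, and neither 25 nor 75 coincides with one of those positions. The variable names below are illustrative.

```python
import numpy as np

a = np.arange(1, 11)

# Percentile position of each sorted element: 0, 11.1, 22.2, ..., 100
positions = np.linspace(0, 100, len(a))

# 25 falls between 22.2 (value 3) and 33.3 (value 4);
# 75 falls between 66.7 (value 7) and 77.8 (value 8).
quartiles = np.interp([25, 75], positions, a)
```

Up to floating-point rounding, `quartiles` agrees with `np.percentile(a, (25, 75))`.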

NaN