
I'm migrating some legacy code from R to Python and I'm having trouble matching the quantile results with numpy percentile.

Given the following list of numbers:

a1 = [
    5.75,6.13333333333333,7.13636363636364,9,10.1,4.80952380952381,8.82926829268293,4.7906976744186,3.83333333333333,6,6.1,
    8.88235294117647,30,5.7,3.98507462686567,6.83333333333333,8.39805825242718,4.78260869565217,7.26356589147287,5.67857142857143,
    3.58333333333333,6.69230769230769,14.3333333333333,14.3333333333333,5.125,5.16216216216216,5.36363636363636,10.7142857142857,
    4.90909090909091,7.5,8,6,6.93939393939394,10.4,6,6.8,5.33333333333333,10.3076923076923,4.5625,5.4,6.44,3.36363636363636,
    11.1666666666667,4.5,7.35714285714286,10.6363636363636,9.26746031746032,3.83333333333333,5.75,9.14285714285714,8.27272727272727,
    5,5.92307692307692,5.23076923076923,4.09375,6.25,4.63888888888889,6.07142857142857,5,5.42222222222222,3.93892045454545,4.8,
    8.71428571428571,6.25925925925926,4.12,5.30769230769231,4.26086956521739,5.22222222222222,4.64285714285714,5,3.64705882352941,
    5.33333333333333,3.65217391304348,3.54166666666667,10.0952380952381,3.38235294117647,8.67123287671233,2.66666666666667,3.5,4.875,
    4.5,6.2,5.45454545454545,4.89189189189189,4.71428571428571,1,5.33333333333333,6.09090909090909,4.36756756756757,6,5.17197452229299,
    4.48717948717949,5.01219512195122,4.83098591549296,5.25,8.52,5.47692307692308,5.45454545454545,8.6578947368421,8.35714285714286,3.25,
    8.5,4,5.95652173913043,7.05882352941176,7.5,8.6,8.49122807017544,5.14285714285714,4,13.3294117647059,9.55172413793103,5.57446808510638,
    4.5,8,4.11764705882353,3.9,5.14285714285714,6,4.66666666666667,6,3.75,4.93333333333333,4.5,5.21666666666667,6.53125,6,7,7.28333333333333,
    7.34615384615385,7.15277777777778,8.07936507936508,11.609756097561
]

Using quantile in R such that

quantile(a1, probs=.05, type=2)

Gives a result of 3.541667

Trying all of the interpolation methods in numpy to find the same result:

{x:np.percentile(a1,q=5, interpolation=x) for x in ['linear','lower','higher','nearest','midpoint']}

Yields

{'linear': 3.566666666666666,
 'lower': 3.54166666666667,
 'higher': 3.58333333333333,
 'nearest': 3.58333333333333,
 'midpoint': 3.5625}

As we can see, the lower interpolation method returns the same result as R's quantile type 2.

However, trying again with a different probability in R, we get different results:

quantile(a1, probs=.95, type=2)

Gives a result of 10.71429

And with numpy:

{x:np.percentile(a1,q=95, interpolation=x) for x in ['linear','lower','higher','nearest','midpoint']}

Yields

{'linear': 10.667532467532439,
 'lower': 10.6363636363636,
 'higher': 10.7142857142857,
 'nearest': 10.6363636363636,
 'midpoint': 10.67532467532465}

In this case, the higher interpolation method returns the same result.

I'm hoping that someone familiar enough with the R quantile types can help me reproduce the same quantile logic in numpy.

Chris
  • I don't believe any of the discontinuous quantiles in R (types 1, 2, 3) have off-the-shelf implementations in any of the usual Python libraries. The *closest* I've found is `scipy`'s https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mquantiles.html; the documentation even gives you the parameter values to reproduce the R quantiles, but only for types 4-9 – ALollz Apr 23 '20 at 16:22
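
As that comment notes, scipy's `mquantiles` only covers the continuous R types (4-9). For comparison, a minimal sketch of that route (assuming scipy is available; per the scipy docs, `alphap=1, betap=1` corresponds to R's type 7, the default in both R and numpy, not the type=2 needed here):

from scipy.stats.mstats import mquantiles

# (alphap, betap) = (1, 1) reproduces R's type 7 / numpy's default 'linear',
# so this should agree with np.percentile(a1, [5, 95]) above, not with type=2
mquantiles(a1, prob=[0.05, 0.95], alphap=1, betap=1)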

1 Answer


You can implement this yourself. With type=2 it's a rather simple calculation: you take the next highest order statistic, or, at a discontinuity (e.g. 100 values and p=0.06, which falls exactly on the 6th value), you take the average of that order statistic and the next greater order statistic.

import numpy as np

def R_type2(arr, p):
    """
    arr : array-like
    p : float in [0, 1]
    """
    # m = 0 for Q_2(p) in R's parameterization
    x = np.sort(arr)
    n = len(x)

    aleph = n*p
    k = np.floor(np.array(aleph).clip(1, n-1)).astype(int)
    gamma = {False: 1, True: 0.5}.get(aleph == k)   # 0.5 at a discontinuity, 1 otherwise

    # Deal with the case where it should be the smallest value
    if aleph < 1:
        return x[k-1]  # x[0]
    else:
        return (1.-gamma)*x[k-1] + gamma*x[k]


R_type2(a1, 0.05)
#3.54166666666667

R_type2(a1, 0.95)
#10.7142857142857

A word of caution: `k` will be an integer while `n*p` is a float. In general it's a very bad idea to test `aleph == k`, because floating-point inaccuracies cause problems. For instance, with 100 numbers, p=0.07 is NOT treated as a discontinuity because 0.07 cannot be represented exactly. However, because R seems to implement a pure equality check, I left it as above for consistency.
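
This is easy to see with plain Python floats (the same arithmetic the function does):

100 * 0.07
#7.000000000000001

100 * 0.07 == 7
#False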

I personally would favor changing the equality check from `{False: 1, True: 0.5}.get(aleph==k)` to `{False: 1, True: 0.5}.get(np.isclose(aleph, k))` so that floating-point issues don't become a problem.
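
With `np.isclose`, that near-integer case is again treated as the discontinuity it mathematically is:

import numpy as np

np.isclose(100 * 0.07, 7)
#True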

ALollz
  • I had come to a far less eloquently coded solution such as this based on some git repositories that attempted to do this. Your answer is much better looking, and hopefully helps others out if they come across the same issue! – Chris Apr 23 '20 at 19:01
  • @Chris To be fair this is **very** bare-bones. For a real function, there should probably be a fair amount of error handling and edge cases you could add (for instance providing an array with a length of 0 or 1). You could also make this so that `p` can be an array of quantiles as opposed to a single value. While none of that is really difficult, I felt like it was more distracting to the core question. – ALollz Apr 23 '20 at 19:05
  • I did find a pretty well fleshed-out example: https://github.com/eurostat/quantile/blob/master/src/quantile.py – Chris Apr 23 '20 at 19:44
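
For reference, a rough sketch of the vectorized variant mentioned in the comments above (an array of probabilities and `np.isclose` in place of exact equality) could look like this; the name `R_type2_vec` is just for illustration, and it's only checked against the two probabilities from the question:

import numpy as np

def R_type2_vec(arr, p):
    """Sketch of R's type-2 quantile for one or more probabilities p in [0, 1]."""
    x = np.sort(np.asarray(arr, dtype=float))
    n = len(x)
    p = np.atleast_1d(np.asarray(p, dtype=float))

    aleph = n * p
    k = np.floor(aleph.clip(1, n - 1)).astype(int)
    gamma = np.where(np.isclose(aleph, k), 0.5, 1.0)  # 0.5 at a discontinuity, 1 otherwise

    out = (1.0 - gamma) * x[k - 1] + gamma * x[k]
    return np.where(aleph < 1, x[0], out)  # very small p maps to the smallest value

R_type2_vec(a1, [0.05, 0.95])
#[3.54166666666667, 10.7142857142857], matching the two R results above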