53

Is there a way to use the numpy.percentile function to compute weighted percentile? Or is anyone aware of an alternative python function to compute weighted percentile?

thanks!

user308827
  • IMO [the solution by Sam A below](https://stackoverflow.com/a/63440143/1156245) looks like a contender for current best practice. – geotheory Aug 24 '20 at 16:12

12 Answers

77

Completely vectorized numpy solution

Here is the code I use. It's not optimal (I couldn't write a fully optimal version with numpy), but it's still much faster and more reliable than the accepted solution.

import numpy as np

def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `values`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'

    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # To be consistent with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

Examples:

weighted_quantile([1, 2, 9, 3.2, 4], [0.0, 0.5, 1.])

array([ 1. , 3.2, 9. ])

weighted_quantile([1, 2, 9, 3.2, 4], [0.0, 0.5, 1.], sample_weight=[2, 1, 2, 4, 1])

array([ 1. , 3.2, 9. ])
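
A small illustration of the `old_style` flag, using the three-element array discussed in the comments below (results worked out by hand from the code above, so worth double-checking):

weighted_quantile([0, 0.5, 1], [1/3])                  # -> array([0.25])
weighted_quantile([0, 0.5, 1], [1/3], old_style=True)  # -> array([0.33333333])
np.percentile([0, 0.5, 1], 100/3)                      # -> 0.3333333333333333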

Alleo
  • Nice code. What's the difference for old_style? I haven't got the point yet. – Syrtis Major Oct 14 '15 at 05:31
  • @SubStruct: there is a minor difference in how the quantile is defined. Say you have three elements: I would expect its 0.5 quantile to be the median (which is true in both cases) and its 0.33 quantile to be the mean of the first two elements. For `old_style` (the `numpy.percentile` way) this is not true. The difference in practice is minor. – Alleo Oct 14 '15 at 20:04
  • Thank you. I noticed that the difference is indeed minor for large samples. However, I don't understand why we would expect the 0.33 quantile to be 0.25 for the array [0, 0.5, 1]. – Syrtis Major Oct 15 '15 at 03:29
  • To me it seems more natural that the 0.33 quantile of the array [0, 0.5, 1] is 0.33. Or is it a matter of definition, and we just choose one according to the question at hand? Of course, it's not a problem in most cases. – Syrtis Major Oct 15 '15 at 04:22
  • @SubStruct my intuition is the following: 0.33 should correspond to the point where 1/3 of the elements are smaller and 2/3 are greater. Taking the point 'right in the middle' seems a good solution, though actually I could use any point between the first two elements. – Alleo Oct 15 '15 at 12:25
  • I have updated your code a bit for my needs. First, I think the way you have coded `values_sorted` is a bit counter-intuitive to me. I look at it like someone is asking me, "Do you want your values sorted?" To which I answer, "Yes", and they sort the values. So, I changed `values_sorted` to be `True` by default. Second, my data has NaN values. I have included another variable `remove_nan=True` to deal with them. So I have another if statement, and I use `numpy.isfinite()` to get rid of infinite values and corresponding weights. Thanks for the great code! I stole it, but opensource! ;-) – Kartik Dec 04 '15 at 05:17
  • Nice implementation of the method introduced in the last section of the Wikipedia page on weighted percentiles ([link](https://en.wikipedia.org/wiki/Percentile#Definition_of_the_Weighted_Percentile_method)). – Li-Pin Juan Apr 18 '17 at 20:57
  • Note: For integer weights, the result of this function will be different from the more naive (or "correct", depending on definition) method of "repeating each value k times, where k is the weight", because it interpolates between a single point (with weight k) instead of k points of identical height. For example, if values=[1, 2] and sample_weight=[1, 3], the weighted median is 1.75, but the un-weighted median of [1, 2, 2, 2] would be 2. – jick Sep 16 '19 at 23:54
  • @jick this isn't just for integer weights, right? This occurs any time the median lies "between" two points, e.g. `weighted_quantile([1, 2], 0.5, [1, 30.5])` also returns 1.97, where it really should be 2. Is there a way to get this result? – Max Ghenis Dec 05 '19 at 04:25
  • In general I'm struggling to create a simple, viable test of this function. The one provided in the answer doesn't test that the weighting works, since the answer is identical to the unweighted median. – Max Ghenis Dec 05 '19 at 04:31
  • @MaxGhenis I think you're right - it's just that with integer weight it's easier to assume that (weight 3) represents (same value repeated 3 times), which tripped me. :) – jick Dec 05 '19 at 05:34
  • Since it uses interpolation with weights, a `round(weighted_quantile(...))` should give you the correct desired output for integers, shouldn't it? – user2161065 Dec 23 '20 at 14:07
  • @Alleo There seems to be an issue with this. I can't include the full example in a comment, but here's a pastebin that shows a failing example. I ask it for the `1` quantile, which should give me the maximum value regardless of weighting, but instead it gives a bizarrely low result. https://pastebin.com/4dxBHWAM – Jordan Jan 24 '22 at 02:48
  • Disregard, it was an issue with numpy overflowing the int32 it was defaulting to. – Jordan Jan 24 '22 at 03:08
14

A quick solution, by first sorting and then interpolating:

import numpy as np

def weighted_percentile(data, percents, weights=None):
    ''' percents in units of 1%
        weights specifies the frequency (count) of data.
    '''
    if weights is None:
        return np.percentile(data, percents)
    ind = np.argsort(data)
    d = data[ind]
    w = weights[ind]
    p = 1. * w.cumsum() / w.sum() * 100
    y = np.interp(percents, p, d)
    return y
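
A quick usage sketch (same data as the top answer; the 2.6 comes from this particular CDF/interpolation convention and was worked out by hand, so worth double-checking):

data = np.array([1, 2, 9, 3.2, 4])
weights = np.array([2, 1, 2, 4, 1])
weighted_percentile(data, 50, weights)   # -> 2.6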
Kambrian (edited by Max Ghenis)
  • This produces different results for `weighted_percentile(np.array([0,3,6,9]),50,weights=np.array([1,3,3,1]))` and `weighted_percentile(np.array([0,3,3,3,6,6,6,9]),50,weights=None)` – Peter9192 Feb 20 '18 at 14:03
14

This now seems to be implemented in statsmodels:

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

wq = DescrStatsW(data=np.array([1, 2, 9, 3.2, 4]), weights=np.array([0.0, 0.5, 1.0, 0.3, 0.5]))
wq.quantile(probs=np.array([0.1, 0.9]), return_pandas=False)
# array([2., 9.])

The DescrStatsW object also has other methods implemented, such as weighted mean, etc. https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html
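
A minimal sketch of some of those other statistics (attribute names per the statsmodels documentation linked above):

wq.mean   # weighted mean
wq.std    # weighted standard deviation
wq.var    # weighted variance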

Sam A.
12

Cleaner and simpler, using this reference for the weighted percentile method:

import numpy as np

def weighted_percentile(data, weights, perc):
    """
    perc : percentile in [0-1]!
    """
    ix = np.argsort(data)
    data = data[ix] # sort data
    weights = weights[ix] # sort weights
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights) # 'like' a CDF function
    return np.interp(perc, cdf, data)
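
A quick usage sketch with the example data from the top answer (results follow from this CDF definition and were worked out by hand, so worth double-checking):

data = np.array([1, 2, 9, 3.2, 4])
weights = np.array([2, 1, 2, 4, 1])
weighted_percentile(data, weights, 0.5)           # -> 3.2 (weighted median)
weighted_percentile(data, weights, [0.25, 0.75])  # -> array([2., 4.])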
imbr
11

I don't know what a weighted percentile means, but from @Joan Smith's answer it seems that you just need to repeat every element in `ar`; you can use numpy.repeat():

import numpy as np
np.repeat([1,2,3], [4,5,6])

the result is:

array([1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3])
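
Combined with numpy.percentile, this gives a weighted percentile for integer weights; a minimal sketch:

import numpy as np

data = np.array([1, 2, 3])
weights = np.array([4, 5, 6])          # must be integers for np.repeat
expanded = np.repeat(data, weights)    # [1]*4 + [2]*5 + [3]*6
np.percentile(expanded, 50)            # -> 2.0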
HYRY
  • I suppose this is the better (as in more efficient) answer. – FooBar Sep 24 '14 at 19:46
  • Still, this only supports integer weights. And will most likely be very memory and CPU-time heavy for a larger data set. – PiHalbe Sep 26 '14 at 10:02
  • From personal experience I can confirm that this approach is definitely not efficient. If your vectors are long and the weights are big numbers your computer can quickly hit the memory limit. – geotheory Sep 10 '20 at 09:39
  • Also that wouldn't work with non-integer weights. – amyrit Jan 03 '22 at 13:55
10

Apologies for the additional (unoriginal) answer (not enough rep to comment on @nayyarv's). His solution worked for me (i.e. it replicates the default behavior of np.percentile), but I think you can eliminate the for loop with clues from how the original np.percentile is written.

import numpy as np

def weighted_percentile(a, q=np.array([75, 25]), w=None):
    """
    Calculates percentiles associated with a (possibly weighted) array

    Parameters
    ----------
    a : array-like
        The input array from which to calculate percents
    q : array-like
        The percentiles to calculate (0.0 - 100.0)
    w : array-like, optional
        The weights to assign to values of a.  Equal weighting if None
        is specified

    Returns
    -------
    values : np.array
        The values associated with the specified percentiles.  
    """
    # Standardize and sort based on values in a
    q = np.array(q) / 100.0
    if w is None:
        w = np.ones(a.size)
    idx = np.argsort(a)
    a_sort = a[idx]
    w_sort = w[idx]

    # Get the cumulative sum of weights
    ecdf = np.cumsum(w_sort)

    # Find the percentile index positions associated with the percentiles
    p = q * (w.sum() - 1)

    # Find the bounding indices (both low and high)
    idx_low = np.searchsorted(ecdf, p, side='right')
    idx_high = np.searchsorted(ecdf, p + 1, side='right')
    idx_high[idx_high > ecdf.size - 1] = ecdf.size - 1

    # Calculate the weights 
    weights_high = p - np.floor(p)
    weights_low = 1.0 - weights_high

    # Extract the low/high indexes and multiply by the corresponding weights
    x1 = np.take(a_sort, idx_low) * weights_low
    x2 = np.take(a_sort, idx_high) * weights_high

    # Return the average
    return np.add(x1, x2)

# Sample data
a = np.array([1.0, 2.0, 9.0, 3.2, 4.0], dtype=float)
w = np.array([2.0, 1.0, 3.0, 4.0, 1.0], dtype=float)

# Make an unweighted "copy" of a for testing
a2 = np.repeat(a, w.astype(int))

# Tests with different percentiles chosen
q1 = np.linspace(0.0, 100.0, 11)
q2 = np.linspace(5.0, 95.0, 10)
q3 = np.linspace(4.0, 94.0, 10)
for q in (q1, q2, q3):
    assert np.all(weighted_percentile(a, q, w) == np.percentile(a2, q))
grovduck
  • This is useful. However, I had to wrap `idx_high[idx_high > ecdf.size - 1] = ecdf.size - 1` in a conditional statement in order to make it work for single percentiles as well. Guess that's why there's the `zerod` in the numpy source code. – Peter9192 Feb 22 '18 at 15:26
4

The weightedcalcs package supports quantiles:

import weightedcalcs as wc
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 3], 'w': [3, 2, 1]})
calc = wc.Calculator('w')  # w designates weight

calc.quantile(df, 'v', 0.5)
# 1.5
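
The same Calculator object exposes other weighted statistics as well (method names per the weightedcalcs README, so worth verifying against the package docs):

calc.mean(df, 'v')     # weighted mean
calc.median(df, 'v')   # weighted median (same as the 0.5 quantile)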
Max Ghenis
3

As mentioned in the comments, simply repeating values is impossible for float weights and impractical for very large datasets. There is a library that does weighted percentiles here: http://kochanski.org/gpk/code/speechresearch/gmisclib/gmisclib.weighted_percentile-module.html It worked for me.

Qwerty
2

I use this function for my needs:

import numpy

def quantile_at_values(values, population, weights=None):
    values = numpy.atleast_1d(values).astype(float)
    population = numpy.atleast_1d(population).astype(float)
    # if no weights are given, use equal weights
    if weights is None:
        weights = numpy.ones(population.shape).astype(float)
        normal = float(len(weights))
    # else, check weights                  
    else:                                           
        weights = numpy.atleast_1d(weights).astype(float)
        assert len(weights) == len(population)
        assert (weights >= 0).all()
        normal = numpy.sum(weights)                    
        assert normal > 0.
    quantiles = numpy.array([numpy.sum(weights[population <= value]) for value in values]) / normal
    assert (quantiles >= 0).all() and (quantiles <= 1).all()
    return quantiles
  • It is vectorized as far as I could go.
  • It has a lot of sanity checks.
  • It works with floats as weights.
  • It can work without weights (→ equal weights).
  • It can compute multiple quantiles at once.

Multiply results by 100 if you want percentiles instead of quantiles.
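
A quick usage sketch (numbers worked out by hand from the definition above, so worth double-checking):

population = [1, 2, 9, 3.2, 4]
weights = [2, 1, 2, 4, 1]
quantile_at_values([2, 4], population, weights)
# -> array([0.3, 0.8]): 30% of the total weight lies at or below 2, 80% at or below 4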

PiHalbe
  • nb this returns the quantile at a value, as the function name says; interesting and related, but not answering the OP, which asks about percentiles (and no, a percentile is not simply quantile * 100). – ILoveCoding Jun 25 '15 at 15:59
2
import numpy as np

def weighted_percentile(a, percentile=np.array([75, 25]), weights=None):
    """
    O(nlgn) implementation for weighted_percentile.
    """
    percentile = np.array(percentile)/100.0
    if weights is None:
        weights = np.ones(len(a))
    a_indsort = np.argsort(a)
    a_sort = a[a_indsort]
    weights_sort = weights[a_indsort]
    ecdf = np.cumsum(weights_sort)

    percentile_index_positions = percentile * (weights.sum()-1)+1
    # need the 1 offset at the end due to ecdf not starting at 0
    locations = np.searchsorted(ecdf, percentile_index_positions)

    out_percentiles = np.zeros(len(percentile_index_positions))

    for i, empiricalLocation in enumerate(locations):
        # iterate across the requested percentiles 
        if ecdf[empiricalLocation-1] == np.floor(percentile_index_positions[i]):
            # i.e. is the percentile in between 2 separate values
            uppWeight = percentile_index_positions[i] - ecdf[empiricalLocation-1]
            lowWeight = 1 - uppWeight

            out_percentiles[i] = a_sort[empiricalLocation-1] * lowWeight + \
                                 a_sort[empiricalLocation] * uppWeight
        else:
            # i.e. the percentile is entirely in one bin
            out_percentiles[i] = a_sort[empiricalLocation]

    return out_percentiles

This is my function; it gives identical behaviour to

np.percentile(np.repeat(a, weights), percentile)

with less memory overhead. np.percentile is an O(n) implementation, so it's potentially faster for small weights. It has all the edge cases sorted out - it's an exact solution. The interpolation answers above assume a linear relationship, whereas the weighted empirical CDF is really a step function over most of its range, except where the weight is 1.

Say we have data [1, 2, 3] with weights [3, 7, 11] and I want the 25th percentile. My ecdf is going to be [3, 10, 21] and I'm looking for the 5th value. The interpolation approach will see (3, 1) and (10, 2) as the bracketing points and interpolate, giving 1.28, even though the percentile lies entirely in the 2nd bin with a value of 2.
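
A small check of the equivalence claim above (a sketch using those integer weights, so the repeat-based reference applies):

import numpy as np

a = np.array([1, 2, 3])
weights = np.array([3, 7, 11])

weighted_percentile(a, [25], weights)       # -> array([2.])
np.percentile(np.repeat(a, weights), [25])  # -> array([2.])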

nayyarv
1

Unfortunately, numpy doesn't have built-in weighted functions for everything, but you can always put something together:

def weight_array(ar, weights):
    zipped = zip(ar, weights)
    weighted = []
    for a, w in zipped:
        for j in range(w):
            weighted.append(a)
    return weighted


np.percentile(weight_array(ar, weights), 25)
Joan Smith (edited by user443854)
  • To add to this solution, you might try just `np.percentile(list(Counter(dict(zip(ar, weights))).elements()), 25)`. You'd need to `from collections import Counter`, and it doesn't do well with repeated keys in `ar`, but `Counter().elements()` is neat! – colcarroll Feb 18 '14 at 04:27
  • you are supposing weights to be integers – Ruggero Turra Aug 07 '14 at 07:44
  • Also, it will likely use a lot of excess memory and CPU time for storing and sorting, respectively. Not suited for huge amount of data. – PiHalbe Sep 26 '14 at 10:04
-1

Here is my solution:

import numpy as np

def my_weighted_perc(data, perc, weights=None):
    if weights is None:
        return np.nanpercentile(data, perc)
    else:
        # keep only entries where both the data and the weights are finite
        mask = (~np.isnan(data)) & (~np.isnan(weights))
        d = data[mask]
        wei = weights[mask]
        ix = np.argsort(d)
        d = d[ix]
        wei = wei[ix]
        wei_cum = 100. * np.cumsum(wei * 1. / np.sum(wei))
        return np.interp(perc, wei_cum, d)

It simply calculates the weighted CDF of the data and then uses it to estimate the weighted percentiles.

Luca Jokull