
I'd like to calculate the percentile of each value in a list (or numpy array), weighted by weights in another list. For example, given some f I'd like:

x = [1, 2, 3, 4]
weights = [2, 2, 3, 3]
f(x, weights)

to yield [20, 40, 70, 100].

I can calculate the unweighted percentile for a single item using

from scipy import stats
stats.percentileofscore(x, 3)
# 75.0

Per "Map each list value to its corresponding percentile" I can also calculate this for each item using

[stats.percentileofscore(x, a, 'rank') for a in x]
# [25.0, 50.0, 75.0, 100.0]

And per "Weighted version of scipy percentileofscore" I can calculate a single item's weighted percentile using:

import numpy as np

def weighted_percentile_of_score(x, weights, score, kind='weak'):
    npx = np.array(x)
    npw = np.array(weights)

    if kind == 'rank':  # Equivalent to 'weak' since we have weights.
        kind = 'weak'

    if kind in ['strict', 'mean']:
        indx = npx < score
        strict = 100 * npw[indx].sum() / npw.sum()
    if kind == 'strict':
        return strict

    if kind in ['weak', 'mean']:
        indx = npx <= score
        weak = 100 * npw[indx].sum() / npw.sum()
    if kind == 'weak':
        return weak

    if kind == 'mean':
        return (strict + weak) / 2

Called as:

weighted_percentile_of_score(x, weights, 3)  # 70.0 as desired.
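For completeness, the other kinds on the same data behave as follows (this snippet condenses the function above so it runs standalone):

```python
import numpy as np

def weighted_percentile_of_score(x, weights, score, kind='weak'):
    # Condensed restatement of the function above: compute both
    # comparisons, then pick the requested kind.
    npx, npw = np.asarray(x), np.asarray(weights, dtype=float)
    strict = 100 * npw[npx < score].sum() / npw.sum()
    weak = 100 * npw[npx <= score].sum() / npw.sum()
    return {'strict': strict, 'weak': weak, 'rank': weak,
            'mean': (strict + weak) / 2}[kind]

x = [1, 2, 3, 4]
weights = [2, 2, 3, 3]
weighted_percentile_of_score(x, weights, 3, kind='strict')  # 40.0
weighted_percentile_of_score(x, weights, 3, kind='mean')    # 55.0
```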

How do I do this (efficiently) for each item in the list?

Max Ghenis
2 Answers


Adapting this answer to "Weighted percentile using numpy", you can sort the arrays and then divide the cumsum of weights by the total weight:

import numpy as np

def weighted_percentileofscore(values, weights=None, values_sorted=False):
    """ Similar to scipy.percentileofscore, but supports weights.
    :param values: array-like with data.
    :param weights: array-like of the same length as `values`.
    :param values_sorted: bool, if True, then will avoid sorting of initial array.
    :return: numpy.array with percentiles of sorted array.
    """
    values = np.array(values)
    if weights is None:
        weights = np.ones(len(values))
    weights = np.array(weights)

    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        weights = weights[sorter]

    total_weight = weights.sum()
    return 100 * np.cumsum(weights) / total_weight

Verifying:

weighted_percentileofscore(x, weights)
# array([ 20.,  40.,  70., 100.])

If unsorted arrays are passed, you'd have to map the result back to the original ordering, so it's best to sort first.
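If you do need the percentiles in the original (unsorted) order, one way is to invert the sort permutation afterwards; a sketch, with `unsorted_weighted_percentileofscore` as a hypothetical wrapper name:

```python
import numpy as np

def weighted_percentileofscore(values, weights, values_sorted=False):
    # Same idea as the function above, condensed so this snippet is standalone.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if not values_sorted:
        sorter = np.argsort(values)
        values, weights = values[sorter], weights[sorter]
    return 100 * np.cumsum(weights) / weights.sum()

def unsorted_weighted_percentileofscore(values, weights):
    # Hypothetical wrapper: sort, compute percentiles on the sorted data,
    # then undo the sort so the result lines up with the original order.
    values = np.asarray(values)
    weights = np.asarray(weights)
    sorter = np.argsort(values)
    percentiles = weighted_percentileofscore(
        values[sorter], weights[sorter], values_sorted=True)
    inverse = np.empty_like(sorter)
    inverse[sorter] = np.arange(len(sorter))  # invert the permutation
    return percentiles[inverse]

unsorted_weighted_percentileofscore([3, 1, 4, 2], [3, 2, 3, 2])
# [70., 20., 100., 40.]
```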

This should be considerably faster than calculating separately for each value.

Max Ghenis

This isn't very efficient, but you can combine the approaches listed in the question:

[weighted_percentile_of_score(x, weights, val) for val in x]
# [20.0, 40.0, 70.0, 100.0]
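As a middle ground between the loop and the sorting approach, the 'weak' comparison can also be vectorized with broadcasting; this is a sketch rather than one of the approaches from the question, and it builds an n × n comparison matrix, so it trades memory for avoiding the Python-level loop:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
weights = np.array([2, 2, 3, 3])

# For each x[i], sum the weights of all values <= x[i] ('weak' kind),
# then normalize by the total weight.
pct = 100 * (x[None, :] <= x[:, None]) @ weights / weights.sum()
# [20., 40., 70., 100.]
```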
Max Ghenis