Mapping each value in a list to its percentile of a different distribution

Question

I have a list scores and a list distribution. I need to map each score in scores to what its corresponding percentile would be in distribution.

Example:

distribution=[4,10,3,5,1]
scores = [1,6,11]

The result of the operation should be [20,80,100]

A similar question, "Map each list value to its corresponding percentile", has been asked before, but in my case scipy.stats.rankdata is not an option because I need the percentile of each item relative to a different distribution.

The natural way to solve it is [scipy.stats.percentileofscore(distribution,s) for s in scores], but this is extremely slow when scores or distribution is large (more than about 10,000 elements each).

Is there any way to speed this up considerably? I've tried sorting the distribution list first and then doing a standard search, but the worst case is still quite bad.

score 0 · Answer 1 · answered Dec 01 '17 at 18:58

Look into binning: use your reference distribution as the data set, and the scores as the bin boundaries. The result will be bins of values from the distribution, such as:

[ [1], [4, 3, 5], [10] ]

You now take the length of each bin (which some binning packages return with the binning lists) and divide by the total distribution population; this gives you incremental percentiles:

[0.20, 0.60, 0.20]

From here, the cumulative sum is trivial:

[0.20, 0.80, 1.0]
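
The steps above can be sketched with NumPy roughly as follows. This is only an illustration under some assumptions: the function name percentiles_by_binning is made up for this example, a value equal to a score is counted in that score's bin (a "less than or equal" definition), and np.digitize plus np.bincount stand in for whatever binning package you prefer.

```python
import numpy as np

def percentiles_by_binning(distribution, scores):
    """Hypothetical helper: percentile of each score within `distribution`."""
    dist = np.asarray(distribution)
    scores = np.asarray(scores)

    # Bin boundaries must be sorted; remember the order to undo it later.
    order = np.argsort(scores, kind="stable")
    edges = scores[order]

    # Bin i holds values x with edges[i-1] < x <= edges[i].
    # Values above the largest score land in bin len(edges).
    idx = np.digitize(dist, edges, right=True)

    # Length of each bin, dropping the overflow bin above the largest score.
    counts = np.bincount(idx, minlength=len(edges) + 1)[:len(edges)]

    # Cumulative counts divided by the population give the percentiles.
    pct_sorted = 100.0 * np.cumsum(counts) / len(dist)

    # Put the results back in the original order of `scores`.
    out = np.empty(len(scores), dtype=float)
    out[order] = pct_sorted
    return out

print(percentiles_by_binning([4, 10, 3, 5, 1], [1, 6, 11]))
```

For the example in the question, the bin lengths are 1, 3 and 1, so the cumulative percentiles come out as 20, 80 and 100.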

Does that get you moving?
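
If an explicit binning step is more than you need, a closely related sketch is to sort the distribution once and let np.searchsorted do a vectorized binary search for all scores. This is a different technique from the binning described above, although it should produce the same cumulative counts; the name percentiles_by_search is invented for the example. Note the assumption that side="right" counts values less than or equal to each score, which may differ from the default kind="rank" of scipy.stats.percentileofscore when the distribution contains ties.

```python
import numpy as np

def percentiles_by_search(distribution, scores):
    """Hypothetical helper: percent of `distribution` values <= each score."""
    # Sort once: O(n log n).
    sorted_dist = np.sort(np.asarray(distribution))

    # One binary search per score, all done in a single vectorized call.
    # side="right" returns how many values are <= the score.
    counts = np.searchsorted(sorted_dist, scores, side="right")

    return 100.0 * counts / len(sorted_dist)

print(percentiles_by_search([4, 10, 3, 5, 1], [1, 6, 11]))
```

The cost is one sort of the distribution plus a logarithmic lookup per score, rather than a full pass over the distribution for every score.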
