I have an array of 500 numbers and I'd like to calculate the percentile of each value:

import numpy as np
import pandas as pd

df = np.random.random(500)

desired_output = pd.DataFrame(df).rank(pct=True)

I want the same result without using pandas, because this runs inside a long loop and must be as fast as possible.

lara_toff
  • SciPy has [`rankdata`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rankdata.html); see https://stackoverflow.com/questions/5284646/rank-items-in-an-array-using-python-numpy-without-sorting-array-twice for more options. – Warren Weckesser Dec 06 '22 at 15:17
  • *"I'd like to calculate the percentile of each value"* Did you see [`numpy.percentile`](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html)? – Warren Weckesser Dec 06 '22 at 15:20
  • You can "rank" data (0 ... n-1) using numpy by `argsort`ing twice: `ranks = df.argsort().argsort()`. You can get the percentile via `desired_output = (ranks + 1) / df.shape[-1]`; note however that it does not deal with ties like pandas, so if that is necessary you will need to do more processing. – Chrysophylaxs Dec 06 '22 at 15:58
  • @WarrenWeckesser didn't think it needed to be said "without iterating through row by row which would be extremely slow". – lara_toff Dec 08 '22 at 17:10
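
The double-`argsort` suggestion in the comments can be written out with NumPy alone. This is a minimal sketch rather than a benchmarked solution: the function names `pct_rank_argsort` and `pct_rank_average` are made up for illustration, and the tie-averaging variant assumes the goal is to match the default `method='average'` behaviour of `rank(pct=True)` on a 1-D array without NaNs.

```python
import numpy as np


def pct_rank_argsort(values):
    """Percentile rank via double argsort (hypothetical helper).

    Tied values get distinct ranks, so this matches pandas only
    when every value is unique.
    """
    values = np.asarray(values)
    ranks = values.argsort().argsort()      # 0 .. n-1 along the last axis
    return (ranks + 1) / values.shape[-1]


def pct_rank_average(values):
    """Percentile rank with ties averaged (hypothetical helper, 1-D input).

    Assumption: mirrors the default 'average' tie handling of
    DataFrame.rank(pct=True) for data without NaNs.
    """
    values = np.asarray(values).ravel()
    _, inverse, counts = np.unique(
        values, return_inverse=True, return_counts=True
    )
    last = np.cumsum(counts)                # highest 1-based rank in each tie group
    average = last - (counts - 1) / 2.0     # mean rank within each tie group
    return average[inverse] / values.size


data = np.random.random(500)
print(pct_rank_argsort(data)[:5])
print(pct_rank_average(data)[:5])
```

When the data contain no repeated values, both functions should give the same result. Which one is faster inside the loop would need to be timed on the real data.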