
I have an array of numbers and I'd like to create another array that represents the rank of each item in the first array. I'm using Python and NumPy.

For example:

array = [4,2,7,1]
ranks = [2,1,3,0]

Here's the best method I've come up with:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.arange(len(array))[temp.argsort()]

Are there any better/faster methods that avoid sorting the array twice?

joshayers

11 Answers


Use argsort twice: first to obtain the order of the array, then to obtain the ranking:

array = numpy.array([4,2,7,1])
order = array.argsort()
ranks = order.argsort()

When dealing with 2D (or higher dimensional) arrays, be sure to pass an axis argument to argsort to order over the correct axis.
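For example, a minimal sketch (with a made-up 2-row array) ranking within each row:

```python
import numpy as np

a = np.array([[4, 2, 7, 1],
              [9, 3, 5, 8]])

# rank within each row independently
ranks = a.argsort(axis=1).argsort(axis=1)
print(ranks)
```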

k.rooijers
  • Note that if numbers are repeated in your input array (eg. `[4,2,7,1,1]`) the output will rank those numbers based on their array position (`[3,2,4,0,1]`) – rcoup Jun 08 '11 at 01:53
  • Sorting twice is inefficient. @Sven Marnach's answer shows how to accomplish the ranking with a single call to `argsort`. – Warren Weckesser Jul 13 '15 at 15:13
  • @WarrenWeckesser: I just tested the difference between the two, and you're right for large arrays, but for anything smaller (n < 100), double argsort is faster (about 20% faster for n=100, and about 5 times faster for n=10). So if you have to do a lot of rankings across lots of small sets of values, this method is *much* better. – naught101 Aug 18 '15 at 02:49
  • That's a good point--I see roughly the same effect. For large enough input, the improved efficiency wins, but for small inputs, the double argsort apparently has less overhead. – Warren Weckesser Aug 18 '15 at 03:03
  • @WarrenWeckesser: Actually, I'm wrong, this method is hands-down better. Both methods are far faster than the scipy.stats method, too. Results: https://gist.github.com/naught101/14042d91a2d0f18a6ae4 – naught101 Aug 18 '15 at 03:03
  • @naught101: There is a bug in your script. The line `array = np.random.rand(10)` should be `array = np.random.rand(n)`. – Warren Weckesser Aug 18 '15 at 03:25
  • Wouldn't the following also be equivalent without having to sort twice? array = numpy.array([4,2,7,1]); order = array.argsort(); ranks = order[order] – Cesar Dec 18 '15 at 17:51
  • I couldn't find an explanation as to why double argsort works - I hope this helps http://www.berkayantmen.com/rank – bantmen Oct 20 '19 at 21:57
  • Doesn't work on an input with repeated elements. – ABCD Jan 09 '22 at 15:42
  • This method doesn't work when there is a tie in data – Mohammad Tehrani Nov 08 '22 at 19:08
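To make the tie behavior discussed in the comments concrete, a quick sketch:

```python
import numpy as np

a = np.array([4, 2, 7, 1, 1])
# tied values get distinct ranks, assigned by array position
ranks = a.argsort().argsort()
print(ranks)
```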

This question is a few years old, and the accepted answer is great, but I think the following is still worth mentioning. If you don't mind the dependence on scipy, you can use scipy.stats.rankdata:

In [22]: from scipy.stats import rankdata

In [23]: a = [4, 2, 7, 1]

In [24]: rankdata(a)
Out[24]: array([ 3.,  2.,  4.,  1.])

In [25]: (rankdata(a) - 1).astype(int)
Out[25]: array([2, 1, 3, 0])

A nice feature of rankdata is that the method argument provides several options for handling ties. For example, there are three occurrences of 20 and two occurrences of 40 in b:

In [26]: b = [40, 20, 70, 10, 20, 50, 30, 40, 20]

The default assigns the average rank to the tied values:

In [27]: rankdata(b)
Out[27]: array([ 6.5,  3. ,  9. ,  1. ,  3. ,  8. ,  5. ,  6.5,  3. ])

method='ordinal' assigns consecutive ranks:

In [28]: rankdata(b, method='ordinal')
Out[28]: array([6, 2, 9, 1, 3, 8, 5, 7, 4])

method='min' assigns the minimum rank of the tied values to all the tied values:

In [29]: rankdata(b, method='min')
Out[29]: array([6, 2, 9, 1, 2, 8, 5, 6, 2])

See the docstring for more options.
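Two more options worth knowing: method='dense' compresses the ranks so they are consecutive with no gaps, and method='max' assigns the maximum rank of the tied values:

```python
from scipy.stats import rankdata

b = [40, 20, 70, 10, 20, 50, 30, 40, 20]
print(rankdata(b, method='dense'))
print(rankdata(b, method='max'))
```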

Warren Weckesser
  • yeah, this is the best answer anywhere where edge cases are important. – naught101 Oct 20 '15 at 02:54
  • I find it interesting that `rankdata` seems to use the same mechanism as the accepted answer to generate the initial ranking internally. – AlexV May 17 '19 at 15:35
  • This is a great answer, it also works nice for my pandas DataFrames where I want to rank elements in each row of a 2D DataFrame but specifying `axis=0` in `rankdata()` and handle ties at the same time. – Laurin Herbsthofer Dec 01 '22 at 09:23

Use advanced indexing on the left-hand side in the last step:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.empty_like(temp)
ranks[temp] = numpy.arange(len(array))

This avoids sorting twice by inverting the permutation in the last step.
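A quick check that the indexed assignment really builds the inverse permutation and agrees with the double argsort:

```python
import numpy as np

array = np.array([4, 2, 7, 1])
temp = array.argsort()
ranks = np.empty_like(temp)
ranks[temp] = np.arange(len(array))  # scatter: invert the permutation

# composing with temp gives the identity, and the result matches argsort twice
print(ranks)
```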

Sven Marnach
  • Perfect, thank you! I knew there was a solution and it would seem obvious once I saw it. I did some testing with timeit, and this method is slightly slower for small arrays. On my machine they're equal when the array has 2,000 elements. At 20,000 elements, your method is about 25% faster. – joshayers Mar 12 '11 at 20:33
  • any recommendation on how to do this rowwise? – Xaser Nov 19 '17 at 14:09
  • For more than 1 dim see answer below. – mathtick Jan 22 '20 at 16:17
  • Just a _minor_ correction, the indirection `ranks[temp]` isn't technically *slicing*, but *indexing* (see https://numpy.org/doc/stable/reference/arrays.indexing.html) – Shmil The Cat Jan 19 '21 at 17:38
  • Looks a bit like the [implementation](https://github.com/scipy/scipy/blob/v1.6.3/scipy/stats/stats.py#L8172) of `scipy.stats.rankdata` (for ordinal rank). – djvg Jun 16 '21 at 17:14
  • This method doesn't work when there is a tie in data – Mohammad Tehrani Nov 08 '22 at 19:10

Using argsort() twice will do it:

>>> array = [4,2,7,1]
>>> ranks = numpy.array(array).argsort().argsort()
>>> ranks
array([2, 1, 3, 0])
Kwong

For a vectorized version of an averaged rank, see below. I love np.unique; it really widens the scope of what can and cannot be efficiently vectorized. Aside from avoiding Python for-loops, this approach also avoids the implicit double loop over a.

import numpy as np

a = np.array([4, 2, 7, 2, 1])
rank = a.argsort().argsort()

unique, inverse = np.unique(a, return_inverse=True)

# accumulate the sum of ranks and the count for each unique value
unique_rank_sum = np.zeros_like(unique)
np.add.at(unique_rank_sum, inverse, rank)
unique_count = np.zeros_like(unique)
np.add.at(unique_count, inverse, 1)

unique_rank_mean = unique_rank_sum.astype(float) / unique_count

rank_mean = unique_rank_mean[inverse]

print(rank_mean)
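As a sanity check (assuming SciPy is available), the averaged ranks computed above should match scipy.stats.rankdata with method='average', shifted to zero-based:

```python
import numpy as np
from scipy.stats import rankdata

a = np.array([4, 2, 7, 2, 1])
rank = a.argsort().argsort()

unique, inverse = np.unique(a, return_inverse=True)
rank_sum = np.zeros(len(unique))
np.add.at(rank_sum, inverse, rank)           # sum of ranks per unique value
rank_mean = (rank_sum / np.bincount(inverse))[inverse]

print(rank_mean)                             # averaged, zero-based ranks
print(rankdata(a, method='average') - 1)     # same values
```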
Eelco Hoogendoorn
  • by the way; I made this code to produce the same output as the other averaged rank code, but I can imagine the minimum rank of a group of repeating numbers works just as well. This can be obtained even more easily as >>> unique, index, inverse = np.unique(a, True, True) >>> rank_min = rank[index][inverse] – Eelco Hoogendoorn Jan 06 '14 at 20:47
  • I am getting the following error with your solution(numpy 1.7.1): AttributeError: 'numpy.ufunc' object has no attribute 'at' – Fear Jul 14 '17 at 09:33
  • This requires a more recent version of numpy; yours is quite ancient – Eelco Hoogendoorn Jul 14 '17 at 10:08

I tried to extend both solutions to arrays A with more than one dimension, supposing you process your array row-by-row (axis=1).

I extended the first code with a loop over rows; it can probably be improved:

temp = A.argsort(axis=1)
rank = np.empty_like(temp)
rangeA = np.arange(temp.shape[1])
for iRow in range(temp.shape[0]):
    rank[iRow, temp[iRow, :]] = rangeA

And the second one, following k.rooijers' suggestion, becomes:

temp = A.argsort(axis=1)
rank = temp.argsort(axis=1)

I randomly generated 400 arrays with shape (1000, 100); the first code took about 7.5 s, the second one about 3.8 s.
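On NumPy 1.15+, the row loop in the first snippet can also be replaced by np.put_along_axis; a sketch with a small made-up array:

```python
import numpy as np

A = np.array([[4, 2, 7, 1],
              [9, 3, 5, 8]])

temp = A.argsort(axis=1)
rank = np.empty_like(temp)
# scatter 0..n-1 into each row at the sorted positions (values broadcast)
np.put_along_axis(rank, temp, np.arange(A.shape[1]), axis=1)
print(rank)
```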

Dr Fabio Gori

Apart from the elegance and brevity of the solutions, there is also the question of performance. Here is a little benchmark:

import numpy as np
from scipy.stats import rankdata
l = list(reversed(range(1000)))

%%timeit -n10000 -r5
x = (rankdata(l) - 1).astype(int)
>>> 128 µs ± 2.72 µs per loop (mean ± std. dev. of 5 runs, 10000 loops each)

%%timeit -n10000 -r5
a = np.array(l)
r = a.argsort().argsort()
>>> 69.1 µs ± 464 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)

%%timeit -n10000 -r5
a = np.array(l)
temp = a.argsort()
r = np.empty_like(temp)
r[temp] = np.arange(len(a))
>>> 63.7 µs ± 1.27 µs per loop (mean ± std. dev. of 5 runs, 10000 loops each)
Mischa Lisovyi
  • Good idea, but for a fair comparison, you should use `rankdata(l, method='ordinal') - 1`. – Warren Weckesser Apr 20 '20 at 13:45
  • Also better to use a randomly generated array - `l = random.choices(range(500),k=10000)` - varying sample, size & iterations for fuller picture. Double slicing from @yupbank looks interesting – Peter Jun 02 '21 at 05:11
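Following the fairness point in the comments, a sketch checking that the ordinal variant returns the same integer ranks directly (the input has no ties):

```python
import numpy as np
from scipy.stats import rankdata

l = list(reversed(range(1000)))
a = np.array(l)

r1 = rankdata(l, method='ordinal') - 1  # zero-based ordinal ranks
r2 = a.argsort().argsort()
print((r1 == r2).all())
```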

I tried the above methods, but failed because I had many zeros. Yes, even with floats, duplicate items may be important.

So I wrote a modified 1D solution by adding a tie-checking step:

import numpy as np

def ranks(v):
    t = np.argsort(v)
    r = np.empty(len(v), int)
    r[t] = np.arange(len(v))
    # tie check: give equal values the same (minimum) rank
    for i in range(1, len(r)):
        if v[t[i]] <= v[t[i-1]]:
            r[t[i]] = r[t[i-1]]
    return r

# test it
v = np.array([4, 2, 7, 2, 1])  # example input
print(sorted(zip(ranks(v), v)))

I believe it's as efficient as it can be.
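For what it's worth, the same minimum-rank-with-ties result can be had without an explicit loop via np.searchsorted (a sketch, not the author's code):

```python
import numpy as np

v = np.array([4, 2, 7, 2, 1])
# count of strictly smaller elements == minimum rank among ties
r = np.searchsorted(np.sort(v), v, side='left')
print(r)
```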

h2kyeong

argsort and indexing are complementary operations: indexing with the inverse of a permutation undoes the sort, and indexing is cheaper than a second argsort.

Note that simply indexing twice, `order[order]`, composes the permutation with itself instead of inverting it, and gives wrong results in general (it happens to work for this particular example). The inverse is built with an indexed assignment:

array = numpy.array([4,2,7,1])
order = array.argsort()
ranks = numpy.empty_like(order)
ranks[order] = numpy.arange(array.shape[0])
yupbank

I liked the method by k.rooijers, but as rcoup wrote, repeated numbers are ranked according to array position. This was no good for me, so I modified the version to postprocess the ranks and merge any repeated numbers into a combined average rank:

import numpy as np

a = np.array([4, 2, 7, 2, 1])
r = np.array(a.argsort().argsort(), dtype=float)
f = a == a  # all True: marks values not yet processed
for i in range(len(a)):
    if not f[i]:
        continue
    s = a == a[i]
    ls = np.sum(s)
    if ls > 1:
        tr = np.sum(r[s])
        r[s] = float(tr) / ls
    f[s] = False

print(r)  # [3.  1.5  4.  1.5  0.]

I hope this might help others too; I tried to find another solution to this, but couldn't find any.


A more general version of one of the answers:

In [140]: x = np.random.randn(10, 3)

In [141]: i = np.argsort(x, axis=0)

In [142]: ranks = np.empty_like(i)

In [143]: np.put_along_axis(ranks, i, np.repeat(np.arange(x.shape[0])[:,None], x.shape[1], axis=1), axis=0)

See How to use numpy.argsort() as indices in more than 2 dimensions? to generalize to more dims.
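A small sketch checking the result against the double argsort, column-wise:

```python
import numpy as np

x = np.random.randn(10, 3)
i = np.argsort(x, axis=0)
ranks = np.empty_like(i)
np.put_along_axis(ranks, i,
                  np.repeat(np.arange(x.shape[0])[:, None], x.shape[1], axis=1),
                  axis=0)
# matches ranking each column with argsort twice
print((ranks == x.argsort(axis=0).argsort(axis=0)).all())
```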

mathtick