7

I've recently used both of these functions, and am looking for input from anyone who can speak to the following:

  • do argsort and rankdata differ fundamentally in their purpose?
  • are there performance advantages with one over the other? (specifically: large vs small array performance differences?)
  • what is the memory overhead associated with importing rankdata?

Thanks in advance.

p.s. I could not create the new tags 'argsort' or 'rankdata'. If anyone with sufficient standing feels they should be added to this question, please do.

Boreal Coder
  • 107
  • 1
  • 4
  • 2
    Look at the `rankdata` code. One of the first things it does is `argsort`. But it doesn't simply return that sort. What kind of data are you sorting? – hpaulj Mar 23 '18 at 16:07
  • `argsort` *by itself* does not rank. See [this question and answers](https://stackoverflow.com/questions/5284646/rank-items-in-an-array-using-python-numpy/) for more information about ranking with `argsort` or `rankdata`. – Warren Weckesser Mar 23 '18 at 17:49
  • @hpaulj I was working with the UCI Mushroom Data Set for a machine learning class, and needed to rank features by importance. – Boreal Coder Mar 23 '18 at 18:23
  • @WarrenWeckesser thanks - I had read that thread you supplied (and contributed to) and it seemed to favour argsort over rankdata, though with the caveat that calling argsort twice could have a performance penalty. The original comments are now a few years old - has either function been optimized since to the extent that it's a clear winner for ranking an array? – Boreal Coder Mar 23 '18 at 18:28
  • The accepted answer there shows the efficient way to rank using `argsort`. Other than starting at 0 instead of 1, it provides the same ranking as `rankdata(x, method='ordinal')`. If that is the ranking that you need, then go ahead and use the `argsort` method in the accepted answer--it has a bit less overhead than `rankdata`, so it will be a bit faster. (Note that it will handle repeated values differently.) `rankdata` is great if you don't want to implement your own ranking function, or if you need one of the other ranking methods that it provides. – Warren Weckesser Mar 23 '18 at 19:10

1 Answers1

6

Do argsort and rankdata differ fundamentally in their purpose?

In my opinion, they do slightly. The first gives you the positions of the data if the data was sorted, while the second the rank of the data. The difference can become apparent in the case of ties:

import numpy as np
from scipy import stats

a = np.array([ 5, 0.3,  0.4, 1, 1, 1, 3, 42])
almost_ranks = np.empty_like(a)
almost_ranks[np.argsort(a)] = np.arange(len(a))
print(almost_ranks)
print(almost_ranks+1)
print(stats.rankdata(a))

Results to (notice 3. 4. 5 vs. 4. 4. 4 ):

[6. 0. 1. 2. 3. 4. 5. 7.]
[7. 1. 2. 3. 4. 5. 6. 8.]
[7. 1. 2. 4. 4. 4. 6. 8.]

Are there performance advantages with one over the other? (specifically: large vs small array performance differences?)

Both algorithms seem to me to have the same complexity: O(NlgN) I would expect the numpy implementation to be slightly faster as it has a bit of a smaller overhead, plus it's numpy. But you should test this yourself... Checking the code for scipy.rankdata, it seems to -at present, my python...- be calling np.unique among other functions, so i would guess it would take more in practice...

what is the memory overhead associated with importing rankdata?

Well, you import scipy, if you had not done so before, so it is the overhead of scipy...

ntg
  • 12,950
  • 7
  • 74
  • 95