Related, but different, IMHO:
(1) numpy: most efficient frequency counts for unique values in an array
(2) Using Numpy arrays as lookup tables
Setup:
import numpy as np
from scipy.stats import itemfreq
x = np.array([1, 1, 1, 2, 25000, 2, 2, 5, 1, 1])
fq = itemfreq(x)
fq.astype(int)
array([[ 1, 5],
[ 2, 3],
[ 5, 1],
[25000, 1]])
Now, I'd like to use fq as a lookup table, and do this:
res = magic_lookup_function(fq, x)
res
array([5, 5, 5, 3, 1, 3, 3, 1, 5, 5])
As suggested in (1) and (2), I could convert fq into a python dictionary, then lookup from there, then back to np.array. But is there a cleaner / faster / pure numpy way to do this?
Update: Also, as suggested in (2), I could use bincount, but I fear that might be inefficient if my indices are large, e.g. ~250,000.
Thanks!
UPDATED SOLUTION
As @Jaime pointed out (below), np.unique sorts the array, in O(n log n) time, at best. So I wondered, what happens under the hood with itemfreq
? Turns out itemfreq
sorts the array, which I'll assume is also O(n log n):
In [875]: itemfreq??
def itemfreq(a):
... ... ...
scores = _support.unique(a)
scores = np.sort(scores)
Here's a timeit example
In [895]: import timeit
In [962]: timeit.timeit('fq = itemfreq(x)', setup='import numpy; from scipy.stats import itemfreq; x = numpy.array([ 1, 1, 1, 2, 250000, 2, 2, 5, 1, 1])', number=1000)
Out[962]: 0.3219749927520752
But it seems unnecessary to sort the array. Here's what happens if we do it in pure python.
In [963]: def test(arr):
.....: fd = {}
.....: for i in arr:
.....: fd[i] = fd.get(i,0) + 1
.....: return numpy.array([fd[j] for j in arr])
In [967]: timeit.timeit('test(x)', setup='import numpy; from __main__ import test; x = numpy.array([ 1, 1, 1, 2, 250000, 2, 2, 5, 1, 1])', number=1000)
Out[967]: 0.028257131576538086
Wow, 10x faster!
(At least, in this case, where the array is not too long, but may contain large values.)
And, just for reference, as I suspected, doing this with np.bincount
is inefficient with large values:
In [970]: def test2(arr):
bc = np.bincount(arr)
return bc[arr]
In [971]: timeit.timeit('test2(x)', setup='import numpy; from __main__ import test2; x = numpy.array([ 1, 1, 1, 2, 250000, 2, 2, 5, 1, 1])', number=1000)
Out[971]: 0.0975029468536377