
Related, but different, IMHO:

(1) numpy: most efficient frequency counts for unique values in an array

(2) Using Numpy arrays as lookup tables

Setup:

import numpy as np
from scipy.stats import itemfreq

x = np.array([1,  1,  1,  2,  25000, 2,  2,  5,  1,  1])
fq = itemfreq(x)
fq.astype(int)
array([[    1,     5],
       [    2,     3],
       [    5,     1],
       [25000,     1]])

Now, I'd like to use fq as a lookup table, and do this:

res = magic_lookup_function(fq, x)
res
    array([5, 5, 5, 3, 1, 3, 3, 1, 5, 5])

As suggested in (1) and (2), I could convert fq into a python dictionary, then lookup from there, then back to np.array. But is there a cleaner / faster / pure numpy way to do this?
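Concretely, the dictionary round-trip I have in mind would look roughly like this (just a sketch of that approach):

import numpy as np
from scipy.stats import itemfreq

x = np.array([1, 1, 1, 2, 25000, 2, 2, 5, 1, 1])
fq = itemfreq(x).astype(int)

# value -> count dictionary built from the frequency table
lookup = dict(fq)                       # {1: 5, 2: 3, 5: 1, 25000: 1}
res = np.array([lookup[v] for v in x])  # back to an ndarray
# res -> array([5, 5, 5, 3, 1, 3, 3, 1, 5, 5])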

Update: Also, as suggested in (2), I could use bincount, but I fear that might be inefficient if my indices are large, e.g. ~250,000.
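To spell out that worry: np.bincount allocates a counts array of length x.max() + 1, so a single large value forces a mostly-empty table. A quick sketch of the approach from (2):

import numpy as np

x = np.array([1, 1, 1, 2, 250000, 2, 2, 5, 1, 1])
bc = np.bincount(x)   # allocates an array of length x.max() + 1 = 250001
res = bc[x]           # fancy indexing does the lookup
# res -> array([5, 5, 5, 3, 1, 3, 3, 1, 5, 5])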

Thanks!

UPDATED SOLUTION

As @Jaime pointed out (below), np.unique sorts the array, in O(n log n) time, at best. So I wondered, what happens under the hood with itemfreq? Turns out itemfreq sorts the array, which I'll assume is also O(n log n):

In [875]: itemfreq??

def itemfreq(a):
... ... ...
    scores = _support.unique(a)
    scores = np.sort(scores)

Here's a timeit example

In [895]: import timeit

In [962]: timeit.timeit('fq = itemfreq(x)', setup='import numpy; from scipy.stats import itemfreq; x = numpy.array([ 1,  1,  1,  2, 250000,  2,  2,  5,  1,  1])', number=1000)
Out[962]: 0.3219749927520752

But it seems unnecessary to sort the array. Here's what happens if we do it in pure Python:

In [963]: def test(arr):
   .....:     fd = {}
   .....:     for i in arr:
   .....:         fd[i] = fd.get(i,0) + 1
   .....:     return numpy.array([fd[j] for j in arr])

In [967]: timeit.timeit('test(x)', setup='import numpy; from __main__ import test; x = numpy.array([ 1,  1,  1,  2, 250000,  2,  2,  5,  1,  1])', number=1000)
Out[967]: 0.028257131576538086

Wow, 10x faster!

(At least, in this case, where the array is not too long, but may contain large values.)

And, just for reference, as I suspected, doing this with np.bincount is inefficient with large values:

In [970]: def test2(arr):
    bc = np.bincount(arr)
    return bc[arr]

In [971]: timeit.timeit('test2(x)', setup='import numpy; from __main__ import test2; x = numpy.array([ 1,  1,  1,  2, 250000,  2,  2,  5,  1,  1])', number=1000)
Out[971]: 0.0975029468536377
andros

2 Answers


You can use numpy.searchsorted:

def get_index(arr, val):
    # assumes `arr` is sorted ascending; returns None when `val` is not present
    index = np.searchsorted(arr, val)
    if index < len(arr) and arr[index] == val:
        return index

In [20]: arr = fq[:,:1].ravel()                                                                  

In [21]: arr
Out[21]: array([  1.,   2.,   5.,  25.])

In [22]: get_index(arr, 25)                                                                      
Out[22]: 3

In [23]: get_index(arr, 2)                                                                       
Out[23]: 1

In [24]: get_index(arr, 4)    # returns `None` for item not found.
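For the full frequency lookup from the question, the same idea can be vectorized over the whole array. A sketch, assuming fq is sorted by its first column (as itemfreq returns it) and that every value in x is present in the table:

import numpy as np

fq = np.array([[1, 5], [2, 3], [5, 1], [25000, 1]])   # value/count pairs, sorted by value
x = np.array([1, 1, 1, 2, 25000, 2, 2, 5, 1, 1])

idx = np.searchsorted(fq[:, 0], x)   # position of each element of x in the value column
res = fq[:, 1][idx]                  # pick the corresponding counts
# res -> array([5, 5, 5, 3, 1, 3, 3, 1, 5, 5])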
Ashwini Chaudhary
  • Ah, nice. Although, how efficient is `searchsorted`? It seems to me it would be O(log n) at best (but I could be wrong here). I'd ideally like to find something that is O(1) like python dictionary lookup, but without the overhead of converting back and forth. (This might not be possible...) – andros Mar 25 '14 at 07:19
  • @andros Yes, this is `O(log N)`, and I forgot to mention it also expects the data to be sorted. – Ashwini Chaudhary Mar 25 '14 at 07:21
  • @andros Note however that it's `O(log N)` where the search is done by efficient C code on C types. It wouldn't be strange if `searchsorted` took about the same time as a dictionary lookup even for quite big values of `N`. I'd consider trying and profiling it as a possible solution even though it's asymptotically worse, because it might become worse for such big values of `N` that you don't even consider. – Bakuriu Mar 25 '14 at 07:30
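Following Bakuriu's suggestion to profile rather than argue from asymptotics, a minimal comparison harness might look like this (a sketch; fd is the hypothetical dict built from the fq table, and the numbers will of course depend on your data):

import timeit
import numpy as np

x = np.array([1, 1, 1, 2, 25000, 2, 2, 5, 1, 1])
fq = np.array([[1, 5], [2, 3], [5, 1], [25000, 1]])
fd = dict(fq)          # plain-dict baseline: O(1) lookups, but a Python-level loop
keys = fq[:, 0]        # sorted keys for searchsorted: O(log N) lookups in C

t_dict = timeit.timeit(lambda: np.array([fd[v] for v in x]), number=10000)
t_srch = timeit.timeit(lambda: fq[:, 1][np.searchsorted(keys, x)], number=10000)
print(t_dict, t_srch)  # compare on data of a realistic size before deciding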

Because your look-up table is not just any look-up table, but a list of frequencies, you may want to consider the following option:

>>> x = np.array([1,  1,  1,  2,  25, 2,  2,  5,  1,  1])
>>> x_unq, x_idx = np.unique(x, return_inverse=True)
>>> np.take(np.bincount(x_idx), x_idx)
array([5, 5, 5, 3, 1, 3, 3, 1, 5, 5], dtype=int64)

Even if your look-up table is more complicated, i.e.:

>>> lut = np.array([[ 1, 10],
...                 [ 2,  9],
...                 [ 5,  8],
...                 [25,  7]])

if you can afford to do the np.unique call (it sorts the array, so it is n log n) with return_inverse, you can work with small consecutive integers as indices, which usually makes things easier, e.g. using np.searchsorted you would do:

>>> np.take(lut[:, 1], np.take(np.searchsorted(lut[:, 0], x_unq), x_idx))
array([10, 10, 10,  9,  7,  9,  9,  8, 10, 10])
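Applied to the frequency table from the question, the same pattern might look like this (again a sketch, assuming fq is sorted by its first column and covers every value in x):

import numpy as np

x = np.array([1, 1, 1, 2, 25000, 2, 2, 5, 1, 1])
fq = np.array([[1, 5], [2, 3], [5, 1], [25000, 1]])

x_unq, x_idx = np.unique(x, return_inverse=True)   # unique values + inverse mapping
pos = np.searchsorted(fq[:, 0], x_unq)             # one binary search per unique value
res = np.take(fq[:, 1], np.take(pos, x_idx))       # expand back to the original shape
# res -> array([5, 5, 5, 3, 1, 3, 3, 1, 5, 5])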
Jaime