3

Suppose I have a numpy array of arrays of length 4:

In [41]: arr
Out[41]:
array([[  1,  15,   0,   0],
       [ 30,  10,   0,   0],
       [ 30,  20,   0,   0],
       ...,
       [104, 139, 146,  75],
       [  9,  11, 146,  74],
       [  9, 138, 146,  75]], dtype=uint8)

I want to know:

  1. Is it true that arr includes [1, 2, 3, 4]?
  2. If it true what index of [1, 2, 3, 4] in arr?

I want to find out it as fast as it possible.

Suppose arr contains 8550420 elements. I've checked several methods with timeit:

  1. Just for checking without getting index: any(all([1, 2, 3, 4] == elt) for elt in arr). It tooks 15.5 sec in average on 10 runs on my machine
  2. for-based solution:

    for i,e in enumerate(arr): if list(e) == [1, 2, 3, 4]: break

    It tooks about 5.7 secs in average

Does exists some faster solutions, for example numpy based?

petRUShka
  • 9,812
  • 12
  • 61
  • 95
  • 1
    if you're not concerned about extra memory you could create dictionary where keys are tuples created from your lists – Roman Pekar Jul 23 '13 at 13:12
  • But will it save some time? It will take time to make dictionary of tuples from my array. – petRUShka Jul 23 '13 at 13:14
  • well it depends, if you perform your search more than once, than it definitely will save you some time – Roman Pekar Jul 23 '13 at 13:15
  • I think this answer might help - http://stackoverflow.com/a/17797247/2452770. Using `where` in conjunction with `all` is probably what gives you the fastest find. – A.Wan Jul 23 '13 at 13:37

2 Answers2

6

This is Jaime's idea, I just love it:

import numpy as np

def asvoid(arr):
    """View the array as dtype np.void (bytes)
    This collapses ND-arrays to 1D-arrays, so you can perform 1D operations on them.
    https://stackoverflow.com/a/16216866/190597 (Jaime)"""    
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def find_index(arr, x):
    arr_as1d = asvoid(arr)
    x = asvoid(x)
    return np.nonzero(arr_as1d == x)[0]


arr = np.array([[  1,  15,   0,   0],
                [ 30,  10,   0,   0],
                [ 30,  20,   0,   0],
                [1, 2, 3, 4],
                [104, 139, 146,  75],
                [  9,  11, 146,  74],
                [  9, 138, 146,  75]], dtype='uint8')

arr = np.tile(arr,(1221488,1))
x = np.array([1,2,3,4], dtype='uint8')

print(find_index(arr, x))

yields

[      3      10      17 ..., 8550398 8550405 8550412]

The idea is to view each row of the array as a string. For example,

In [15]: x
Out[15]: 
array([^A^B^C^D], 
      dtype='|V4')

The strings look like garbage, but they are really just the underlying data in each row viewed as bytes. You can then compare arr_as1d == x to find which rows equal x.


There is another way to do it:

def find_index2(arr, x):
    return np.where((arr == x).all(axis=1))[0]

but it turns out to be not as fast:

In [34]: %timeit find_index(arr, x)
1 loops, best of 3: 209 ms per loop

In [35]: %timeit find_index2(arr, x)
1 loops, best of 3: 370 ms per loop
Community
  • 1
  • 1
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • 1
    I don't love it. Views can be a very nice trick, but you have to be careful about contiguiuty and if you use this on a float array you get `-0. != 0.` (does not matter for these uint8, but...). Is this even any faster then just using `arr.all` for each row? The view trick makes more sense to me if you want to use sorting methods because you need to find the index often. – seberg Jul 23 '13 at 13:51
  • Well, loving is perhaps too strong a word, but it seems to fill a need for which there is sometimes no other option. And in this case, it is faster than `np.all(..., axis=1)`. – unutbu Jul 23 '13 at 13:55
  • Sorry, there is a ascontiguous array in there anyway, so that part is fine :) – seberg Jul 23 '13 at 13:57
  • No, you were correct; I just added it. And thanks for the warning regarding `-0. != 0.` – unutbu Jul 23 '13 at 13:58
0

If you perform search more than one time and you don't mind to use extra memory, you can create set from you array (I'm using list here, but it's almost the same code):

>>> elem = [1, 2, 3, 4]    
>>> elements = [[  1,  15,   0,   0], [ 30,  10,   0,   0], [1, 2, 3, 4]]
>>> index = set([tuple(x) for x in elements])
>>> True if tuple(elem) in index else False
True
Roman Pekar
  • 107,110
  • 28
  • 195
  • 197
  • Don't even need the conditional statement. `tuple(... index` should be fine on itself. ( I can't check though; on my phone) – TerryA Jul 23 '13 at 13:26