
I have a numpy array, for example:

a = np.array([[1,2],
              [3,4],
              [6,4],
              [5,3],
              [3,5]])

and I also have a set

b = {(1,2), (6,4), (9,9)}

I want to find the indices of the rows of `a` that exist in set `b`; for this example the result is

[0, 2]

but I currently use a for loop to do this. Is there a convenient way to do this job avoiding the explicit loop? The for-loop method I used:

record = []
for i in range(a.shape[0]):
    if (a[i, 0], a[i, 1]) in b:
        record.append(i)
maple
  • as a list comprehension... you can unpack it to make it more readable... np.vstack([a[np.sum(np.where(a == i, 1, 0), axis=1) == 2] for i in b]) ... but basically you want to find out where the entries in b are present in a, when b is changed to an array b = np.array([(1,2),(6,4)]) –  Aug 30 '16 at 04:57
  • Are all elements from that set guaranteed to be somewhere in `a`? – Divakar Aug 30 '16 at 05:03
  • @Divakar No, they aren't – maple Aug 30 '16 at 05:31
  • So, let's say there's one more element `(9,9)` in the input set. What must be the output then? Why not add such a case into the sample in the question? – Divakar Aug 30 '16 at 05:34
  • @Divakar it won't change the answer. I have updated the question, but I don't see why it matters? – maple Aug 30 '16 at 05:39
  • Do you need this done once for each array element, or possibly many times? Please state how you actually implemented this... (did you use the `in` operator or did you implement the set lookup using a for loop?) – moooeeeep Aug 30 '16 at 06:33
  • Yes, I use the `in` operator @moooeeeep, and I have added my implementation – maple Aug 30 '16 at 07:24

4 Answers


You can use `filter`:

In [8]: a = np.array([[1,2],
              [3,4],
              [6,4],
              [5,3],
              [3,5]])

In [9]: b = {(1,2),(6,4)}

In [10]: filter(lambda x: tuple(a[x]) in b, range(len(a)))
Out[10]: [0, 2]
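Note that in Python 3 `filter` returns a lazy iterator rather than a list (the `Out[10]` above is Python 2 output), so it must be materialised with `list()`. A minimal Python 3 version of this answer:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4],
              [6, 4],
              [5, 3],
              [3, 5]])
b = {(1, 2), (6, 4), (9, 9)}

# filter() is lazy in Python 3; wrap it in list() to get the indices
record = list(filter(lambda i: tuple(a[i]) in b, range(len(a))))
print(record)  # [0, 2]
```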
Nehal J Wani

First off, convert the set to a NumPy array -

b_arr = np.array(list(b))

Then, based on this post, there are three approaches; let's use the second one for efficiency -

dims = np.maximum(a.max(0),b_arr.max(0)) + 1
a1D = np.ravel_multi_index(a.T,dims)
b1D = np.ravel_multi_index(b_arr.T,dims)    
out = np.flatnonzero(np.in1d(a1D,b1D))

Sample run -

In [89]: a
Out[89]: 
array([[1, 2],
       [3, 4],
       [6, 4],
       [5, 3],
       [3, 5]])

In [90]: b
Out[90]: {(1, 2), (6, 4), (9, 9)}

In [91]: b_arr = np.array(list(b))

In [92]: dims = np.maximum(a.max(0),b_arr.max(0)) + 1
    ...: a1D = np.ravel_multi_index(a.T,dims)
    ...: b1D = np.ravel_multi_index(b_arr.T,dims)    
    ...: out = np.flatnonzero(np.in1d(a1D,b1D))
    ...: 

In [93]: out
Out[93]: array([0, 2])
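As an aside (my note, not part of the original answer): on NumPy 1.13+, `np.isin` is the documented successor to `np.in1d`, so the same row-linearisation approach can be written as:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [6, 4], [5, 3], [3, 5]])
b = {(1, 2), (6, 4), (9, 9)}
b_arr = np.array(list(b))

# Collapse each 2-element row into a single unique integer index,
# then do a 1-D membership test on those integers.
dims = np.maximum(a.max(0), b_arr.max(0)) + 1
a1D = np.ravel_multi_index(a.T, dims)
b1D = np.ravel_multi_index(b_arr.T, dims)
out = np.flatnonzero(np.isin(a1D, b1D))  # np.isin supersedes np.in1d
print(out)  # [0 2]
```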
Divakar
  • Will this method run slowly? It seems it compares all of a's elements against b's elements. Would it be faster to use a set, since lookup is hash-based? – maple Aug 30 '16 at 05:37
  • @maple Updated with a more efficient one. This should be much faster. – Divakar Aug 30 '16 at 05:44

For reference, a straightforward list comprehension (loop) answer:

In [108]: [i for i,v in enumerate(a) if tuple(v) in b]
Out[108]: [0, 2]

basically the same speed as the filter approach:

In [111]: timeit [i for i,v in enumerate(a) if tuple(v) in b]
10000 loops, best of 3: 24.5 µs per loop

In [114]: timeit list(filter(lambda x: tuple(a[x]) in b, range(len(a))))
10000 loops, best of 3: 29.7 µs per loop

But this is a toy example, so timings aren't meaningful.

If a wasn't already an array, these list approaches would be faster than the array ones, due to the overhead of creating arrays.

There are some numpy set operations, but they work with 1d arrays. We can get around that by viewing the 2d arrays as 1d structured arrays.

In [117]: a.view('i,i')
Out[117]: 
array([[(1, 2)],
       [(3, 4)],
       [(6, 4)],
       [(5, 3)],
       [(3, 5)]], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])
In [119]: np.array(list(b),'i,i')
Out[119]: 
array([(1, 2), (6, 4), (9, 9)], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])

There is a version of this using np.void, but it's easier to remember and play with this 'i,i' dtype.

So this works:

In [123]: np.nonzero(np.in1d(a.view('i,i'),np.array(list(b),'i,i')))[0]
Out[123]: array([0, 2], dtype=int32)

but it is much slower than the iterations:

In [124]: timeit np.nonzero(np.in1d(a.view('i,i'),np.array(list(b),'i,i')))[0]
10000 loops, best of 3: 153 µs per loop

As discussed in other recent union questions, np.in1d uses several strategies. One is based on broadcasting and where. The other uses unique, concatenation, sorting and differences.

A broadcasting solution (yes, it's messy) - but faster than in1d.

In [150]: timeit np.nonzero((a[:,:,None,None]==np.array(list(b))[:,:]).any(axis=-1).any(axis=-1).all(axis=-1))[0]
10000 loops, best of 3: 52.2 µs per loop
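For comparison, a sketch (mine, not from the answer) of the same broadcasting idea with one fewer comparison axis, matching whole rows directly:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [6, 4], [5, 3], [3, 5]])
b_arr = np.array([(1, 2), (6, 4), (9, 9)])

# a[:, None, :] has shape (5, 1, 2); b_arr has shape (3, 2).
# Broadcasting compares every row of a against every row of b_arr.
matches = (a[:, None, :] == b_arr).all(axis=-1)   # (5, 3): a-row i equals b-row j
idx = np.nonzero(matches.any(axis=-1))[0]
print(idx)  # [0 2]
```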
hpaulj

A one-line solution using a list comprehension:

In [62]: a = np.array([[1,2],
    ...:               [3,4],
    ...:               [6,4],
    ...:               [5,3],
    ...:               [3,5]])

In [63]: b = set(((1,2),(6,4),(9,9)))
In [64]: np.where([tuple(e) in b for e in a])[0]
Out[64]: array([0, 2])
bougui