0

I have two matrices:

a = [[1,3,4],[2,5,3],[2,4,6],[6,5,3]] 
b = [[2,4,5],[2,4,6],[1,3,4]] 

and I want to choose [2,4,6],[1,3,4] in b, which is in a.

Since a and b are large,

for v in b:
    if  v in a:

is expensive.

Can anybody tell me the best way to do this ?

Lingbing
  • 30
  • 5
  • 1
    So, what `a` and `b` _really_ are: Python lists of lists, numpy vectors or numpy matrices? (There are three different data types, and for large `a` and `b` the conversion may be more expensive than the computations.) – DYZ May 22 '17 at 01:06
  • 1
    Is `b` significantly smaller than `a`? If so iteration like this on `b` may be the best option. Another route to test is sets. Another variable is the relative size of the sublists. – hpaulj May 22 '17 at 01:55
  • http://stackoverflow.com/questions/40055835/removing-elements-from-an-array-that-are-in-another-array has some time tests for a variant on this problem – hpaulj May 22 '17 at 02:13

1 Answers1

1

What you want is an equivalent of numpy.in1d, for 2-dimensional matrices. I wrote such a function a while ago

def in2d(arr1, arr2):
    """Generalisation of numpy.in1d to 2D arrays"""

    assert arr1.dtype == arr2.dtype

    arr1_view = np.ascontiguousarray(arr1).view(np.dtype((np.void,
         arr1.dtype.itemsize * arr1.shape[1])))
    arr2_view = np.ascontiguousarray(arr2).view(np.dtype((np.void,
         arr2.dtype.itemsize * arr2.shape[1])))
    intersected = np.in1d(arr1_view, arr2_view)
    return intersected.view(np.bool).reshape(-1)

Explanation on how it works can be found here. You can use the function like this

In [56]: a = np.array([[1,3,4],[2,5,3],[2,4,6],[6,5,3]])
In [57]: b = np.array([[2,4,5],[2,4,6],[1,3,4]])

In [58]: in2d(b,a)
Out[58]: array([False,  True,  True], dtype=bool) 

It returns an array of boolean of which elements of b are in a. Or vice versa

In [59]: in2d(a,b)
Out[59]: array([ True, False,  True, False], dtype=bool)

Indexing a with this boolean array gives you exactly what you want

In [60]: a[in2d(a,b),:]
Out[60]: 
array([[1, 3, 4],
       [2, 4, 6]])

Note that your solution (posted below), is incorrect and does not do what you think it does, in that if v in a searches all nested arrays/lists elements. So the following comparison is not fair, nevertheless, consider

def for_loop_and_compare(a,b):
    return np.array([v for v in b if v in a])

And the timings

In [61]: a=np.random.randint(0,100,(10000,3))
In [62]: b=np.random.randint(0,100,(1000,3))
In [63]: %timeit for_loop_and_compare(a,b)
10 loops, best of 3: 79 ms per loop
In [64]: %timeit in2d(a,b)
100 loops, best of 3: 3.7 ms per loop
Community
  • 1
  • 1
romeric
  • 2,325
  • 3
  • 19
  • 35
  • Even assuming that `a` and `b` are already numpy arrays, your function takes 4 times longer than the OP's vanilla solution. Conversion from a nested list to an array adds more overhead. – DYZ May 22 '17 at 01:57
  • For large numpy arrays, this function is much more efficient than OP's solution. – romeric May 22 '17 at 02:01
  • I tried it for an array with 5000 (sic) elements, preconverted to numpy. – DYZ May 22 '17 at 02:03
  • Using a view to turn the problem into one that `in1d` or `unique` can handle is a common answer for these `unique rows` or `row intersection` questions. But the relative speed depends heavily on the size of the problem, so it's hard to generalize which is better. It's a good idea to look at the `in1d` code to see how solves the problem. The array structure is not designed for fast searches. – hpaulj May 22 '17 at 02:08
  • Can you post your benchmark? What you do within `for v in b:` really matters – romeric May 22 '17 at 02:09
  • 1
    Also note with OP's data `if v in a:` does not do what one might think it does – romeric May 22 '17 at 02:19
  • @DYZ I have posted the timings now. – romeric May 22 '17 at 02:50
  • Thanks a lot ! this in2d is just what I want , it works perfectly. – Lingbing May 22 '17 at 04:13