38

Say I have these 2D arrays A and B.

How can I remove elements from A that are in B. (Complement in set theory: A-B)

A=np.asarray([[1,1,1], [1,1,2], [1,1,3], [1,1,4]])
B=np.asarray([[0,0,0], [1,0,2], [1,0,3], [1,0,4], [1,1,0], [1,1,1], [1,1,4]])
#output = [[1,1,2], [1,1,3]]

To be more precise, I would like to do something like this.

data = some numpy array
label = some numpy array
A = np.argwhere(label==0) #[[1 1 1], [1 1 2], [1 1 3], [1 1 4]]
B = np.argwhere(data>1.5) #[[0 0 0], [1 0 2], [1 0 3], [1 0 4], [1 1 0], [1 1 1], [1 1 4]]
out = np.argwhere(label==0 and data>1.5) #[[1 1 2], [1 1 3]]
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
Jee Seok Yoon
  • 4,716
  • 9
  • 32
  • 47

5 Answers5

36

there is an easy solution with a list comprehension,

A = [i for i in A if i not in B]

Result

[[1, 1, 2], [1, 1, 3]]

List comprehension is not removing the elements from the array, it's just reassigning - if you want to remove the elements, use this method:

for i in B:
     if i in A:
     A.remove(i)
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
Rahul K P
  • 15,740
  • 4
  • 35
  • 52
18

Based on this solution to Find the row indexes of several values in a numpy array, here's a NumPy based solution with less memory footprint and could be beneficial when working with large arrays -

dims = np.maximum(B.max(0),A.max(0))+1
out = A[~np.in1d(np.ravel_multi_index(A.T,dims),np.ravel_multi_index(B.T,dims))]

Sample run -

In [38]: A
Out[38]: 
array([[1, 1, 1],
       [1, 1, 2],
       [1, 1, 3],
       [1, 1, 4]])

In [39]: B
Out[39]: 
array([[0, 0, 0],
       [1, 0, 2],
       [1, 0, 3],
       [1, 0, 4],
       [1, 1, 0],
       [1, 1, 1],
       [1, 1, 4]])

In [40]: out
Out[40]: 
array([[1, 1, 2],
       [1, 1, 3]])

Runtime test on large arrays -

In [107]: def in1d_approach(A,B):
     ...:     dims = np.maximum(B.max(0),A.max(0))+1
     ...:     return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
     ...:                     np.ravel_multi_index(B.T,dims))]
     ...: 

In [108]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(1000,3))
     ...: A = np.random.randint(0,9,(100,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...: 

Timings with broadcasting based solutions -

In [109]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
100 loops, best of 3: 4.64 ms per loop # @Kasramvd's soln

In [110]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100 loops, best of 3: 3.66 ms per loop

Timing with less memory footprint based solution -

In [111]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 231 µs per loop

Further performance boost

in1d_approach reduces each row by considering each row as an indexing tuple. We can do the same a bit more efficiently by introducing matrix-multiplication with np.dot, like so -

def in1d_dot_approach(A,B):
    cumdims = (np.maximum(A.max(),B.max())+1)**np.arange(B.shape[1])
    return A[~np.in1d(A.dot(cumdims),B.dot(cumdims))]

Let's test it against the previous on much larger arrays -

In [251]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(10000,3))
     ...: A = np.random.randint(0,9,(1000,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...: 

In [252]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 1.28 ms per loop

In [253]: %timeit in1d_dot_approach(A, B)
1000 loops, best of 3: 1.2 ms per loop
Community
  • 1
  • 1
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Your in1d_approach takes 30 sec, in1d_dot_approach takes 45 sec in my program. My numpy array uses dtype=np.uint8. So I tested it with your exact code with dtype=np.uint8 parameter for A, B. Dot function gives me 567 nano seconds, original takes me 539 nano seconds. Any explanation why smaller data type gives the original function better timing? – Jee Seok Yoon Oct 27 '16 at 10:00
  • 1
    @JeeSeokYoon Well simply because less precision datatypes would occupy less memory in terms of their binary bits and thus would incur less memory occupancy and that in most cases translates to faster processing as its processing less data because lesser number of binary bits are used to represent each number. You gotta keep in mind that at the lowest level, CPUs process binary data. Hope that made sense! – Divakar Oct 27 '16 at 10:03
  • I'm asking why is in1d_approach(A,B) slower than in1d_dot_approach(A, B) when dealing with floating point, but faster for integer? Is it just how numpy was built? Why does matrix multiplication perform better with floating point/ worse with integers (compared to other methods)? – Jee Seok Yoon Oct 27 '16 at 11:18
  • Pay attention to the possibility of `dims = np.maximum(B.max(0),A.max(0))+1` overrun. You might have to convert the `dims` array to a longer `dtype` before the `+1` addition. – SzieberthAdam May 27 '22 at 10:09
13

Here is a Numpythonic approach with broadcasting:

In [83]: A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
Out[83]: 
array([[1, 1, 2],
       [1, 1, 3]])

Here is a timeit with other answer:

In [90]: def cal_diff(A, B):
   ....:     A_rows = A.view([('', A.dtype)] * A.shape[1])
   ....:     B_rows = B.view([('', B.dtype)] * B.shape[1])
   ....:     return np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])
   ....: 

In [93]: %timeit cal_diff(A, B)
10000 loops, best of 3: 54.1 µs per loop

In [94]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
100000 loops, best of 3: 9.41 µs per loop

# Even better with Divakar's suggestion
In [97]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100000 loops, best of 3: 7.41 µs per loop

Well, if you are looking for a faster way you should looking for ways that reduce the number of comparisons. In this case (without considering the order) you can generate a unique number from your rows and compare the numbers which can be done with summing the items power of two.

Here is the benchmark with Divakar's in1d approach:

In [144]: def in1d_approach(A,B):
   .....:         dims = np.maximum(B.max(0),A.max(0))+1
   .....:         return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
   .....:                          np.ravel_multi_index(B.T,dims))]
   .....: 

In [146]: %timeit in1d_approach(A, B)
10000 loops, best of 3: 23.8 µs per loop

In [145]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
10000 loops, best of 3: 20.2 µs per loop

You can use np.diff to get the an order independent result:

In [194]: B=np.array([[0, 0, 0,], [1, 0, 2,], [1, 0, 3,], [1, 0, 4,], [1, 1, 0,], [1, 1, 1,], [1, 1, 4,], [4, 1, 1]])

In [195]: A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
Out[195]: 
array([[1, 1, 2],
       [1, 1, 3]])

In [196]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
10000 loops, best of 3: 30.7 µs per loop

Benchmark with Divakar's setup:

In [198]: B = np.random.randint(0,9,(1000,3))

In [199]: A = np.random.randint(0,9,(100,3))

In [200]: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)

In [201]: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)

In [202]: A[A_idx] = B[B_idx]

In [203]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
10000 loops, best of 3: 137 µs per loop

In [204]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
10000 loops, best of 3: 112 µs per loop

In [205]: %timeit in1d_approach(A, B)
10000 loops, best of 3: 115 µs per loop

Timing with larger arrays (Divakar's solution is slightly faster):

In [231]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
1000 loops, best of 3: 1.01 ms per loop

In [232]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
1000 loops, best of 3: 880 µs per loop

In [233]:  %timeit in1d_approach(A, B)
1000 loops, best of 3: 807 µs per loop
Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • 1
    Nice! Was about to post the same! – Divakar Oct 15 '16 at 07:08
  • 2
    Actually using `equality` might be better on performance : `A[~((A[:,None,:] == B).all(-1)).any(1)]`. – Divakar Oct 15 '16 at 07:09
  • Great Solution :) – Rahul K P Oct 15 '16 at 07:09
  • @Divakar Indeed, that's Nice! – Mazdak Oct 15 '16 at 07:14
  • Nice Answer! Kindly answer to this too regarding the speedup http://stackoverflow.com/questions/40056275/how-does-numpy-broadcasting-perform-faster – R. S. Nikhil Krishna Oct 15 '16 at 07:22
  • Could you do a verification that the result with the new approach is same as with the other approaches? Use the setup I used for that? – Divakar Oct 15 '16 at 08:42
  • Nice! Since the timings are very close and in `micro-sec` range, test them out on larger arrays : `B = np.random.randint(0,9,(10000,3))` and `A = np.random.randint(0,10000,(1000,3))`? Also I have added an improved version if you would like to add to those tests. – Divakar Oct 15 '16 at 09:16
  • Oh wait, the new approach would give wrong answer if let's say I flip a row. So, do : `A[1] = B[5][::-1]` after setting up the input arrays and then run it. – Divakar Oct 15 '16 at 09:26
  • @Divakar I mentioned that it's not depends on order, otherwise you should use second approach with `diff`. – Mazdak Oct 15 '16 at 09:31
7

If you want to do it the numpy way,

import numpy as np

A = np.array([[1, 1, 1,], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.array([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4], [1, 1, 0], [1, 1, 1], [1, 1, 4]])
A_rows = A.view([('', A.dtype)] * A.shape[1])
B_rows = B.view([('', B.dtype)] * B.shape[1])

diff_array = np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])

As @Rahul suggested, for a non numpy easy solution,

diff_array = [i for i in A if i not in B]
R. S. Nikhil Krishna
  • 3,962
  • 1
  • 13
  • 27
4

Another non-numpy solution:

[i for i in A if i not in B]
Yihe
  • 4,094
  • 2
  • 19
  • 21