
I have two numpy arrays that have overlapping rows:

import numpy as np

a = np.array([[1,2], [1,5], [3,4], [3,5], [4,1], [4,6]])
b = np.array([[1,5], [3,4], [4,6]])

You can assume that:

  1. the rows are sorted
  2. the rows within each array are unique
  3. array b is always a subset of array a

I would like to get an array that contains all rows of a that are not in b.

That is:

[[1 2]
 [3 5]
 [4 1]]

Considering that a and b can be very, very large, what is the most efficient method for solving this problem?

slaw
  • You mention the rows are sorted. Is the full array also sorted column-wise? – mtrw Sep 04 '16 at 21:56
  • Other recent row set questions: (intersection) http://stackoverflow.com/questions/39218768/find-numpy-vectors-in-a-set-quickly/39220519#39220519, (union) http://stackoverflow.com/questions/39083549/python-2-d-array-get-the-function-as-np-unique-or-union1d – hpaulj Sep 04 '16 at 21:57
  • Padraic - I think there are better duplicates than that. It dates from 2012, and there have been many questions about row sets or unique rows since then. – hpaulj Sep 04 '16 at 22:20
  • @hpaulj, feel free to reopen and re-dupe but if you look at the answer below it seems to be almost a literal copy of this highest rated answer http://stackoverflow.com/a/11903368/2141635 from the dupe. – Padraic Cunningham Sep 05 '16 at 00:45
  • How many columns? Always 2? – hpaulj Sep 05 '16 at 12:41
  • Yes, always two columns – slaw Sep 06 '16 at 05:23

1 Answer


Here's a possible solution to your problem:

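import numpy as np

a = np.array([[1, 2], [3, 4], [3, 5], [4, 1], [4, 6]])
b = np.array([[3, 4], [4, 6]])

# View each row as one structured element so that whole rows are compared
a1_rows = a.view([('', a.dtype)] * a.shape[1])
a2_rows = b.view([('', b.dtype)] * b.shape[1])
# Take the set difference of the rows, then convert back to a plain 2-D array
c = np.setdiff1d(a1_rows, a2_rows).view(a.dtype).reshape(-1, a.shape[1])
print(c)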

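I think `numpy.setdiff1d` is the right choice here: once each row is viewed as a single structured element, it compares whole rows rather than individual values.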
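The snippet above uses slightly different arrays from the ones in the question, so here is a minimal sketch of the same view-plus-`setdiff1d` idea applied to the question's `a` and `b`. The helper name `rows_not_in` is made up for illustration, and the `np.ascontiguousarray` calls are an added assumption: the row view only works on C-contiguous arrays that share a dtype.

```python
import numpy as np

def rows_not_in(a, b):
    """Return the rows of `a` that do not appear in `b` (hypothetical helper)."""
    # Assumption: both inputs are 2-D with the same number of columns.
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b, dtype=a.dtype)
    # One structured element per row, so whole rows are compared.
    row_dtype = [('', a.dtype)] * a.shape[1]
    a_rows = a.view(row_dtype).ravel()
    b_rows = b.view(row_dtype).ravel()
    diff = np.setdiff1d(a_rows, b_rows)
    # Convert the structured result back to a plain 2-D array.
    return diff.view(a.dtype).reshape(-1, a.shape[1])

a = np.array([[1, 2], [1, 5], [3, 4], [3, 5], [4, 1], [4, 6]])
b = np.array([[1, 5], [3, 4], [4, 6]])
result = rows_not_in(a, b)
print(result)
```

Note that `setdiff1d` returns its result sorted, which matches the input order here only because the rows of `a` are already sorted.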

BPL
  • Use the `assume_unique=True` parameter if applicable. Also look at the code for this function and `np.in1d`. It might give ideas on how to do things faster. Other `row set` questions have proposed other ways of converting `a` and `b` to 1d for use in these functions. – hpaulj Sep 05 '16 at 16:54
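One possible reading of the comment above, as a rough sketch rather than a benchmarked solution: collapse each two-column row into a single integer key, then use `np.isin` with `assume_unique=True` and `invert=True` to build a mask. The helper name `rows_not_in_keyed` and the key encoding are assumptions made for illustration. The sketch relies on there being exactly two columns of non-negative integers, on `b` being a non-empty subset of `a`, and on values small enough that the key does not overflow `int64`.

```python
import numpy as np

def rows_not_in_keyed(a, b):
    """Rows of `a` not present in `b`, via 1-D integer keys (hypothetical helper)."""
    # Assumption: two columns of non-negative integers, and b is a subset of a,
    # so the largest second-column value of a bounds b as well.
    base = int(a[:, 1].max()) + 1
    keys_a = a[:, 0].astype(np.int64) * base + a[:, 1]
    keys_b = b[:, 0].astype(np.int64) * base + b[:, 1]
    # Rows are unique within each array, so the keys are unique too.
    keep = np.isin(keys_a, keys_b, assume_unique=True, invert=True)
    return a[keep]

a = np.array([[1, 2], [1, 5], [3, 4], [3, 5], [4, 1], [4, 6]])
b = np.array([[1, 5], [3, 4], [4, 6]])
remaining = rows_not_in_keyed(a, b)
print(remaining)
```

Because the mask indexes `a` directly, the result keeps the original row order of `a` and needs no view or reshape. Whether it is actually faster on very large inputs would have to be measured.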