
I have two matrices in MATLAB, A and B, with the same number of columns but different numbers of rows. B has fewer rows than A; in fact, the rows of B are a subset of the rows of A.

How can I efficiently remove from A every row whose values in columns 1 and 2 equal the values in columns 1 and 2 of some row of B?

At the moment I'm doing this:

for k = 1:size(B, 1)
     A(find((A(:,1) == B(k,1) & A(:,2) == B(k,2))), :) = [];
end

and Matlab complains that this is inefficient and that I should try to use any, but I'm not sure how to do it with any. Can someone help me out with this? =)

I tried this, but it doesn't work:

A(any(A(:,1) == B(:,1) & A(:,2) == B(:,2), 2), :) = [];

It gives the following error:

Error using  == 
Matrix dimensions must agree.

Example of what I want:

[Two images, not preserved: the example matrices and the expected results A-B and A-C]

A-B in the results means that the rows of B are removed from A. The same goes with A-C.

bla
jjepsuomi
  • `setdiff` is the best solution, but to convert your first try to `any` (*keeping* your loop), this is what Matlab is suggesting (you'd actually want `all` and not `any` in your case): `A(all(A == B(k,:),2), :) = [];` – Dan Jun 19 '14 at 06:36
  • +1 Thank you very much @Dan I will try all the solutions and post the performance times =) – jjepsuomi Jun 19 '14 at 06:42
  • By the way, I didn't realize you were only comparing the first two columns, so update my last comment to `A(all(A(:,1:2) == B(k,1:2),2), :) = [];` – Dan Jun 19 '14 at 06:52
  • Thank you everybody for your fine answers =) The original running time (with my data) was 0.198072 seconds. By using the `bsxfun` approaches I got a running time of approximately 0.007 seconds. By using `setdiff(A(:,1:2),B(:,1:2),'rows')` I got a running time of 0.004120 seconds. – jjepsuomi Jun 19 '14 at 06:54
  • @jjepsuomi Hope you can do some benchmarks on bigger data sizes too; it would be interesting to see those results. – Divakar Jun 19 '14 at 06:57
  • +1 @Divakar I will try with different data sets and post my results =) It will take a few minutes =) – jjepsuomi Jun 19 '14 at 06:59
  • @Divakar results coming in soon =) – jjepsuomi Jun 19 '14 at 07:26
  • Don't forget the `ismember` solution too... – bla Jun 19 '14 at 07:27
  • @jjepsuomi Added one more `bsxfun` approach in my solution, so do you mind adding that too to your benchmark results? :) – Divakar Jun 19 '14 at 07:36
  • Hi @Divakar I added the results for my datasets =) Okay I can add the one more `bsxfun` approach, just a sec =) – jjepsuomi Jun 19 '14 at 07:46
  • Hi @Divakar I added your second approach as well =) It seems `setdiff` is beating the heck out of all the others for some reason (with the dataset I have available). Maybe the results would be different if I had much larger datasets? =) Thanks anyway, everybody! =) Your solutions are all very good, and the performance differences aren't big enough to make a difference (at least in my case =)). – jjepsuomi Jun 19 '14 at 07:56
  • @jjepsuomi I think the results certainly make sense, because `bsxfun` is known to be memory hungry, so with those huge data sizes it's bound to get slower. `setdiff`, by its definition, looks perfect for this problem. Thank you for the results BTW! – Divakar Jun 19 '14 at 08:27

3 Answers


Try using setdiff. For example:

c=setdiff(a,b,'rows')

Note, if order is important use:

c = setdiff(a,b,'rows','stable')

Edit: reading the edited question and the comments to this answer, the specific usage of setdiff you look for is (as noticed by Shai):

[temp c] = setdiff(a(:,1:2),b(:,1:2),'rows','stable')
c = a(c,:)

Alternative solution:

You can also just use ismember:

a(~ismember(a(:,1:2),b(:,1:2),'rows'),:)
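
To make both variants concrete, here is a small sketch on made-up data (the matrices and the names `idx`, `keep`, `c1`, and `c2` are illustrative assumptions, not taken from the question). Rows are matched on columns 1 and 2 only, and the third column is carried along:

```matlab
% Hypothetical example data: match on columns 1 and 2 only.
a = [1 2 10;
     3 4 20;
     5 6 30;
     7 8 40];
b = [3 4 99;
     7 8 98];

% Variant 1: setdiff on the key columns, then index back into a.
[~, idx] = setdiff(a(:,1:2), b(:,1:2), 'rows', 'stable');
c1 = a(idx, :);

% Variant 2: logical mask from ismember, true for rows of a to keep.
keep = ~ismember(a(:,1:2), b(:,1:2), 'rows');
c2 = a(keep, :);

% Both variants keep the rows with keys (1,2) and (5,6).
disp(isequal(c1, c2))
```

On this data the two variants agree; they are not guaranteed to agree when a contains repeated key pairs.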
bla
  • +1 But don't you need `setdiff(A(:,1:2),B(:,1:2),'rows')` instead? – Divakar Jun 19 '14 at 06:42
  • When I wrote my answer, the question contained an example of two arrays, similar to those in the answer, that has since been edited out. That's why I always write "for example, ...": if you understand the answer, you can apply it to the question anyway. – bla Jun 19 '14 at 07:00
  • @jjepsuomi Could you post the screenshot image you had in the post before the edits again? – Divakar Jun 19 '14 at 07:03
  • bygones Divakar :) ... the question was answered 3 times already. – bla Jun 19 '14 at 07:04
  • @natan I really thought the screenshot made it easier for everyone to understand. – Divakar Jun 19 '14 at 07:06
  • @Divakar I posted the pic, but there's some problem in the server I think, because it doesn't display it? – jjepsuomi Jun 19 '14 at 07:07
  • @natan well that would give you first two columns only as `c`. Look into [Shai's solution](http://stackoverflow.com/a/24300103/3293881), it has the correct setdiff implementation using the first two columns, I believe. – Divakar Jun 19 '14 at 07:09
  • @natan Sorry, it was messy, but for correctness it was necessary, I guess :) I think you can keep it, but just state the assumption that it's for all columns and not just the first and second columns. Up to you! – Divakar Jun 19 '14 at 07:13
  • From all the mess I thought of an alternative solution with `ismember`... :) – bla Jun 19 '14 at 07:17
  • @natan haha way to avoid the mess! Out of +1s :) – Divakar Jun 19 '14 at 07:19

Use:

compare = bsxfun( @eq, permute( A(:,1:2), [1 3 2]), permute( B(:,1:2), [3 1 2] ) );
twoEq = all( compare, 3 );
toRemove = any( twoEq, 2 ); 
A( toRemove, : ) = [];

Explaining the code:

First, we use bsxfun to compare every row of A with every row of B on their first two columns, resulting in compare of size numRowsA-by-numRowsB-by-2, where compare( ii, jj, kk ) is true exactly when A(ii,kk) == B(jj,kk).
Then we use all to create twoEq of size numRowsA-by-numRowsB, where each entry indicates whether both corresponding entries of A and B are equal.
Finally, we use any to select the rows of A that match at least one row of B.
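
The three steps can be traced on a tiny example. This is only a sketch on made-up matrices (the data is an assumption for illustration), with the intermediate sizes noted in comments:

```matlab
% Hypothetical data: A has 3 rows, B has 2 rows.
A = [1 2 10;
     3 4 20;
     5 6 30];
B = [3 4 0;
     9 9 0];

% Step 1: pairwise comparison of the first two columns.
compare = bsxfun(@eq, permute(A(:,1:2), [1 3 2]), permute(B(:,1:2), [3 1 2]));
szCompare = size(compare);     % 3-by-2-by-2

% Step 2: a pair of rows matches only if both columns match.
twoEq = all(compare, 3);       % 3-by-2 logical

% Step 3: a row of A is removed if it matches any row of B.
toRemove = any(twoEq, 2);      % 3-by-1 logical, true only for row 2

A(toRemove, :) = [];
disp(A)
```

Note that compare holds numRowsA * numRowsB * 2 logical values, so memory use grows with the product of the two row counts.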

What's wrong with original code:

By removing rows of A inside a loop (i.e., A( ... ) = []), you are actually resizing A at almost every iteration. See this post on why exactly this is bad practice.
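
If you prefer to keep a loop, one way to avoid the repeated resizing is to accumulate a logical mask and delete once at the end. This is a minimal sketch on made-up data; the mask name `toRemove` and the matrices are assumptions for illustration:

```matlab
% Hypothetical data for illustration.
A = [1 2 10;
     3 4 20;
     5 6 30;
     7 8 40];
B = [3 4 0;
     7 8 0];

% Preallocate the mask once; A itself is not touched inside the loop.
toRemove = false(size(A, 1), 1);
for k = 1:size(B, 1)
    % Mark rows of A whose first two columns equal row k of B.
    toRemove = toRemove | (A(:,1) == B(k,1) & A(:,2) == B(k,2));
end

% Single deletion, so A is resized only once.
A(toRemove, :) = [];
disp(A)
```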

Using setdiff

In order to use setdiff (as suggested by natan) on only the first two columns, you'll need to use its second output argument:

[ignore, ia] = setdiff( A(:,1:2), B(:,1:2), 'rows', 'stable' );
A = A( ia, : ); % keeping only relevant rows, beyond first two columns.
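
One caveat worth checking against your own data: setdiff returns its result without repetitions, so if A contains several rows that share the same first two columns, the index output refers to only one of them. The sketch below uses made-up data (the matrices and the names `viaSetdiff`, `viaIsmember`, and `keep` are assumptions) and compares this with an ismember mask, which keeps every non-matching row:

```matlab
% Hypothetical data: two rows of A share the key (1,2).
A = [1 2 10;
     1 2 11;
     3 4 20];
B = [3 4 0];

% setdiff works on unique key rows, so only one (1,2) row survives.
[~, ia] = setdiff(A(:,1:2), B(:,1:2), 'rows', 'stable');
viaSetdiff = A(ia, :);

% The ismember mask is evaluated per row, so both (1,2) rows survive.
keep = ~ismember(A(:,1:2), B(:,1:2), 'rows');
viaIsmember = A(keep, :);

fprintf('setdiff keeps %d row(s), ismember keeps %d row(s)\n', ...
        size(viaSetdiff, 1), size(viaIsmember, 1));
```

If the key columns of A are known to be unique, the two approaches select the same rows.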
Shai

Here's another bsxfun implementation -

A(~any(squeeze(all(bsxfun(@eq,A(:,1:2),permute(B(:,1:2),[3 2 1])),2)),2),:)

One more that is dangerously close to Shai's solution, but uses one permute instead of two -

A(~any(all(bsxfun(@eq,A(:,1:2),permute(B(:,1:2),[3 2 1])),2),3),:)
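
On MATLAB R2016b or later, where `==` expands singleton dimensions implicitly, the same idea can be written without bsxfun. This is a sketch on made-up data (the matrices and the names `match` and `result` are assumptions); on older releases the bsxfun form above is still needed:

```matlab
% Requires MATLAB R2016b or later (implicit expansion).
% Hypothetical data for illustration.
A = [1 2 10;
     3 4 20;
     5 6 30];
B = [5 6 0;
     1 2 0];

% A(:,1:2) is nA-by-2 and the permuted B is 1-by-2-by-nB, so the
% comparison expands to nA-by-2-by-nB; all(...,2) collapses the columns.
match = all(A(:,1:2) == permute(B(:,1:2), [3 2 1]), 2);   % nA-by-1-by-nB

% Keep the rows of A that match no row of B.
result = A(~any(match, 3), :);
disp(result)
```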
Divakar