1

I have two matrices A(m X 3) and B(n X 3); where m >> n.

Numbers in B have close or equal values to numbers in A.

I want to search closest possible values from A to the values present in B in a way that at the end of search, A will reduced to (n X 3).

There are two main issues:

  1. Only a complete row from A can be compared to a complete row in B, where numbers in each column of A and B are varying independently.
  2. Numbers in A and B may be as close as third place of decimal (e.g. 20.101 and 20.103)

I hope I am clear in asking my question. Does anybody know about any function already present in matlab for this thing?

Mushi
  • 367
  • 2
  • 4
  • 14
  • 1
    How do you define closeness between complete rows (according to issue 1)? Which is the metric? – Luis Mendo Feb 25 '14 at 20:13
  • [This](http://stackoverflow.com/a/15081259/1586200) answer should give you an idea about the solution. – Autonomous Feb 26 '14 at 06:53
  • I just want to highlight that the requested output is the size of `B`, making this a search for the closest point in `A`. Tolerances really don't enter into it, it seems, unless you can accept a smaller output. My answer handles it, but it's still not 100% clear how this should be done. – chappjc Feb 26 '14 at 17:23

3 Answers3

4

Depending on how you look at the task, here are two different approaches

Minimum Distance to Each Row in Second Matrix

Two ways to look at this: (1) closest point in A for each point in B, or (2) closest point in B for each point in A.

Closest point in A

For each point in B you can find the closest point in A (e.g. Euclidean distance), as requested in comments:

% Calculate all MxN high-dimensional (3D space) distances at once
distances = squeeze(sum(bsxfun(@minus,B,permute(A,[3 2 1])).^2,2));

% Find closest row in A for each point in B
[~,ik] = min(distances,[],2)

Make an array the size of B containing these closest points in A:

Anew = A(ik,:)

This will implicitly throw out any points in A that are too far from points in B, as long as each point in B does have a match in A. If each point in B does not necessarily have a "match" (point at an acceptable distance) in A, then it is necessary to actively reject points based on distances, resulting in an output that would be shorter than B. This solution seems out of scope.

Closest point in B

Compute the Euclidenan distance from each point (row) in A to each point in B and identify the closest point in B:

distances = squeeze(sum(bsxfun(@minus,A,permute(B,[3 2 1])).^2,2));
[~,ik] = min(distances,[],2)

Make an array the size of A containing these closest points in B:

Anew = B(ik,:)

The size of Anew in this approach is the same as A.

Merging Similar Points in First Matrix

Another approach is to use the undocumented _mergesimpts function.

Consider this test data:

>> B = randi(5,4,3)
B =
     1     4     4
     2     3     4
     1     3     4
     3     4     5
>> tol = 0.001;
>> A = repmat(B,3,1) + tol * rand(size(B,1)*3,3)
A =
    1.0004    4.0005    4.0000
    2.0004    3.0005    4.0008
    1.0004    3.0009    4.0002
    3.0008    4.0005    5.0004
    1.0006    4.0004    4.0007
    2.0008    3.0007    4.0004
    1.0009    3.0007    4.0007
    3.0010    4.0005    5.0004
    1.0002    4.0003    4.0007
    2.0001    3.0001    4.0007
    1.0007    3.0006    4.0004
    3.0001    4.0003    5.0000

Merge similar rows in A according to a specified tolerance, tol:

>> builtin('_mergesimpts',A,tol,'average')
ans =
    1.0004    4.0004    4.0005
    1.0007    3.0007    4.0005
    2.0005    3.0005    4.0006
    3.0006    4.0004    5.0003

Merge similar rows, using B to get expected numbers

>> builtin('_mergesimpts',[A; B],tol,'first')
ans =
     1     3     4
     1     4     4
     2     3     4
     3     4     5
Community
  • 1
  • 1
chappjc
  • 30,359
  • 6
  • 75
  • 132
  • Why has nobody mentioned the most obvious solution, `knnsearch`? Am I missing something? Is `knnsearch` slower than doing `pdist2`? – Autonomous Feb 26 '14 at 05:36
  • @Parag Same reason I didn't suggest pdist2 -- they're both in the Stats toolbox. I try to minimize dependencies when possible. Although, I do break that rule when it comes to the Image Processing toolbox. It would still be a good answer. Go for it! :) – chappjc Feb 26 '14 at 06:10
  • Dear chappjc, thanks for your suggestions. I would like to go with your first approach but there is problem. I don't want values of B to replace values in A. I just want to find values in A which are closest to values in B keeping B as it is. As a last step, I want to remove all extra values of A which do not have any matching in B. How can I achieve this with your 1st approach. Sorry, I am relatively newer to programming. – Mushi Feb 26 '14 at 09:29
  • @Mushi You can achieve it using this (see my solution, second part): `[~, ii] = sort(min(distances,[],2)); Anew = A(ii(1:n),:);` – Luis Mendo Feb 26 '14 at 11:04
  • @Mushi I think the way to do that is to reverse the "direction" distance computations. I've just updated my answer. The first half contains the two directions. You are asking for the closest points in `A` for each point in `B`, the very first part of the updated answer. – chappjc Feb 26 '14 at 17:33
1

To replace each row of A by the closest row of B

You can use pdist2 to compute distance between rows, and then the second output of min to find the index of the minimum-distance row:

[~, ind] = min(pdist2(B,A,'euclidean')); %// or specify some other distance
result = B(ind,:);

The advantage of this approach is that pdist2 lets you specify other distance functions, or even define your own. For example, to use L1 distance change first line to

[~, ind] = min(pdist2(B,A,'cityblock'));

To retain rows of A which are closest to rows of B

Use pdist2 as above. For each row of A compute the minimum distance to rows of B. Retain the n rows of A with lowest value of that minimum distance:

[~, ii] = sort(min(pdist2(B,A,'euclidean'))); %// or use some other distance
result = A(ii(1:n),:);
Luis Mendo
  • 110,752
  • 13
  • 76
  • 147
  • 1
    Thanks for your suggestion. However, for the second part, couldn't you instead look in the opposite direction (see my update)? That is, for each row in `B` find the closest row in `A`, thus ending up with an array that is the size of `B`, containing those points in `A` that were closest? The `sort` is a nice solution too. (+1) But you could end up choosing several points from `A` that match a certain point in `B` before each point in `B` gets a match at all. Perhaps that's the idea of the OP? I'm still not certain. – chappjc Feb 26 '14 at 17:13
  • @chappjc Yes, it can be interpreted in either way. I hadn't thought of that. – Luis Mendo Feb 26 '14 at 17:35
0

Try this code -

%% Create data
m=7;
n=4;
TOL = 0.0005;

A = rand(m,3)/100;
B = rand(n,3)/100;
B(2,:) = A(5,:); % For testing that the matching part of the second row from B must be the fifth row from A

%% Interesting part
B2 = repmat(reshape(B',1,3,n),[m 1]);
closeness_matrix = abs(bsxfun(@minus, A, B2));
closeness_matrix(closeness_matrix<TOL)=0;
closeness_matrix_mean = mean(closeness_matrix,2);  % Assuming that the "closeness" for the triplets on each row can be measured by the mean value of them
X1 = squeeze(closeness_matrix_mean);

[minval,minind] = min(X1,[],1);
close_indices = minind';

A_final = A(close_indices,:)
Divakar
  • 218,885
  • 19
  • 262
  • 358