0

I have a set of clusters consisting of 3D points. I want to get the nearest two points from each two clusters.

For example: I have 5 clusters C1 to C5 consisting of a 3D points. For C1 and C2 there are two points Pc1 "point in C1" and Pc2 "point in C2" that are the closet two points between the two clusters C1 and C2, same between C1 and C3..C5 and same between C2 and C3..C5 and so on. After that I'll have 20 points representing the nearest points between the different clusters.

The second thing is that I want to connect this points together if the distance between each of them and the other is less than a certain distance "threshold".

So I'm asking if anyone could please advise me

Update:

Thanks Amro for your answer, I've updated it to CIDX=kmeans(X, K,'distance','cityblock', 'replicates',5); to solve the empty cluster error. But another error appeared "pdistmex Out of memory. Type HELP MEMORY for your options." So I've checked your answer here: Out of memory error while using clusterdata in MATLAB and updated your code as below but the problem now is that there is now an indexing error in this code mn = min(min(D(idx1,idx2))); I'm asking if there is a workaround for this error?

Code used:

%function  single_linkage(depth,clrr)
X = randn(5000,3);
%X=XX;
% clr = clrr;
K=7;
clr = jet(K);
%// cluster into K=4
K = 7;
%CIDX = kmeans(X,K);


%// pairwise distances
SUBSET_SIZE = 1000;            %# subset size
ind = randperm(size(X,1));
data = X(ind(1:SUBSET_SIZE), :);
D = squareform(pdist(data));
subs = 1:size(D,1);
CIDX=kmeans(D, K,'distance','sqEuclidean', 'replicates',5);
centers = zeros(K, size(data,2));
for i=1:size(data,2)
    centers(:,i) = accumarray(CIDX, data(:,i), [], @mean);
end

%# calculate distance of each instance to all cluster centers
D = zeros(size(X,1), K);
for k=1:K
    D(:,k) = sum( bsxfun(@minus, X, centers(k,:)).^2, 2);
end
%D=squareform(D);
%# assign each instance to the closest cluster
[~,clustIDX] = min(D, [], 2);
%// for each pair of clusters
cpairs = nchoosek(1:K,2);
pairs = zeros(size(cpairs)); 
dists = zeros(size(cpairs,1),1);
for i=1:size(cpairs,1)
    %// index of points assigned to each of the two cluster
    idx1 = (clustIDX == cpairs(i,1));
    idx2 = (clustIDX == cpairs(i,2));

    %// shortest distance between the two clusters
    mn = min(min(D(idx1,idx2)));
    dists(i) = mn;

    %// corresponding pair of points with the minimum distance
    [r,c] = find(D(idx1,idx2)==mn);
    s1 = subs(idx1); s2 = subs(idx2);
    pairs(i,:) = [s1(r) s2(c)];
end

%// filter pairs by keeping only those whose distances is below a threshold
thresh = inf;
cpairs(dist>thresh,:) = [];

%// plot 3D points color-coded by clusters
figure('renderer','zbuffer')
%clr = lines(K);
h = zeros(1,K);
for i=1:K
h(i) = line(X(CIDX==i,1), X(CIDX==i,2), X(CIDX==i,3), ...
    'Color',clr(i,:), 'LineStyle','none', 'Marker','.', 'MarkerSize',5);
end
legend(h, num2str((1:K)', 'C%d'))   %'
view(3), axis vis3d, grid on

%// mark and connect nearest points between each pair of clusters
for i=1:size(pairs,1)
    line(X(pairs(i,:),1), X(pairs(i,:),2), X(pairs(i,:),3), ...
        'Color','k', 'LineStyle','-', 'LineWidth',3, ...
        'Marker','o', 'MarkerSize',10);
end
Community
  • 1
  • 1
Tak
  • 3,536
  • 11
  • 51
  • 93
  • start by looking [`pdist2`](http://www.mathworks.com/help/stats/pdist2.html) command. – Shai Aug 07 '13 at 05:51
  • there is this related [SO question](http://stackoverflow.com/questions/18086052/find-all-points-within-a-range-to-any-point-of-an-other-set) that make use of [kd-tree](http://www.mathworks.com/help/stats/kdtreesearcher.rangesearch.html) – marsei Aug 07 '13 at 07:36

1 Answers1

1

What you are asking for sounds similar to what single-linkage clustering does at each step; from the bottoms-up, clusters separated by the shortest distance are combined.

Anyway below is the brute-force way of solving this. I'm sure there are more efficient implementations, but this one is easy to implement.

%// data of 3D points
X = randn(5000,3);

%// cluster into K=4
K = 4;
CIDX = kmeans(X,K);

%// pairwise distances
D = squareform(pdist(X));
subs = 1:size(X,1);

%// for each pair of clusters
cpairs = nchoosek(1:K,2);
pairs = zeros(size(cpairs));
dists = zeros(size(cpairs,1),1);
for i=1:size(cpairs,1)
    %// index of points assigned to each of the two cluster
    idx1 = (CIDX == cpairs(i,1));
    idx2 = (CIDX == cpairs(i,2));

    %// shortest distance between the two clusters
    mn = min(min(D(idx1,idx2)));
    dists(i) = mn;

    %// corresponding pair of points with the minimum distance
    [r,c] = find(D(idx1,idx2)==mn);
    s1 = subs(idx1); s2 = subs(idx2);
    pairs(i,:) = [s1(r) s2(c)];
end

%// filter pairs by keeping only those whose distances is below a threshold
thresh = inf;    %// use your threshold value instead
cpairs(dists>thresh,:) = [];

%// plot 3D points color-coded by clusters
figure('renderer','zbuffer')
clr = lines(K);
h = zeros(1,K);
for i=1:K
    h(i) = line(X(CIDX==i,1), X(CIDX==i,2), X(CIDX==i,3), ...
        'Color',clr(i,:), 'LineStyle','none', ...
        'Marker','.', 'MarkerSize',5);
end
legend(h, num2str((1:K)', 'C%d'))   %'
view(3), axis vis3d, grid on

%// mark and connect nearest points between each pair of clusters
for i=1:size(pairs,1)
    line(X(pairs(i,:),1), X(pairs(i,:),2), X(pairs(i,:),3), ...
        'Color','k', 'LineStyle','-', 'LineWidth',3, ...
        'Marker','o', 'MarkerSize',10);
end

3d points


Note that in the above example the data is randomly generated and not very interesting, so it is hard to see the connected nearest points.

Just for fun, here is another result where I simply replaced the min-distance by the max-distance between pair of clusters (similar to complete-linkage clustering), i.e use:

mx = max(max(D(idx1,idx2)));

instead of the previous:

mn = min(min(D(idx1,idx2)));

max linkage

which shows how we connect the farthest points between each pair of clusters. This visualization is a bit more interesting in my opinion :)

Amro
  • 123,847
  • 25
  • 243
  • 454
  • Thanks, I've updated it to `CIDX=kmeans(X, K,'distance','cityblock', 'replicates',5);` to solve the empty cluster error. But another error appeared "pdistmex Out of memory. Type HELP MEMORY for your options." The input X value I'm using can be found here https://www.dropbox.com/s/y2gfioixz2rt083/XX.mat – Tak Aug 09 '13 at 04:08
  • @user1460166: Your data X is of size `37425*3`, which means `pdist` will compute and allocate a matrix of size `37425*37425` which requires over 10GB of memory! (half of that to be exact which is still around `5GB`). Like I said before, I was sloppy in my implementation and I was simply trying to keep it simple. For such large data, you have to be careful about how much memory you are using. – Amro Aug 09 '13 at 12:34
  • For example we dont need to compute the entire distance matrix `D` all at one, all we need is find the closest (or farthest like I showed last) two points between each pair of clusters, which can be done with almost no memory by looping over the set of points belonging to a pair of clusters, computing the distances one at-a-time, and keeping track of the minimum distance found, along with the corresponding indices. Of course this is a trade-off between speed vs. memory.. – Amro Aug 09 '13 at 12:40
  • Thanks for the suggestion. I agree that looping will take a lot of time, and that's why I think your answer is a much better solution. I'm trying to solve the `pdistmex Out of memory` error using the solution proposed by you here http://stackoverflow.com/questions/2946087/out-of-memory-error-while-using-clusterdata-in-matlab but I keep stucking with other errors, is it possible? what do you think? – Tak Aug 10 '13 at 02:40
  • I've updated my question and provided it with the code used and the my attempts to solve my problem. I'm asking if you could have a look whenever you got the time please. Thanks – Tak Aug 10 '13 at 16:55
  • @user1460166: I'm not sure how that post applies to you. The goal was to cluster a dataset, but to avoid the problem of large data, we only clustered a subset of the data, and "extrapolated" the results to the rest of the instances. In your case, you already have the clusters, but you want to find the closest pairs of points between each two clusters, so you are bound to compute all those distances. The trick like I explained is to avoid storing the entire distance matrix in memory all at once, and to only compute them one-at-a-time. – Amro Aug 10 '13 at 20:01