
I want to split this data,

ID x    y
1  2.5  3.5
1  85.1 74.1
2  2.6  3.4
2  86.0 69.8
3  25.8 32.9
3  84.4 68.2
4  2.8  3.2
4  24.1 31.8
4  83.2 67.4

I was able to match each row with its partner, like this:

ID x    y    ID x    y   
1  2.5  3.5  1  85.1 74.1
2  2.6  3.4  2  86.0 69.8
             3  25.8 32.9
             4  24.1 31.8

However, as you can see, some of the rows for ID 4 ended up in the wrong place, because they were simply appended to the next free rows. I want to split them properly without the complex logic I am currently using... Can someone give me an algorithm or idea?

It should look like this:

ID x    y    ID x    y      ID x    y 
1  2.5  3.5  1  85.1 74.1   3  25.8 32.9
2  2.6  3.4  2  86.0 69.8   4  24.1 31.8
4  2.8  3.2  3  84.4 68.2
             4  83.2 67.4

1 Answer

It seems that your question is really about clustering, and that the ID column has nothing to do with determining which points correspond to which.

A common algorithm for this is k-means clustering. However, your question implies that you don't know the number of clusters in advance. This complicates matters, and there have already been quite a few questions asked here on StackOverflow regarding this issue:

  1. Kmeans without knowing the number of clusters?
  2. compute clustersize automatically for kmeans
  3. How do I determine k when using k-means clustering?
  4. How to optimal K in K - Means Algorithm
  5. K-Means Algorithm
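For instance, one heuristic that comes up repeatedly in those threads is the "elbow" method: run k-means for increasing k and pick the point where the within-cluster sum of squares (SSE) stops dropping sharply. A minimal Python/NumPy sketch (my own illustration, not code from the linked answers; it uses a deterministic farthest-point initialisation to keep the example reproducible):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Plain Lloyd's algorithm with deterministic farthest-point init."""
    centers = X[[0]]                  # start from the first point
    while len(centers) < k:           # repeatedly add the point farthest
        d = np.linalg.norm(X[:, None] - centers[None], axis=2).min(axis=1)
        centers = np.vstack([centers, X[d.argmax()]])
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sse = ((X - centers[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, sse

# x/y columns of the sample data (the ID column plays no role here)
X = np.array([[2.5, 3.5], [85.1, 74.1], [2.6, 3.4], [86.0, 69.8],
              [25.8, 32.9], [84.4, 68.2], [2.8, 3.2], [24.1, 31.8],
              [83.2, 67.4]])

for k in range(1, 5):
    print(k, round(kmeans(X, k)[1], 1))  # SSE drops steeply up to k = 3
```

On this data the SSE collapses between k = 2 and k = 3 and only improves marginally afterwards, which suggests three clusters.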

Unfortunately, there is no single "right" solution for this. Two clusters in one specific problem could indeed be considered one cluster in another problem. This is why you'll have to decide that for yourself.

Nevertheless, if you're looking for something simple (and probably inaccurate), you can use Euclidean distance as a measure. Compute the distances between points (e.g. using pdist), and group points where the distance falls below a certain threshold.

Example

%// Sample input
A = [1,  2.5,  3.5;
     1,  85.1, 74.1;
     2,  2.6,  3.4;
     2,  86.0, 69.8;
     3,  25.8, 32.9;
     3,  84.4, 68.2;
     4,  2.8,  3.2;
     4,  24.1, 31.8;
     4,  83.2, 67.4];

%// Cluster points
pairs = nchoosek(1:size(A, 1), 2); %// Rows of pairs
d = sqrt(sum((A(pairs(:, 1), 2:3) - A(pairs(:, 2), 2:3)) .^ 2, 2)); %// d = pdist(A(:, 2:3)), ignoring the ID column
thr = d < 10;                      %// Distances below threshold
kk = 1;
idx = 1:size(A, 1);
C = cell(size(idx));               %// Preallocate memory
while any(idx)
     seed = find(idx, 1);              %// First point not yet assigned
     p = pairs(pairs(:, 1) == seed & thr, :); %// Pairs involving the seed
     x = unique([seed; p(:)]);         %// Include the seed itself, so an
                                       %// isolated point forms its own
                                       %// cluster instead of looping forever
     C{kk} = A(x, :);
     idx(x) = 0;                       %// Remove indices from list
     kk = kk + 1;
end
C = C(~cellfun(@isempty, C));      %// Remove empty cells

The result is a cell array C, each cell representing a cluster:

C{1} =
    1.0000    2.5000    3.5000
    2.0000    2.6000    3.4000
    4.0000    2.8000    3.2000

C{2} =
    1.0000   85.1000   74.1000
    2.0000   86.0000   69.8000
    3.0000   84.4000   68.2000
    4.0000   83.2000   67.4000

C{3} = 
    3.0000   25.8000   32.9000
    4.0000   24.1000   31.8000

Note that this simple approach has the flaw of restricting the cluster radius to the threshold. However, you wanted a simple solution, so bear in mind that it gets complicated as you add more "clustering logic" to the algorithm.
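For comparison, here is a rough Python/NumPy translation of the same grouping step (my own sketch, not part of the original answer; it measures distance on the x/y columns only):

```python
import numpy as np

# Sample input: ID, x, y (same data as above)
A = np.array([[1, 2.5, 3.5], [1, 85.1, 74.1], [2, 2.6, 3.4],
              [2, 86.0, 69.8], [3, 25.8, 32.9], [3, 84.4, 68.2],
              [4, 2.8, 3.2], [4, 24.1, 31.8], [4, 83.2, 67.4]])

thr = 10.0
clusters = []
unassigned = set(range(len(A)))
while unassigned:
    seed = min(unassigned)
    # group the seed with every unassigned point within thr of it
    group = {i for i in unassigned
             if np.linalg.norm(A[seed, 1:] - A[i, 1:]) < thr}
    clusters.append(A[sorted(group)])
    unassigned -= group

for c in clusters:
    print(c)
```

This produces the same three groups as the MATLAB cell array `C`, and it has the same limitation: each cluster is grown only around its first unassigned point, so the threshold effectively caps the cluster radius.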

Eitan T
  • Nice way! I have a question for you, the way you are calculating d, is a form of root-mean-square? – SpcCode Jan 23 '13 at 01:10
  • @SpcCode Where did you see "mean"? :) No, it's [Euclidean distance](http://en.wikipedia.org/wiki/Euclidean_distance), _i.e._ the root of the sum of the squared coordinate differences. – Eitan T Jan 23 '13 at 09:15
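To make that distinction concrete (a quick numeric check, not from the original thread): Euclidean distance takes the root of the *sum* of squared differences, while a root-mean-square would divide by the number of coordinates before taking the root.

```python
import math

# two points from the sample data (ID 1's pair)
a, b = (2.5, 3.5), (85.1, 74.1)
diffs = [(p - q) ** 2 for p, q in zip(a, b)]

euclidean = math.sqrt(sum(diffs))         # what the answer computes
rms = math.sqrt(sum(diffs) / len(diffs))  # root-MEAN-square: divides by n

print(euclidean, rms)  # the two values differ by a factor of sqrt(2) here
```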