
I want to split this data,

ID x    y
1  2.5  3.5
1  85.1 74.1
2  2.6  3.4
2  86.0 69.8
3  25.8 32.9
3  84.4 68.2
4  2.8  3.2
4  24.1 31.8
4  83.2 67.4

I was able to match each row with its partner, like this:

ID x    y    ID x    y   
1  2.5  3.5  1  85.1 74.1
2  2.6  3.4  2  86.0 69.8
             3  25.8 32.9
             4  24.1 31.8

However, as you can see, some of the rows for ID 4 ended up in the wrong place, because they were simply appended to the next free rows. I want to split them properly without the complex logic I am currently using... Can someone give me an algorithm or idea?

It should look like this:

ID x    y    ID x    y      ID x    y 
1  2.5  3.5  1  85.1 74.1   3  25.8 32.9
2  2.6  3.4  2  86.0 69.8   4  24.1 31.8
4  2.8  3.2  3  84.4 68.2
             4  83.2 67.4

1 Answer

It seems that your question is really about clustering, and that the ID column has nothing to do with determining which points correspond to which.

A common algorithm for this is k-means clustering. However, your question implies that you don't know the number of clusters in advance. This complicates matters, and there have already been quite a few questions asked here on StackOverflow regarding this issue:

  1. Kmeans without knowing the number of clusters?
  2. compute clustersize automatically for kmeans
  3. How do I determine k when using k-means clustering?
  4. How to optimal K in K - Means Algorithm
  5. K-Means Algorithm
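For instance, one heuristic that comes up repeatedly in those threads is the "elbow" method: run k-means for increasing k and pick the point where the within-cluster sum of squares (SSE) stops dropping sharply. A minimal Python/NumPy sketch (my own illustration, not code from the linked answers; it uses a deterministic farthest-point initialisation to keep the example reproducible):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Plain Lloyd's algorithm with deterministic farthest-point init."""
    centers = X[[0]]                  # start from the first point
    while len(centers) < k:           # repeatedly add the point farthest
        d = np.linalg.norm(X[:, None] - centers[None], axis=2).min(axis=1)
        centers = np.vstack([centers, X[d.argmax()]])
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sse = ((X - centers[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, sse

# x/y columns of the sample data (the ID column plays no role here)
X = np.array([[2.5, 3.5], [85.1, 74.1], [2.6, 3.4], [86.0, 69.8],
              [25.8, 32.9], [84.4, 68.2], [2.8, 3.2], [24.1, 31.8],
              [83.2, 67.4]])

for k in range(1, 5):
    print(k, round(kmeans(X, k)[1], 1))  # SSE drops steeply up to k = 3
```

On this data the SSE collapses between k = 2 and k = 3 and only improves marginally afterwards, which suggests three clusters.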

Unfortunately, there is no single "right" solution for this. Two clusters in one specific problem could indeed be considered one cluster in another problem. This is why you'll have to decide that for yourself.

Nevertheless, if you're looking for something simple (and probably inaccurate), you can use Euclidean distance as a measure. Compute the distances between points (e.g. using pdist), and group points where the distance falls below a certain threshold.

Example

%// Sample input
A = [1,  2.5,  3.5;
     1,  85.1, 74.1;
     2,  2.6,  3.4;
     2,  86.0, 69.8;
     3,  25.8, 32.9;
     3,  84.4, 68.2;
     4,  2.8,  3.2;
     4,  24.1, 31.8;
     4,  83.2, 67.4];

%// Cluster points
pairs = nchoosek(1:size(A, 1), 2); %// Rows of pairs
d = sqrt(sum((A(pairs(:, 1), 2:3) - A(pairs(:, 2), 2:3)) .^ 2, 2)); %// d = pdist(A(:, 2:3)), ignoring the ID column
thr = d < 10;                      %// Distances below threshold
kk = 1;
idx = 1:size(A, 1);
C = cell(size(idx));               %// Preallocate memory
while any(idx)
     seed = find(idx, 1);              %// First point not yet assigned
     p = pairs(pairs(:, 1) == seed & thr, :); %// Pairs involving the seed
     x = unique([seed; p(:)]);         %// Include the seed itself, so an
                                       %// isolated point forms its own
                                       %// cluster instead of looping forever
     C{kk} = A(x, :);
     idx(x) = 0;                       %// Remove indices from list
     kk = kk + 1;
end
C = C(~cellfun(@isempty, C));      %// Remove empty cells

The result is a cell array C, each cell representing a cluster:

C{1} =
    1.0000    2.5000    3.5000
    2.0000    2.6000    3.4000
    4.0000    2.8000    3.2000

C{2} =
    1.0000   85.1000   74.1000
    2.0000   86.0000   69.8000
    3.0000   84.4000   68.2000
    4.0000   83.2000   67.4000

C{3} = 
    3.0000   25.8000   32.9000
    4.0000   24.1000   31.8000

Note that this simple approach has the flaw of restricting the cluster radius to the threshold. However, you wanted a simple solution, so bear in mind that it gets complicated as you add more "clustering logic" to the algorithm.
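For comparison, here is a rough Python/NumPy translation of the same grouping step (my own sketch, not part of the original answer; it measures distance on the x/y columns only):

```python
import numpy as np

# Sample input: ID, x, y (same data as above)
A = np.array([[1, 2.5, 3.5], [1, 85.1, 74.1], [2, 2.6, 3.4],
              [2, 86.0, 69.8], [3, 25.8, 32.9], [3, 84.4, 68.2],
              [4, 2.8, 3.2], [4, 24.1, 31.8], [4, 83.2, 67.4]])

thr = 10.0
clusters = []
unassigned = set(range(len(A)))
while unassigned:
    seed = min(unassigned)
    # group the seed with every unassigned point within thr of it
    group = {i for i in unassigned
             if np.linalg.norm(A[seed, 1:] - A[i, 1:]) < thr}
    clusters.append(A[sorted(group)])
    unassigned -= group

for c in clusters:
    print(c)
```

This produces the same three groups as the MATLAB cell array `C`, and it has the same limitation: each cluster is grown only around its first unassigned point, so the threshold effectively caps the cluster radius.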

Eitan T
  • Nice way! I have a question for you, the way you are calculating d, is a form of root-mean-square? – SpcCode Jan 23 '13 at 01:10
  • @SpcCode Where did you see "mean"? :) No, it's [Euclidean distance](http://en.wikipedia.org/wiki/Euclidean_distance), _i.e._ the root of the sum of the squared coordinate differences. – Eitan T Jan 23 '13 at 09:15
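To make that distinction concrete (a quick numeric check, not from the original thread): Euclidean distance takes the root of the *sum* of squared differences, while a root-mean-square would divide by the number of coordinates before taking the root.

```python
import math

# two points from the sample data (ID 1's pair)
a, b = (2.5, 3.5), (85.1, 74.1)
diffs = [(p - q) ** 2 for p, q in zip(a, b)]

euclidean = math.sqrt(sum(diffs))         # what the answer computes
rms = math.sqrt(sum(diffs) / len(diffs))  # root-MEAN-square: divides by n

print(euclidean, rms)  # the two values differ by a factor of sqrt(2) here
```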