I am trying to "unshuffle" the rows of a matrix containing the centroids of some clusters, which are not in the same order as the order in which the samples were assigned to the clusters. Initially I compared the absolute value of the distance between the data points of each mean and each cluster center, and assigned the index of the row with the smallest distance. Of course, I am not allowed to have duplicate indexes. It worked pretty well, but symmetric values cause a problem (i.e., due to the absolute value in the distance, mirror clusters were not ordered properly). I also tried to order them by variance, which did not work as expected. I have been looking at the order() and sort() functions and found an example, which did not work:

order(means)
order(means)[centers]
sort(order(means)[centers])
means[sort(order(means)[centers])]

I also tried

apply(means == centers, 1, all)

but of course that just returns FALSE everywhere.

A sample of the matrices:

# One cluster per row (5 clusters, 5 data points each)
means <- matrix(c(
   0.055190097,  0.032412395,  0.015372307, -0.008012372, -0.018736792,
  -0.078138715, -0.058707713, -0.044020629, -0.023750329, -0.014402083,
  -0.069920581, -0.064429216, -0.059913345, -0.052302253, -0.047874074,
   0.050557395,  0.047246979,  0.044577065,  0.040384336,  0.038140009,
   0.114954601,  0.108110051,  0.102531680,  0.093341425,  0.088140310),
  nrow = 5, byrow = TRUE)


# One cluster centroid per row, in shuffled order
centers <- matrix(c(
  -0.038754, -0.021588, -0.008851,  0.008579,  0.016579,
   0.018371,  0.006095, -0.003026, -0.015537, -0.021286,
  -0.078143, -0.069267, -0.062197, -0.051295, -0.045521,
   0.033145,  0.033348,  0.033354,  0.032947,  0.032511,
   0.115464,  0.105248,  0.097172,  0.084732,  0.078162),
  nrow = 5, byrow = TRUE)

For instance (with the above example), line 2 of the means matrix corresponds to line 3 of the centers matrix, as it is the closest in distance (data point by data point). So I have to find which line of means corresponds to which line of centers, with no duplicates. My matrices are bigger, but this should be enough as an example. Do you have any suggestions? Thank you.

Marius
  • A bit of sample data would help. See also http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Joris Meys Jan 23 '12 at 13:11
  • I edited your sample code so it could be copy-pasted more easily, and I also corrected your use of mat.or.vec(). You simply overwrite the matrix you made with a vector, so that didn't really help. Check if the output is what you thought. – Joris Meys Jan 23 '12 at 14:09
  • I'm still trying to understand what you are trying to do. You'll have to clarify, or nobody will be able to help you. It would be interesting to check which outcome you expect given the input, and what the reasoning behind that output is. That would give us something to work with. – Joris Meys Jan 23 '12 at 14:13
  • @JorisMeys Hi. Well, the main idea is that I have the output of a K-means algorithm which was implemented in C, and I wanted to compare it with the output of the K-means function in R. In the output of the C algorithm, the samples are assigned to the clusters, but in the file containing the cluster centroids, the order is not the same as the order in which the samples were assigned to clusters. So, taking the mean of the samples assigned to cluster 1 (for example), its plotted line should overlap the plotted line of the 1st row in the cluster centroid matrix (i.e., the output of the algorithm in C). – Marius Jan 23 '12 at 14:23
  • I meant more like: which rows (or columns???) do you want to match between both matrices. For example: row 1 in means to row 3 in centers, and then explain why. I have a vague idea of where you're coming from, but this is just impossible to answer as I can interpret both your question and your data in many ways. – Joris Meys Jan 23 '12 at 14:32
  • @JorisMeys So, if you copy-paste the example I gave (even if it is poorly programmed in R), the 3rd line of the **centers** matrix corresponds to the 2nd line of the **means** matrix (the closest in distance). In other words, I have to find which line of **means** corresponds to which line of **centers**. – Marius Jan 23 '12 at 14:34
  • What this seems to be boiling down to is: what distance function do you want to use? Rough example: if by "line" you mean "row," then for `centers[3,]` calculate `sum(abs(centers[3,]-means[i,]))` for all `i` and take the minimum of the results. Not my recommended distance function, just seeing if this is what you're getting at. Try not to use "corresponds" since that's got no mathematical meaning here. – Carl Witthoft Jan 23 '12 at 15:45
  • @CarlWitthoft Hi, I already tried the minimal distance between the data points of each row of **means** and each row of **centers**. Due to the absolute value, this does not work for clusters which are symmetrical (mirror images about the X axis). I also tried to order them by variance and/or covariance on top of the computed distance, and this did not help much either. I already stated in the question text what I have tried. And, as usual, try not to pick on the semantics or grammar. That is not the point. The problem description is understandable enough. – Marius Jan 23 '12 at 16:16
  • @Marius, I did not intend to disparage. It really is important to use the correct wording when dealing with mathematics. Just imagine, say, someone using "group" when they mean "set." So, back to the problem at hand: how are you deciding that your approaches so far aren't "working"? You seem to have some external criteria which have preselected the correct "correspondences," but maybe those criteria aren't single-valued, or aren't self-consistent. As Joris said, it really isn't clear what defines the desired mapping from _centers_ to _means_. – Carl Witthoft Jan 23 '12 at 18:50
  • @CarlWitthoft I can send you a plot of what I was able to do so far. I do not know if I am allowed to attach a PDF or an image on the site. I have two matrices. One contains the means of the samples that are assigned to their corresponding cluster, and the other contains the cluster centroids (which, if plotted, should overlap with the means). The main idea is, for each row of the first matrix, to find its corresponding row in the second. By corresponding I mean closest in value at each data point. Absolute value raises the same issue as correlation distance, the two plots being symmetrical about the X axis. – Marius Jan 23 '12 at 22:57
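
For reference, the rough suggestion in the comments (sum of absolute differences between rows, then take the minimum) could be sketched like this. It is only an illustration on the sample matrices from the question, which are rebuilt here so the block runs on its own; the names `l1` and `nearest` are made up for this sketch.

```r
# Sample matrices from the question, one cluster per row.
means <- matrix(c(
   0.055190097,  0.032412395,  0.015372307, -0.008012372, -0.018736792,
  -0.078138715, -0.058707713, -0.044020629, -0.023750329, -0.014402083,
  -0.069920581, -0.064429216, -0.059913345, -0.052302253, -0.047874074,
   0.050557395,  0.047246979,  0.044577065,  0.040384336,  0.038140009,
   0.114954601,  0.108110051,  0.102531680,  0.093341425,  0.088140310),
  nrow = 5, byrow = TRUE)
centers <- matrix(c(
  -0.038754, -0.021588, -0.008851,  0.008579,  0.016579,
   0.018371,  0.006095, -0.003026, -0.015537, -0.021286,
  -0.078143, -0.069267, -0.062197, -0.051295, -0.045521,
   0.033145,  0.033348,  0.033354,  0.032947,  0.032511,
   0.115464,  0.105248,  0.097172,  0.084732,  0.078162),
  nrow = 5, byrow = TRUE)

# l1[i, j] = sum(abs(means[i, ] - centers[j, ])).
# sweep() subtracts centers[j, ] from every row of means.
l1 <- sapply(seq_len(nrow(centers)), function(j) {
  rowSums(abs(sweep(means, 2, centers[j, ])))
})

# For each row of means, the index of the closest row of centers.
nearest <- apply(l1, 1, which.min)
nearest

# TRUE if at least two rows of means picked the same row of centers.
anyDuplicated(nearest) > 0
```

On the sample data, two rows of `means` pick the same row of `centers`, which is the duplicate-index problem the question mentions. The nearest-row rule alone therefore does not give a one-to-one mapping.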

1 Answer

Well, I did not find any built-in function to do the job, so I implemented a recursive algorithm which takes care of it. Even if it is not the R way of programming, at least it solves the problem. A pretty nasty problem in this particular case, I might add, but now it's working. Thanks to all who showed interest in this question.
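
The code for that recursive algorithm is not shown, so the following is not it. It is only a minimal sketch of one way a recursive one-to-one matching could look, assuming the criterion is the smallest total squared distance over all pairs (signed differences, so mirrored rows are not confused). The function name `match_rows` and the variable `d2` are hypothetical, and the sample matrices are rebuilt so the block runs on its own.

```r
means <- matrix(c(
   0.055190097,  0.032412395,  0.015372307, -0.008012372, -0.018736792,
  -0.078138715, -0.058707713, -0.044020629, -0.023750329, -0.014402083,
  -0.069920581, -0.064429216, -0.059913345, -0.052302253, -0.047874074,
   0.050557395,  0.047246979,  0.044577065,  0.040384336,  0.038140009,
   0.114954601,  0.108110051,  0.102531680,  0.093341425,  0.088140310),
  nrow = 5, byrow = TRUE)
centers <- matrix(c(
  -0.038754, -0.021588, -0.008851,  0.008579,  0.016579,
   0.018371,  0.006095, -0.003026, -0.015537, -0.021286,
  -0.078143, -0.069267, -0.062197, -0.051295, -0.045521,
   0.033145,  0.033348,  0.033354,  0.032947,  0.032511,
   0.115464,  0.105248,  0.097172,  0.084732,  0.078162),
  nrow = 5, byrow = TRUE)

# d2[i, j] = squared Euclidean distance between means[i, ] and centers[j, ].
d2 <- sapply(seq_len(nrow(centers)), function(j) {
  rowSums(sweep(means, 2, centers[j, ])^2)
})

# Hypothetical helper: exhaustive recursive search over one-to-one
# assignments of rows to columns of a square cost matrix, pruning any
# branch whose partial total is already no better than the best found.
match_rows <- function(cost) {
  k <- nrow(cost)
  best <- list(total = Inf, perm = integer(0))
  recurse <- function(i, perm, total) {
    if (total >= best$total) return(invisible(NULL))   # prune
    if (i > k) {                                       # all rows assigned
      best <<- list(total = total, perm = perm)
      return(invisible(NULL))
    }
    for (j in setdiff(seq_len(k), perm)) {             # unused columns only
      recurse(i + 1L, c(perm, j), total + cost[i, j])
    }
  }
  recurse(1L, integer(0), 0)
  best
}

res <- match_rows(d2)
res$perm                         # perm[i] = row of centers matched to means[i, ]
centers_sorted <- centers[res$perm, ]   # centers reordered to line up with means
```

Under this criterion the sample data should pair row 2 of `means` with row 1 of `centers`, not row 3, because row 3 of `means` fits row 3 of `centers` even more closely. Whether that is the intended pairing depends on the criterion the question has in mind. The search is exponential in the worst case, so for much larger matrices a dedicated assignment solver such as `solve_LSAP()` from the `clue` package may be a better fit.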

Marius