
What I am trying to do is run two clustering algorithms (let's call them A and B) and then compare the results (e.g. "the algorithms put 80% of the models into the same clusters"). A simple example with 3 models:

cl_A = c(1,2,1) #the result from Algorithm A
cl_B = c(2,1,2) #the result from Algorithm B

What do I hope to get as a solution from this? Well, 100%. Why? Because I can just "rename" the clusters in my head (rename cluster 1 from algorithm B to 2 and cluster 2 to 1) and then find that the two clusterings are exactly the same.
In other words, the models that are in a cluster together are the same, and that is all we care about (we don't care about the "name" of a cluster, and there is no inherent ordering to it).

What about this example?

cl_A = c(1,2,1,3)
cl_B = c(2,1,2,2)

(Note: the length of the vectors is always the same for both, but the cluster labels can be in different ranges.)
In this case I'd like to get 3/4 as an answer (i.e. "rename" cl_B to c(1,2,1,1) and then see that there are 3 elements where cl_A and cl_B agree).
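To make the target concrete, here is a tiny sketch of the manual relabelling in plain R (cl_B_renamed is just cl_B with the labels 1 and 2 swapped by hand):

cl_A         <- c(1, 2, 1, 3)
cl_B_renamed <- c(1, 2, 1, 1)   # cl_B = c(2,1,2,2) with labels 1 and 2 swapped
mean(cl_A == cl_B_renamed)      # 0.75, i.e. 3 of the 4 models land in "the same" cluster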

Now, I have written a function that I checked by hand for simple examples (i.e. the example above), but I can't shake the feeling that it doesn't work for more complicated examples...
If any of you have ideas and/or solutions, feel free to comment them.

This is my function, but I'll first explain what it does: I pass the "cluster vectors" (the vectors of cluster assignments, cl_A and cl_B in the example above) to it. I then loop through all clusters of the first vector and, for each of them, through all clusters of the second vector, and choose the "best" overlap. Because each cluster may only be chosen once (i.e. I can't rename all the "1"s to "2"s and then later decide that I'd also like to rename some "1"s to "3"s, otherwise I'd always get a perfect fit), I keep a "taboo list".
And that's basically all there is to it. However, it feels like it's not working 100% correctly, and I was hoping to find some help here. Thanks already!

#' Cluster similarity
#'
#' Calculate how "similar" two clusterings are:
#' sum(cluster_similarity(cluster1, cluster2)) / length(cluster1)
#' is the fraction of objects that are in a cluster with the same objects as before.
#'
#' @param cluster_vector_1 the first vector of cluster assignments, as for example returned by kmeans()$cluster
#' @param cluster_vector_2 the second vector of cluster assignments
#' @export
#'
cluster_similarity <- function(cluster_vector_1, cluster_vector_2) {
  labels_1 <- unique(cluster_vector_1)
  labels_2 <- unique(cluster_vector_2)

  taboo_list_2       <- c()                        # labels of cluster_vector_2 that are already "used up"
  overall_similarity <- rep(NA, length(labels_1))

  for (i in seq_along(labels_1)) {
    cl1 <- which(cluster_vector_1 == labels_1[i])  # objects in the i-th cluster of A
    similarity <- rep(NA, length(labels_2))
    for (j in seq_along(labels_2)) {
      if (!(labels_2[j] %in% taboo_list_2)) {
        cl2 <- which(cluster_vector_2 == labels_2[j])
        similarity[j] <- sum(cl1 %in% cl2)         # size of the overlap of the two clusters
      }
    }
    if (all(is.na(similarity))) {                  # no cluster of B left to match against
      overall_similarity[i] <- 0
    } else {
      best_j <- which.max(similarity)              # greedily take the best remaining match
      taboo_list_2 <- c(taboo_list_2, labels_2[best_j])  # each cluster of B is matched at most once
      overall_similarity[i] <- similarity[best_j]
    }
  }
  overall_similarity
}

An example:

cl_A = c(1,2,1)
cl_B = c(2,1,2)
cluster_similarity(cl_A,cl_B)

works. But I'm pretty sure some other things don't work...
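One way to sanity-check the greedy matching would be to compute the best possible one-to-one renaming exactly from the contingency table, e.g. with the Hungarian algorithm from the clue package (just a sketch: best_agreement is a made-up name and it assumes clue is installed):

best_agreement <- function(cl_A, cl_B) {
  tab <- table(cl_A, cl_B)                    # contingency table of cluster overlaps
  n   <- max(dim(tab))
  m   <- matrix(0, n, n)                      # pad with zeros so the matrix is square
  m[seq_len(nrow(tab)), seq_len(ncol(tab))] <- tab
  best <- clue::solve_LSAP(m, maximum = TRUE) # optimal one-to-one renaming of clusters
  sum(m[cbind(seq_len(n), as.integer(best))]) / length(cl_A)
}

best_agreement(c(1, 2, 1), c(2, 1, 2))        # 1
best_agreement(c(1, 2, 1, 3), c(2, 1, 2, 2))  # 0.75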

Edit

There seems to be some confusion as to why I am doing this, so let me try to clarify: I have data (who doesn't nowadays), and I obviously can't say exactly where it comes from, but I thought of a nice analogy: think of several Kaggle competitions (call them comp_A, comp_B, ...).
For each competition there are several participants who hand in some results (call them part_1, ..., part_n and inp_1, ..., inp_n respectively).
Now, obviously not all participants hand something in for every competition. Competition A might have hand-ins from participants 1-20, while competition B might only have hand-ins from 1-10 and 20-25.
What I want to do is find out which "participants" are alike.
For example, part_1 is like part_2 and part_10, and so on.

There is no validation set whatsoever (not even a small one), and for each "competition" there are about 20 participants, each with 1 input. These inputs are gigantic (well, 20 MB each, but it adds up).

My idea was then to cluster the participants (or rather, their inputs) for each competition, and see which participants are often in the same cluster (for example, if part_1 and part_2 are in the same cluster for comp_A, comp_B and comp_C, maybe they are alike).
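To make that a bit more concrete, here is a rough sketch of what I have in mind (all names are made up; the input is a list with one named assignment vector per competition, e.g. clusterings$comp_A = c(part_1 = 1, part_2 = 1, part_3 = 2, ...)):

co_cluster_fraction <- function(clusterings) {
  participants <- sort(unique(unlist(lapply(clusterings, names))))
  together <- matrix(0, length(participants), length(participants),
                     dimnames = list(participants, participants))
  shared <- together                          # number of competitions a pair of participants shares
  for (cl in clusterings) {
    p <- names(cl)
    shared[p, p]   <- shared[p, p] + 1
    together[p, p] <- together[p, p] + outer(cl, cl, "==")  # 1 if the pair shares a cluster
  }
  together / pmax(shared, 1)                  # fraction of shared competitions spent in the same cluster
}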
And well, I wasn't aware of any theoretical justification for using one clustering method over another, so I let them all run (and without a validation set it's pretty hard to evaluate them), and now I want to look, as @Logister correctly identified, at the robustness of each clustering algorithm to decide which one might be best.
I hope that clarifies the background of my question, and I'm always open for more constructive ideas!

fußballball
2 Answers


This is a topic of cluster validation. There are already functions in R that give you a "similarity" value between two clusterings, such as the Rand Index and the Adjusted Rand Index. I suggest using them.
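A minimal sketch of how that could look, assuming the mclust package is installed (other packages provide similar functions):

cl_A <- c(1, 2, 1, 3)
cl_B <- c(2, 1, 2, 2)
mclust::adjustedRandIndex(cl_A, cl_B)   # invariant to relabelling; 1 means identical partitions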

The Adjusted Rand Index is the best approach for measuring agreement between clusterings.

ARI measures not only the correct separation of elements belonging to different classes, but also the relation between elements of the same class (who said that).

The ARI function can be found here.


The math behind ARI is not basic at all. Because of this, I suggest checking out the Rand Index measure, which is very simple to understand, and implementing it yourself.
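For example, a minimal hand-rolled (unadjusted) Rand Index could look like this (just a sketch, rand_index is not an existing function):

rand_index <- function(cl_A, cl_B) {
  pairs  <- combn(length(cl_A), 2)                # all unordered pairs of objects
  same_A <- cl_A[pairs[1, ]] == cl_A[pairs[2, ]]  # pair in the same cluster under A?
  same_B <- cl_B[pairs[1, ]] == cl_B[pairs[2, ]]  # pair in the same cluster under B?
  mean(same_A == same_B)                          # share of pairs on which A and B agree
}

rand_index(c(1, 2, 1), c(2, 1, 2))  # 1: the two partitions are identical up to relabelling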


Note: Your function does not take into account the case where the similarity vector has more than one maximum. What should happen then? I suggest watching this video.

Homunculus
  • Hi @Homunculus Thanks for the reply! Thanks also for the link to the article. As it is called "On the Use of the Adjusted Rand Index as a Metric for Evaluating Supervised Classification" I am not sure how to apply it to my case - I have no validation data / no true class labels. I am simply trying to cluster (i.e. say two points are close to each other --> let's put them in the same cluster). Because I only have 500 characters here I added a rather lengthy edit to my question, in case you are still interested. But anyways, thank you already! – fußballball Nov 26 '17 at 00:26
  • Sorry for the confusion. However, ARI will return values closer to 1 the more alike the clusterings are (1 meaning they are exactly the same). So, if participant A got the same results as participant B, but PA's `group 1` is PB's `group 2`, and so on, the ARI will return 1. Then you can check which one is more robust, and compare whether anyone else got the same, to see which participant wins the competition. – Homunculus Nov 26 '17 at 00:34
  • @user18093 Is that what you asked for? I highly suggest watching the video I recommended, as I think it will give you more hints on how to enhance your function, if you don't want to use R's Adjusted Rand Index – Homunculus Nov 26 '17 at 00:38
  • Ah, thanks! So you'd say I could use the ARI index anyways, even though it is not supervised? If so, that indeed is VERY interesting! If you have any more sources about it that you'd recommend, I'd be happy to hear about them! – fußballball Nov 26 '17 at 00:38
  • Oh, yes. The article's focus was on supervised classification, but it is the most recommended metric, as far as I know, for checking the agreement of clusterings, regardless of the type of clustering you used. I'm quite a newbie in data mining, so I don't have that many suggestions. But I can give you one more: check out the book `Algorithms for Clustering Data` (Jain and Dubes). It's one of the bibles – Homunculus Nov 26 '17 at 00:46
  • Thanks a lot for all your hints! I accepted your answer, and I'm really grateful for your help! I think I'll just use the ARI Index – fußballball Nov 26 '17 at 00:48
  • Also, you can check out the `Confusion Matrix`. This could help you enhance your function – Homunculus Nov 26 '17 at 00:53

Usually "comparing results" between different algorithms isn't done with reference to the extent of agreement between them. So what if the algorithms have agreement? I'd suggest taking a step back and asking what you are trying to accomplish.

Typically what matters is the extent to which your clusters predict or identify some other phenomenon. E.g. if you are trying to use clustering to do some kind of classification, a good way to evaluate the models is to look at classification entropy.

The only reason I can think of why someone would want to do what you are doing is to check whether the clusters are 'robust'. If that is what you are trying to evaluate, comparing two different algorithms won't get you where you want to go; you have to compare like with like. I would suggest doing some cross-validation/sub-sampling and checking whether the same algorithm agrees with itself across different iterations. R should have built-in functions that do this for you.
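A rough illustration of that sub-sampling idea (k-means and mclust::adjustedRandIndex are only stand-ins for whatever algorithm and agreement measure you end up using; stability is a made-up name):

library(mclust)

stability <- function(data, k, n_reps = 50, frac = 0.8) {
  full <- kmeans(data, centers = k)$cluster      # clustering on the full data
  replicate(n_reps, {
    idx <- sample(nrow(data), size = floor(frac * nrow(data)))  # random sub-sample
    sub <- kmeans(data[idx, , drop = FALSE], centers = k)$cluster
    adjustedRandIndex(full[idx], sub)            # agreement on the shared points
  })
}

# e.g. summary(stability(scale(iris[, 1:4]), k = 3))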

Logister
  • Hi @Logister You're right, I'm trying to see if they are "robust". The only problem I'm having with cross-validation is that I have only 20 data points for each "competition" (see my question edit) and don't have any validation data, so I'm kind of stumped as to how to implement your suggestion. Do you have any practical ideas? Cause I can't think of what to do - I don't even know how many clusters there really are – fußballball Nov 26 '17 at 00:23