
K-Medoids and K-Means are two popular methods of partitional clustering. My reading suggests that K-Medoids is better at clustering data when there are outliers (source). The usual explanation is that it restricts cluster centers to actual data points (and minimises a sum of dissimilarities, e.g. Manhattan distances), whereas K-Means places each center at the mean that minimises the within-cluster sum of squares, so it is pulled around more by outliers.
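As a quick sanity check of that claim (my own toy illustration, not taken from the source), the mean of a small one-dimensional sample is dragged toward an outlier, while the medoid, being constrained to be one of the data points, stays near the bulk of the data:

v <- c(1, 2, 3, 4, 100)              # 100 is an outlier
mean(v)                              # 22 -- pulled far toward the outlier
d <- as.matrix(dist(v))              # pairwise distances between the points
v[which.min(rowSums(d))]             # medoid: 3 -- still inside the bulk of the data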

This makes sense; however, when I run a simple test on made-up data, the results do not suggest that K-Medoids deals with outliers any better. In fact, it is sometimes worse. My question is: where in the following test have I gone wrong? Perhaps I have some fundamental misunderstanding of these methods.

Demonstration (see here for pictures). First, some made-up data (named 'comp') that forms 3 obvious clusters:

x <- c(2, 3, 2.4, 1.9, 1.6, 2.3, 1.8, 5, 6, 5, 5.8, 6.1, 5.5, 7.2, 7.5, 8, 7.2, 7.8, 7.3, 6.4)
y <- c(3, 2, 3.1, 2.6, 2.7, 2.9, 2.5, 7, 7, 6.5, 6.4, 6.9, 6.5, 7.5, 7.25, 7, 7.8, 7.5, 8.1, 7)

comp <- data.frame(x, y)

library(ggplot2)
ggplot(comp, aes(x, y)) + geom_point(alpha=.5, size=3, pch = 16)

See *figure 1* in the link

The data is clustered with the 'vegclust' package, which can do both K-Means and K-Medoids.

library(vegclust)
# Each fit is run in turn; both are stored in 'k' and then plotted with the code below.
k <- vegclust(x=comp, mobileCenters=3, method="KM", nstart=100, iter.max=1000)   # K-Means
k <- vegclust(x=comp, mobileCenters=3, method="KMdd", nstart=100, iter.max=1000) # K-Medoids

When making a scatterplot, both K-Means and K-Medoids pick up the 3 obvious clusters.

color <- k$memb[,1] + k$memb[,2]*2 + k$memb[,3]*3 # Turn the 0/1 membership matrix into a cluster label (1, 2 or 3), used as the point colour

# K-Means scatterplot (after fitting with method="KM" above)
ggplot(comp, aes(x, y)) + geom_point(alpha=.5, color=color, pch = 16, size=3)

# K-Medoids scatterplot (after re-fitting with method="KMdd" and recomputing 'color')
ggplot(comp, aes(x, y)) + geom_point(alpha=.5, color=color, size=3, pch = 16)

See *figure 2* in the link
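For reference, the two fits and plots above can also be run in one loop. This is only a sketch of the same workflow; the which.max conversion of the membership matrix to hard labels is my own assumption and should be equivalent to the colour arithmetic used above.

for (m in c("KM", "KMdd")) {
  fit <- vegclust(x=comp, mobileCenters=3, method=m, nstart=100, iter.max=1000)
  lab <- apply(fit$memb, 1, which.max)   # hard cluster label (1, 2 or 3) per point
  print(ggplot(comp, aes(x, y)) + geom_point(alpha=.5, color=lab, pch = 16, size=3))
}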

Now an outlier is added:

comp[21, ] <- c(3, 7.5)   # the outlier becomes row 21

This outlier shifts the center of the blue cluster to the left of the graph.
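That shift can be checked numerically. Below is a rough sketch; I am assuming that the "blue" middle cluster consists of rows 8 to 13 plus row 20 (the actual colour assignment is random):

mid <- comp[c(8:13, 20), ]           # the points of the middle cluster (assumed membership)
colMeans(mid)                        # centroid before adding the outlier
colMeans(rbind(mid, comp[21, ]))     # centroid after the outlier: the x-coordinate moves left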

As a result, when K-Medoids is run on the new data, the right-most point of the blue cluster breaks off and joins the red cluster.

See *figure 3* in the link
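The medoid of that cluster can be checked in the same way (again a sketch, under the assumption that rows 8 to 13, row 20, and the outlier in row 21 end up together). Unlike the centroid, it has to be one of the data points:

blue <- comp[c(8:13, 20, 21), ]      # middle cluster plus the outlier (assumed membership)
d <- as.matrix(dist(blue))           # pairwise Euclidean distances within the cluster
blue[which.min(rowSums(d)), ]        # the medoid: the point with the smallest total distance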

Interestingly, depending on the random initial cluster centers, K-Means occasionally produces better (more intuitive) clusters on the new data (you may have to run it several times to get the correct clustering), whereas K-Medoids always produces the wrong clusters.

See *figure 4* in the link
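To put a number on "occasionally", one can re-run K-Means with single random starts and count how often the intuitive clustering appears. The sketch below is my own check, not code from the post; the criterion (the right-most point of the middle cluster, row 20, staying with row 8 and apart from row 14) is an assumption about what the intuitive result looks like.

set.seed(42)
runs <- 40
hits <- 0
for (i in 1:runs) {
  fit <- vegclust(x=comp, mobileCenters=3, method="KM", nstart=1, iter.max=1000)
  lab <- apply(fit$memb, 1, which.max)           # hard cluster labels for this run
  if (lab[20] == lab[8] && lab[20] != lab[14]) hits <- hits + 1
}
hits / runs                                      # fraction of runs giving the 'intuitive' clustering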

As you can see from this example, K-Means was actually better at handling the outlier than K-Medoids (same data, same package, etc.). Have I done something wrong in my test, or have I misunderstood how these methods work?

  • This is more of a statistical methodology question than a specific programming question. As such it's a better fit for [stats.se] than for Stack Overflow. – MrFlick Nov 23 '15 at 04:23
  • Thank you MrFlick. I will probably post it there, though I hope someone on here may help me out anyway since I have seen similar questions about methodology on this site. – Yang Li Nov 23 '15 at 07:21
  • @Anony-Mousse It would have been "cheating" if I chose the best K-Means output and compared it to the worst K-Medoids output. In my post, I compared the best of both outputs, and K-Means was right about 25% of the time (compared to 0% for K-Medoids) – Yang Li Nov 23 '15 at 19:19
  • @Anony-Mousse To answer your question, the 'correct' clustering has a total sum of squares of 12.48, while the 'incorrect' one is lower at 12.47 (probably why it shows up 75% of the time). There is nothing wrong with doing multiple runs and choosing the best, which is often done when clustering in research. – Yang Li Nov 23 '15 at 19:29
  • @Anony-Mousse In research, I would not run K-Means multiple times and just choose the one with the lowest SSE, because another run may have converged at a local minimum which looks more intuitive. The idea here is that if a researcher used K-Means and K-Medoids on the data provided multiple times, they would sometimes get a good result from K-Means, but _never_ from K-Medoids. Does that not contradict the claim that K-Medoids is better when there are outliers? – Yang Li Nov 23 '15 at 19:38
  • @Anony-Mousse in that case, suppose I could not see which result looked the best. If I just ran the data once through K-Means and K-Medoids, I have a ~25% chance of getting the correct result from K-Means (which I would not know was the correct result), and I have a 0% chance of getting the correct result from K-Medoids. The 'toy example' was so it would be obvious how each method handled outliers, how else could I test it? In any case, **the idea behind this question is not changed by what data I use for demonstration - K-Medoids is supposed to be better at ignoring outliers, but 0% < 25%** – Yang Li Nov 23 '15 at 23:04
  • It's not perfect, and it does not ignore outliers; it is just usually less sensitive. Consider this 1-dimensional test data set: 0 1 2 3 7 8 9 10 20, with 2 clusters and one outlier. – Has QUIT--Anony-Mousse Nov 24 '15 at 06:56
  • So what you mean is that K-Medoids is actually usually better, but the data in this demonstration was for some reason exceptional? Perhaps I will do tests on different data. Thanks for the discussion anyway. – Yang Li Nov 24 '15 at 07:05
  • Yes. The outlier may not be extreme enough, for example, compared to the regular cluster spread, and the separation of the clusters. Try placing it at 2,6 – Has QUIT--Anony-Mousse Nov 24 '15 at 11:32

0 Answers