I have a dataset with 6497 instances, 12 attributes, and a class variable called q (quality). The class values range from 3 to 9. The data can be downloaded in CSV format from here

I am running k-means clustering on this dataset and would like to plot the result. But something seems to be wrong with the plots I'm generating, because I don't think they represent the clusters. The plot I'm trying to generate is adapted from this SO answer: How to create a cluster plot in R?

Here is what I'm doing:

library(vegan)
winequality <- read.csv("wine_nocolor.csv")
express <- winequality[, c("fa", "va", "ca", "rs", "ch", "fsd", "tsd", "d", "p", "s", "a")]
rownames(express) <- winequality$id
str(express) #'data.frame': 6497 obs. of  11 variables
kclus <- kmeans(express,centers= 3, iter.max=1000, nstart=10000) #takes a bit of time
wine_dist <- dist(express)
cmd <- cmdscale(wine_dist) #takes bit of time
groups <- levels(factor(kclus$cluster))
ordiplot(cmd, type = "n") #shows warning that Species scores not available
cols <- c("steelblue", "darkred", "darkgreen")
for(i in seq_along(groups)){
    points(cmd[factor(kclus$cluster) == groups[i], ], col = cols[i], pch = 16)
}

# add spider and hull
ordispider(cmd, factor(kclus$cluster), label = TRUE)
ordihull(cmd, factor(kclus$cluster), lty = "dotted")

The above code produces the following plot. But as you can see, the clusters are not clearly separated.

[Image: MDS plot of the k-means clusters, with axes Dim1 and Dim2]

Questions

  • What are Dim1 and Dim2?
  • How can I fix this?
  • Additionally, does R offer a way to produce a plot similar to the plot generated by scikit for showing clusters and centroids?
birdy
  • You are clustering over 11 variables, so it is normal that the clusters do not look separated on a two-dimensional plot. By the way, I would first try to reduce the number of variables before applying k-means. You might get much better results – RockScience Apr 01 '15 at 01:37
  • OK, thanks for clarifying. I still need to understand what Dim1 and Dim2 mean, and whether it's possible to create a plot similar to this http://scikit-learn.org/stable/_images/plot_kmeans_digits_0011.png Here the class values could be from 1 to 10 and they chose 10 clusters – birdy Apr 01 '15 at 01:46
  • The data doesn't cluster - at least not with kmeans. **The produced clusters are meaningless.** There is no separation or structure captured. – Has QUIT--Anony-Mousse Apr 01 '15 at 06:10
  • The tiles appear to be produced using a Voronoi diagram, which is not clustering data _per se_. – Roman Luštrik Apr 01 '15 at 07:39
  • @Anony-Mousse are you saying the wine data doesn't cluster?? – Anthony Apr 02 '15 at 12:15
  • Not with this projection/preprocessing/normalization. The clusters he got are meaningless; it's simply splitting the data into three *slices* based on the first principal component. – Has QUIT--Anony-Mousse Apr 02 '15 at 12:57

2 Answers


The author of this code (from the other SO question) uses MDS (multidimensional scaling) to reduce the data to two dimensions before plotting the clusters. Dim1 and Dim2 are the two coordinates that cmdscale returns.

Read ?cmdscale to understand.

Also some good sources here and here.
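
To make Dim1 and Dim2 a bit more concrete, here is a minimal sketch on made-up data (the data frame `x` and its columns are invented for illustration, not the wine file). As far as I understand it, when the distances are plain Euclidean, as with `dist(express)`, the classical MDS coordinates from `cmdscale` match the first two principal component scores from `prcomp`, up to the sign of each axis:

```r
# Synthetic stand-in for the wine data (hypothetical columns, not the real file)
set.seed(1)
x <- data.frame(
  a = rnorm(200, mean = 100, sd = 50),   # large-scale variable
  b = rnorm(200, mean = 5,   sd = 1),
  c = rnorm(200, mean = 0.5, sd = 0.1)
)

# Classical MDS on Euclidean distances: the two columns are Dim1 and Dim2
mds <- cmdscale(dist(x), k = 2)

# Unscaled PCA: scores on the first two principal components
pca <- prcomp(x)$x[, 1:2]

# Largest absolute difference once the arbitrary axis signs are ignored
max_diff <- max(abs(abs(mds) - abs(pca)))
print(max_diff)
```

So on unscaled data, Dim1 is essentially the direction of largest variance, which is consistent with the comment above about the clusters being slices along the first principal component.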

Whether you do this dimension reduction, and whether you do it before or after the clustering, is your choice. I am not sure there is anything "to fix" in this code; it is more a matter of deciding what you want to do and plot. I would suggest you first try to reduce the number of variables before clustering. 11 is really a lot. Are they all useful?

Also remember that the variables need to be normalized before applying k-means, otherwise the attributes with the largest numeric range dominate the distances.
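
A rough sketch of what that could look like, again on made-up data rather than the wine file (the column names `big`, `small1` and `small2` are hypothetical). It standardizes the columns with `scale`, runs `kmeans`, and then plots the points and the cluster centres on the first two principal components, which is roughly the kind of picture the scikit example shows, minus the shaded background:

```r
set.seed(42)
# Synthetic stand-in: two groups that differ only in the small-scale variables
x <- data.frame(
  big    = rnorm(300, mean = 100, sd = 50),
  small1 = c(rnorm(150, 0, 0.1), rnorm(150, 1, 0.1)),
  small2 = c(rnorm(150, 0, 0.1), rnorm(150, 1, 0.1))
)

# Standardize every column to mean 0 and sd 1 before clustering
x_scaled <- scale(x)
km <- kmeans(x_scaled, centers = 2, nstart = 25)

# Project the points and the cluster centres onto the first two PCs
pca        <- prcomp(x_scaled)
scores     <- pca$x[, 1:2]
centers_2d <- predict(pca, km$centers)[, 1:2]

# Points coloured by cluster, centres drawn as large crosses
plot(scores, col = km$cluster, pch = 16)
points(centers_2d, pch = 4, cex = 2, lwd = 3)
```

For the wine data you would replace `x` with `express`; whether the result shows any real structure is a separate question.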

RockScience

Do not forget to preprocess your data carefully!

In the image you showed above, the result was completely dominated by the tsd attribute. All other attributes were essentially not taken into account! (The fsd attribute had a minor effect; the others were dwarfed.)
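
One way to check for this kind of dominance in R is to compare the per-column variances and look at the loadings of the first principal component. This is only a sketch on invented data: the columns `tsd_like`, `fsd_like` and `ch_like` and their scales are made up to mimic one large-range attribute next to smaller ones, and are not the actual wine values.

```r
set.seed(7)
# Made-up columns on very different scales (not the real wine data)
x <- data.frame(
  tsd_like = rnorm(500, mean = 115,  sd = 56),
  fsd_like = rnorm(500, mean = 30,   sd = 17),
  ch_like  = rnorm(500, mean = 0.05, sd = 0.03)
)

# Share of the total variance contributed by each column
variances <- sapply(x, var)
share     <- round(variances / sum(variances), 3)
print(share)

# Loadings of the first principal component on the unscaled data
pc1 <- prcomp(x)$rotation[, 1]
print(round(pc1, 3))
```

If one column carries almost all of the variance and PC1 points almost entirely along it, unscaled k-means will mostly be cutting that one column into slices.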

The data set does not appear to cluster well.

This is the best result I could get: [image not preserved]

One may argue that there are two types in this data set. But they are not well separated. It may as well be an oddly shaped single cluster.

In particular, the way the data is split changes a lot depending on how you preprocess and scale your data. That indicates the results are not stable.
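
A simple way to see this instability for yourself is to cross-tabulate the cluster labels you get with and without scaling. The sketch below uses invented data (the columns `big` and `small` are hypothetical), so it only illustrates the check, not the wine result:

```r
set.seed(3)
# Made-up data: the only real group structure is in the small-scale column
x <- data.frame(
  big   = rnorm(300, mean = 100, sd = 50),
  small = c(rnorm(150, 0, 0.1), rnorm(150, 1, 0.1))
)

km_raw    <- kmeans(x,        centers = 2, nstart = 25)
km_scaled <- kmeans(scale(x), centers = 2, nstart = 25)

# Rows: clusters from raw data; columns: clusters from scaled data
tab <- table(raw = km_raw$cluster, scaled = km_scaled$cluster)
print(tab)

# Fraction of points on which the two partitions agree (labels matched either way)
agreement <- max(sum(diag(tab)), sum(tab) - sum(diag(tab))) / sum(tab)
print(agreement)
```

An agreement near 0.5 with two clusters means the two partitions are essentially unrelated, that is, the preprocessing rather than the data decided the split.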

Has QUIT--Anony-Mousse
  • I will be applying PCA and other data preprocessing techniques to this dataset now to see how it changes. Can you **please** share how you created the above plot, so that I can plot it before and after my preprocessing techniques to see how it splits? Also, how did you determine that the `tsd` attribute was overpowering the data? Thanks! – Anthony Apr 02 '15 at 14:10
  • I don't use R, so I cannot share R code with you. I plotted the `tsd` attribute, and the k-means clusters would be orthogonal slices, as in your plot above. – Has QUIT--Anony-Mousse Apr 02 '15 at 14:12
  • Did you use matlab or scipy? I'm looking for a way to visually see how well the data splits before/after I apply preprocessing to it using PCA, ICA, etc. I don't mind switching tools to achieve what I'm after. – Anthony Apr 02 '15 at 14:14
  • So, in the plot you created only the tsd attribute was taken into account? – Anthony Apr 02 '15 at 14:15
  • No. But when I did *not* preprocess the data, and I chose x axis = tsd, it looked like your plot, with the data being split according to the x axis. – Has QUIT--Anony-Mousse Apr 02 '15 at 17:31