I am practicing K-Means clustering using the sklearn package. I am working with a sample shopping dataset that records how much each customer spent in each item category (e.g., food, fashion, digital).

There are 42 features, one per item category, which I feed into K-Means. When I computed the silhouette coefficient for k from 2 to 49, the result looked like this:

Result

For n_clusters=2, The Silhouette Coefficient is 0.296883351294 
For n_clusters=3, The Silhouette Coefficient is 0.429716008727
For n_clusters=4, The Silhouette Coefficient is 0.5379833453
For n_clusters=5, The Silhouette Coefficient is 0.640200087198
For n_clusters=6, The Silhouette Coefficient is 0.720988889121
For n_clusters=7, The Silhouette Coefficient is 0.754509135746
For n_clusters=8, The Silhouette Coefficient is 0.824498184042
For n_clusters=9, The Silhouette Coefficient is 0.859505132529
For n_clusters=10, The Silhouette Coefficient is 0.886719390512
For n_clusters=11, The Silhouette Coefficient is 0.909094073152
For n_clusters=12, The Silhouette Coefficient is 0.924484657787
For n_clusters=13, The Silhouette Coefficient is 0.935920328988
For n_clusters=14, The Silhouette Coefficient is 0.941202266924
For n_clusters=15, The Silhouette Coefficient is 0.944696312832
For n_clusters=16, The Silhouette Coefficient is 0.94973283735
For n_clusters=17, The Silhouette Coefficient is 0.953130541493
For n_clusters=18, The Silhouette Coefficient is 0.956455183621
For n_clusters=19, The Silhouette Coefficient is 0.959253033224
For n_clusters=20, The Silhouette Coefficient is 0.962360042108
For n_clusters=21, The Silhouette Coefficient is 0.964250208432
For n_clusters=22, The Silhouette Coefficient is 0.967326417612
For n_clusters=23, The Silhouette Coefficient is 0.969331109452
For n_clusters=24, The Silhouette Coefficient is 0.971127562002
For n_clusters=25, The Silhouette Coefficient is 0.972261973972
For n_clusters=26, The Silhouette Coefficient is 0.9734445716
For n_clusters=27, The Silhouette Coefficient is 0.974238560202
For n_clusters=28, The Silhouette Coefficient is 0.97488260729
For n_clusters=29, The Silhouette Coefficient is 0.97531193231
For n_clusters=30, The Silhouette Coefficient is 0.974524792419
For n_clusters=31, The Silhouette Coefficient is 0.975612314038
For n_clusters=32, The Silhouette Coefficient is 0.975737449165
For n_clusters=33, The Silhouette Coefficient is 0.976396323376
For n_clusters=34, The Silhouette Coefficient is 0.977655049988
For n_clusters=35, The Silhouette Coefficient is 0.977653124893
For n_clusters=36, The Silhouette Coefficient is 0.977692656935
For n_clusters=37, The Silhouette Coefficient is 0.977631627533
For n_clusters=38, The Silhouette Coefficient is 0.978547753839
For n_clusters=39, The Silhouette Coefficient is 0.978886776953
For n_clusters=40, The Silhouette Coefficient is 0.979381767137
For n_clusters=41, The Silhouette Coefficient is 0.9796349521
For n_clusters=42, The Silhouette Coefficient is 0.979461929477
For n_clusters=43, The Silhouette Coefficient is 0.980920963377
For n_clusters=44, The Silhouette Coefficient is 0.980129624336
For n_clusters=45, The Silhouette Coefficient is 0.981374785468
For n_clusters=46, The Silhouette Coefficient is 0.980656482976
For n_clusters=47, The Silhouette Coefficient is 0.982323770297
For n_clusters=48, The Silhouette Coefficient is 0.982538183341
For n_clusters=49, The Silhouette Coefficient is 0.982842003856

I don't know how to make use of this result. It seems that the silhouette coefficient keeps increasing as k grows. Am I doing this right, or should I try a different cluster evaluation method?

Vadim Yangunaev
2D_

1 Answer

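The silhouette of a point measures how similar the point is to its own cluster versus the next-closest cluster. In the standard definition it is s = (b - a) / max(a, b), where a is the mean distance from the point to the other members of its cluster and b is the mean distance to the members of the nearest other cluster, so "1" is a perfect match to its cluster and "-1" a perfect mismatch.

(Note: some descriptions use distances to cluster centers instead; that simplification is particular to centroid-based methods such as k-means.)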
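To make the per-point number concrete, here is a minimal sketch that computes it by hand from mean pairwise distances and compares the result with sklearn's `silhouette_samples`. The four data points and the helper name `silhouette_of_point` are made up for illustration; they are not from the question's dataset.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made data set: two tight groups on a line (illustrative values only).
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])


def silhouette_of_point(X, labels, i):
    """Silhouette of sample i, using mean pairwise Euclidean distances."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    n_own = own.sum()
    if n_own == 1:
        return 0.0  # usual convention for a cluster with a single member
    # a: mean distance to the other members of the same cluster
    # (the point's distance to itself is 0, so divide by n_own - 1)
    a = dists[own].sum() / (n_own - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)


manual = np.array([silhouette_of_point(X, labels, i) for i in range(len(X))])
reference = silhouette_samples(X, labels)

print(manual)
print(reference)
```

For the first point, a = 1 and b = (10 + 11) / 2 = 10.5, so its silhouette is 9.5 / 10.5, close to 1 because the other group is far away.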

The silhouette of a cluster is the average silhouette of all of its members. What this means in practice is that a larger number means that the cluster is "separated" from the other clusters.
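As a rough sketch of that averaging, the per-sample values from `silhouette_samples` can be grouped by cluster label. The data here is synthetic (`make_blobs` with arbitrary parameters), standing in for the shopping data, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in data (assumption): four blobs in two dimensions.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# One silhouette value per sample.
sample_sil = silhouette_samples(X, labels)

# Silhouette of a cluster: the mean over that cluster's members.
cluster_sil = {
    int(c): float(sample_sil[labels == c].mean()) for c in np.unique(labels)
}

# The single number reported per n_clusters is the mean over all samples.
overall = silhouette_score(X, labels)

print(cluster_sil)
print(overall)
```

The overall coefficient printed for each `n_clusters` is the mean over all samples, so it can hide clusters whose own average is much lower or higher.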

I think of silhouettes as measuring the density of points along the boundary of a cluster. When the silhouette is high, then the boundary has very few points. That is what you want -- well separated clusters.

When using k-means, small "outlier" clusters would typically have large silhouettes. Often the larger clusters have dense boundaries. It would be interesting for you to look at the size as well as the silhouette.
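One way to look at size and silhouette together is sketched below. It assumes synthetic data with two large overlapping groups and one small, distant group; the helper name `cluster_report` and all parameters are hypothetical choices for illustration, not part of the original question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in data (assumption): two large, overlapping groups
# and one small group far away from both.
X, _ = make_blobs(
    n_samples=[150, 150, 6],
    centers=[[0.0, 0.0], [4.0, 0.0], [40.0, 40.0]],
    cluster_std=[1.5, 1.5, 0.3],
    random_state=0,
)


def cluster_report(X, k):
    """Return (cluster id, size, mean silhouette) for each k-means cluster."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_samples(X, labels)
    rows = []
    for c in np.unique(labels):
        members = labels == c
        rows.append((int(c), int(members.sum()), float(sil[members].mean())))
    return rows


for k in (3, 6):
    print(f"k={k}")
    # Sort by size so small clusters are listed first.
    for c, size, s in sorted(cluster_report(X, k), key=lambda r: r[1]):
        print(f"  cluster {c}: size={size:4d}  mean silhouette={s:.3f}")
```

Reading the sizes next to the silhouettes helps show whether a high overall score comes from many tiny clusters rather than from a useful partition. Note that sklearn assigns a silhouette of 0 to a cluster with a single member.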

Gordon Linoff
  • Thank you. So for the result I got, 49-clusters is better than 2-clusters. And this means that with 49-cluster, it is more separated from other clusters. Am I correct? – 2D_ Jun 18 '17 at 07:09
  • @2D_ . . . Well, you have to evaluate the clusters in different ways. If you have a separate cluster for each point, then I think the silhouettes will look pretty good (I'm not 100% sure what happens in the degenerate case). More important: are the clusters useful? – Gordon Linoff Jun 18 '17 at 12:59
  • You are right. I think you may be right. I certainly do not want too many clusters. I will look into the clusters and will determine what number makes the most sense. Thanks! – 2D_ Jun 19 '17 at 14:23