
Let's say I'm examining up to 10 clusters, with scipy I usually generate the 'elbow' plot as follows:

from scipy import cluster
from matplotlib import pyplot

cluster_array = [cluster.vq.kmeans(my_matrix, i) for i in range(1, 10)]

pyplot.plot([var for (cent,var) in cluster_array])
pyplot.show()

I have since become motivated to use sklearn for clustering; however, I'm not sure how to create the array needed to plot as in the scipy case. My best guess was:

from sklearn.cluster import KMeans

km = [KMeans(n_clusters=i) for i range(1,10)]
cluster_array = [km[i].fit(my_matrix)]

That unfortunately resulted in an invalid command error. What is the best sklearn way to go about this?

Thank you

Arash Howaida

3 Answers


You can use the inertia_ attribute of the KMeans class.

Assuming X is your dataset:

from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

X = # <your_data>
distortions = []
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)

fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distortions)
plt.grid(True)
plt.title('Elbow curve')
plt.show()
Ahmed Besbes

You had some syntax problems in the code. They should be fixed now:

from sklearn.cluster import KMeans

Ks = range(1, 10)
km = [KMeans(n_clusters=i) for i in Ks]
score = [km[i].fit(my_matrix).score(my_matrix) for i in range(len(km))]

The fit method just returns the estimator itself (self). So in this line from the original code

cluster_array = [km[i].fit(my_matrix)]

the cluster_array would end up having the same contents as km.

You can use the score method to get an estimate of how well the clustering fits. To see the score for each number of clusters, simply run plt.plot(Ks, score).
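Putting the pieces together, here is a runnable sketch of this approach. Since my_matrix isn't shown in the question, random data stands in for it; note that KMeans.score returns the negative inertia, so the curve rises toward zero as k grows:

```python
import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
my_matrix = rng.normal(size=(200, 4))  # stand-in for the question's data

Ks = range(1, 10)
km = [KMeans(n_clusters=i, n_init=10) for i in Ks]
score = [km[i].fit(my_matrix).score(my_matrix) for i in range(len(km))]

# score is the negative inertia, so higher (closer to 0) means a tighter fit
plt.plot(Ks, score)
plt.xlabel('number of clusters')
plt.ylabel('score (negative inertia)')
plt.show()
```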

J. P. Petersen

You can also use the Euclidean distance between each data point and its nearest cluster center to evaluate how many clusters to choose. Here is a code example.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = load_iris()
x = iris.data

res = list()
n_cluster = range(2,20)
for n in n_cluster:
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(x)
    res.append(np.average(np.min(cdist(x, kmeans.cluster_centers_, 'euclidean'), axis=1)))

plt.plot(n_cluster, res)
plt.title('elbow curve')
plt.show()
lugq
Other answers have used the kmeans.inertia_ attribute of the sklearn KMeans object to measure how good the fit is. The sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) states: "inertia_: Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided." So that is pretty much the same as the calculation suggested here, but it will obviously be much quicker, as I'm guessing it is already calculated. – gnoodle Jan 03 '22 at 13:56
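To make the relationship in that comment concrete, a small sketch (reusing the iris setup from this answer): inertia_ is the sum of *squared* distances to the closest center, whereas the loop above plots the *average* unsquared distance, so the two curves have different scales but a similar elbow.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

x = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)

# distance from each point to its closest cluster center
d = np.min(cdist(x, kmeans.cluster_centers_, 'euclidean'), axis=1)

# inertia_ matches the sum of squared closest-center distances
assert np.isclose(kmeans.inertia_, np.sum(d ** 2))

# the cdist-based curve above instead uses the mean unsquared distance
avg_dist = np.mean(d)
```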