
Let's say I'm examining up to 10 clusters, with scipy I usually generate the 'elbow' plot as follows:

from scipy import cluster
from matplotlib import pyplot

cluster_array = [cluster.vq.kmeans(my_matrix, i) for i in range(1, 10)]

pyplot.plot([var for (cent,var) in cluster_array])
pyplot.show()

I have since become motivated to use sklearn for clustering; however, I'm not sure how to create the array needed to plot as in the scipy case. My best guess was:

from sklearn.cluster import KMeans

km = [KMeans(n_clusters=i) for i range(1,10)]
cluster_array = [km[i].fit(my_matrix)]

That unfortunately resulted in an invalid command error. What is the best sklearn way to go about this?

Thank you

Arash Howaida

3 Answers


You can use the inertia_ attribute of the KMeans class.

Assuming X is your dataset:

from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

X = # <your_data>
distortions = []
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)

fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distortions)
plt.grid(True)
plt.title('Elbow curve')
plt.show()
Ahmed Besbes

You had some syntax problems in the code. They should be fixed now:

from sklearn.cluster import KMeans

Ks = range(1, 10)
km = [KMeans(n_clusters=i) for i in Ks]
score = [km[i].fit(my_matrix).score(my_matrix) for i in range(len(km))]

The fit method just returns the estimator itself (self). So in this line from the original code

cluster_array = [km[i].fit(my_matrix)]

the cluster_array would end up having the same contents as km.

You can use the score method to get an estimate of how well the clustering fits. To see the score for each number of clusters, simply run plt.plot(Ks, score).
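Putting the pieces together, here is a runnable sketch of this approach. Since my_matrix isn't shown in the question, random data stands in for it; note that KMeans.score returns the negative inertia, so the curve rises toward zero as k grows:

```python
import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
my_matrix = rng.normal(size=(200, 4))  # stand-in for the question's data

Ks = range(1, 10)
km = [KMeans(n_clusters=i, n_init=10) for i in Ks]
score = [km[i].fit(my_matrix).score(my_matrix) for i in range(len(km))]

# score is the negative inertia, so higher (closer to 0) means a tighter fit
plt.plot(Ks, score)
plt.xlabel('number of clusters')
plt.ylabel('score (negative inertia)')
plt.show()
```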

J. P. Petersen

You can also use the Euclidean distance between each data point and its nearest cluster center to evaluate how many clusters to choose. Here is a code example.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = load_iris()
x = iris.data

res = list()
n_cluster = range(2,20)
for n in n_cluster:
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(x)
    res.append(np.average(np.min(cdist(x, kmeans.cluster_centers_, 'euclidean'), axis=1)))

plt.plot(n_cluster, res)
plt.title('elbow curve')
plt.show()
lugq
Other answers have used the kmeans.inertia_ attribute of the sklearn KMeans object to measure how good the fit is. The sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) states: "inertia_: Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided." So that is pretty much the same as the calculation suggested here, but it will obviously be much quicker, as I'm guessing it is already calculated. – gnoodle Jan 03 '22 at 13:56
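To make the relationship in that comment concrete, a small sketch (reusing the iris setup from this answer): inertia_ is the sum of *squared* distances to the closest center, whereas the loop above plots the *average* unsquared distance, so the two curves have different scales but a similar elbow.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

x = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)

# distance from each point to its closest cluster center
d = np.min(cdist(x, kmeans.cluster_centers_, 'euclidean'), axis=1)

# inertia_ matches the sum of squared closest-center distances
assert np.isclose(kmeans.inertia_, np.sum(d ** 2))

# the cdist-based curve above instead uses the mean unsquared distance
avg_dist = np.mean(d)
```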