I have the following code for the elbow method, which I use to manually (and therefore possibly incorrectly) select the optimal number of clusters when running K-modes clustering on a binary DataFrame:

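import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

k_values = list(range(1, 10))
cost = []
for num_clusters in k_values:
    kmode = KModes(n_clusters=num_clusters, init="Huang", n_init=10)
    kmode.fit_predict(newdf_matrix)  # newdf_matrix: the binary data to cluster
    cost.append(kmode.cost_)  # clustering cost of the fitted model for this k

# Plot cost against k to get the elbow curve
plt.plot(k_values, cost)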

The loop produces a plot of the so-called elbow curve. I know this curve helps me choose an optimal K, but I do not want to read it off myself. I am looking for a computational way, so that the computer picks K without me determining it "manually". Otherwise the whole script stops executing at that point.

What code would select K automatically and replace my manual selection? Thank you.

Mr.Slow
  • How about choosing k as the point where the cumulative sum of the elbow curve's y-axis values (error/accuracy) reaches a threshold? Or look at the change in the y-axis value each time k is incremented, and if that change falls below a threshold, that's your k. – Sachin Kohli Sep 27 '22 at 05:14
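
The second idea in the comment above could be sketched roughly like this. It is only a minimal illustration: the function name `pick_k_by_cost_drop`, the `min_rel_drop` parameter and its 10% default, and the cost values are all hypothetical and not from the question. In practice `costs` would be the list of `kmode.cost_` values collected in the loop.

```python
def pick_k_by_cost_drop(ks, costs, min_rel_drop=0.10):
    """Return the first k for which adding one more cluster lowers the
    cost by less than `min_rel_drop` (as a fraction of the current cost).

    Falls back to the largest k tried if the cost keeps dropping quickly.
    """
    if len(ks) != len(costs) or len(ks) < 2:
        raise ValueError("need at least two (k, cost) pairs of equal length")
    for i in range(len(ks) - 1):
        if costs[i] <= 0:
            # Cost cannot improve any further
            return ks[i]
        rel_drop = (costs[i] - costs[i + 1]) / costs[i]
        if rel_drop < min_rel_drop:
            return ks[i]
    return ks[-1]


# Made-up cost values, for illustration only
ks = list(range(1, 10))
costs = [1000, 700, 520, 480, 465, 455, 450, 446, 443]

best_k = pick_k_by_cost_drop(ks, costs)
print(f"Best K: {best_k}")
```

The threshold is a judgment call: a larger `min_rel_drop` stops earlier and gives a smaller k, so it is worth checking the result against the plotted curve at least once.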

1 Answer

Use the silhouette coefficient [this will not work if the data points are represented as categorical values rather than N-d points]

The silhouette coefficient measures how similar a data point is to its own cluster compared with other clusters. See the scikit-learn documentation for `silhouette_score`.

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
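
For reference, the per-sample coefficient behind those values can be written as follows, where a(i) is the mean distance from sample i to the other points in its own cluster and b(i) is the mean distance from sample i to the points of the nearest other cluster. The symbols a, b, s and N are notation chosen here, not taken from the answer.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
\text{silhouette score} = \frac{1}{N} \sum_{i=1}^{N} s(i)
```

When a(i) is much smaller than b(i), s(i) approaches 1. When a(i) exceeds b(i), the sample is on average closer to another cluster and s(i) is negative.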

So calculate `silhouette_score` for different values of k and use the k with the best score (closest to 1).

Sample using the digits dataset:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

data, labels = load_digits(return_X_y=True)

k_values = list(range(2, 20))
silhouette_avg = []
for num_clusters in k_values:
    kmeans = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10)
    kmeans.fit_predict(data)
    # Mean silhouette coefficient over all samples for this k
    silhouette_avg.append(silhouette_score(data, kmeans.labels_))

plt.plot(k_values, silhouette_avg, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
_ = plt.xticks(k_values)

# argmax gives a position in the list, so map it back to the matching k
print(f"Best K: {k_values[np.argmax(silhouette_avg)]}")

output:

Best K: 9


mujjiga
  • Thank you for that. I am working with K-modes, not K-means, so I am not sure whether silhouette analysis works in that case, since it uses the means. – Mr.Slow Sep 27 '22 at 06:53
  • @VáclavPech If your data is represented as categorical values rather than N-d points, then yes, you will not be able to calculate the `silhouette_score`. How about one-hot encoding the categorical values? – mujjiga Sep 27 '22 at 07:33
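
For data that is already binary (or one-hot encoded), one option not spelled out in the thread is to compute the silhouette score with Hamming distance, which `silhouette_score` accepts through its `metric` argument. The sketch below is an illustration under assumptions: the helper `best_k_by_hamming_silhouette`, the toy matrix and the hand-written label arrays are all hypothetical. In real use, each label array would come from something like `KModes(n_clusters=k).fit_predict(X)`.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def best_k_by_hamming_silhouette(X, labels_by_k):
    """Pick the k whose labels give the highest Hamming-distance silhouette.

    labels_by_k maps each candidate k to the cluster labels produced for it.
    """
    scores = {}
    for k, labels in labels_by_k.items():
        n_found = len(np.unique(labels))
        # The silhouette score needs between 2 and n_samples - 1 clusters
        if 2 <= n_found < len(labels):
            scores[k] = silhouette_score(X, labels, metric="hamming")
    if not scores:
        raise ValueError("no candidate k produced a valid clustering")
    return max(scores, key=scores.get), scores


# Toy binary data with two visible groups (illustration only)
X = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
])

# Hand-written labels standing in for the output of a clustering run
labels_by_k = {
    2: np.array([0, 0, 0, 0, 1, 1, 1, 1]),
    3: np.array([0, 0, 1, 1, 2, 2, 2, 2]),
}

best_k, scores = best_k_by_hamming_silhouette(X, labels_by_k)
print(f"Best K: {best_k}")
```

Hamming distance counts the fraction of positions where two rows differ. That is in the same spirit as the matching dissimilarity K-modes works with, but whether it is the right measure for a given dataset is a judgment call.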