I'd like to use the silhouette score in my script to automatically determine the number of clusters for k-means clustering with sklearn.

import numpy as np
import pandas as pd
import csv
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

filename = "CSV_BIG.csv"

# Read the CSV file with the Pandas lib.
path_dir = ".\\"
dataframe = pd.read_csv(path_dir + filename, encoding = "utf-8", sep = ';' ) # "ISO-8859-1")
df = dataframe.copy(deep=True)

#Use silhouette score
range_n_clusters = list (range(2,10))
print ("Number of clusters from 2 to 9: \n", range_n_clusters)

for n_clusters in range_n_clusters:
    clusterer = KMeans (n_clusters=n_clusters).fit(?)
    preds = clusterer.predict(?)
    centers = clusterer.cluster_centers_

    score = silhouette_score (?, preds, metric='euclidean')
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

Can someone help me with the question marks? I don't understand what to put in their place. I took the code from an example. The commented part is the previous version, where I do k-means clustering with a fixed number of clusters set to 4. The code is correct that way, but in my project I need to choose the number of clusters automatically.

Felipe Augusto
Jessica Martini
    Unfortunately, the silhouette score has a big problem with single-cluster data sets, because the metric is not designed for single-cluster problems. If your problem is still open, you can try [this](https://github.com/NaegleLab/OpenEnsembles) – mostafa yari Jun 18 '19 at 06:57

2 Answers

I assume you want to use the silhouette score to find the optimal number of clusters.

First create a separate KMeans object for each candidate number of clusters, then call its fit_predict function on your data df, like this:

for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(df)
    centers = clusterer.cluster_centers_

    score = silhouette_score(df, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

See this official example for more clarity.
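
To go from printing the scores to actually choosing the number of clusters, one option is to store each score and take the k with the highest one. The following is only a sketch under assumptions: it uses synthetic data from `make_blobs` as a stand-in for the CSV data in the question, and the names `scores` and `best_k` as well as the `n_init` and `random_state` settings are illustrative choices, not part of the original code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the question's data (assumption: numeric features only).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Map each candidate number of clusters to its silhouette score.
scores = {}
for n_clusters in range(2, 10):
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    preds = clusterer.fit_predict(X)
    scores[n_clusters] = silhouette_score(X, preds)

# The silhouette score lies in [-1, 1]; higher means better-separated clusters.
best_k = max(scores, key=scores.get)
print("Best number of clusters by silhouette score:", best_k)
```

The range starts at 2 because the silhouette score needs at least two clusters to be computed.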

Noki
Gambit1614

Each ? should be replaced with the data set or DataFrame that you are applying k-means to.
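
As a rough illustration, here is the loop body from the question with the same data passed in all three places. The small DataFrame is a hypothetical stand-in for the one read from the CSV file, and selecting only the numeric columns is an assumption about that file, since KMeans expects numeric features.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the DataFrame read from the CSV file.
df = pd.DataFrame({
    "x": [1.0, 1.2, 0.8, 8.0, 8.3, 7.9],
    "y": [1.1, 0.9, 1.0, 8.1, 7.8, 8.2],
    "label": ["a", "b", "c", "d", "e", "f"],  # non-numeric column
})

# Assumption: keep only the numeric columns before clustering.
X = df.select_dtypes(include="number")

# The same data X replaces every question mark.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
preds = clusterer.predict(X)
score = silhouette_score(X, preds, metric="euclidean")
print("For n_clusters = 2, silhouette score is {}".format(score))
```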

    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 22 '22 at 05:58