I would like to find the optimal number of clusters for a clustering algorithm using silhouette scoring and a precomputed distance matrix. In the example below I am using AgglomerativeClustering (but I might want to use a different clustering algorithm in the future).
import numpy as np
from scipy import stats
from sklearn import cluster, metrics, model_selection

# define some clustering model
agglomerative_clustering = cluster.AgglomerativeClustering(affinity="precomputed")

def _silhouette_scoring(clustering_model, distances):
    return metrics.silhouette_score(distances, clustering_model.labels_, metric="precomputed")

# define distributions over the parameters to optimize;
# `distances` is my precomputed (n, n) distance matrix, defined earlier
n, _ = distances.shape
param_distributions = {"n_clusters": stats.randint(low=2, high=n),  # silhouette needs >= 2 clusters
                      "linkage": ["complete", "average"]}

prng = np.random.RandomState(42)
parameter_sampler = model_selection.ParameterSampler(param_distributions, n_iter=100, random_state=prng)

optimal_params = None
optimal_params_score = -np.inf
for sampled_params in parameter_sampler:
    agglomerative_clustering = cluster.AgglomerativeClustering(affinity="precomputed", **sampled_params)
    agglomerative_clustering.fit(distances)
    sampled_params_score = _silhouette_scoring(agglomerative_clustering, distances)
    if sampled_params_score > optimal_params_score:
        optimal_params, optimal_params_score = sampled_params, sampled_params_score
Running the above code works, but choosing the optimal number of clusters feels like a common enough task that there should be some way to do it within sklearn.model_selection using RandomizedSearchCV, GridSearchCV, or similar. How can this be done?
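For concreteness, here is a sketch of the kind of thing I imagine (not verified on every sklearn version): a custom scorer with the (estimator, X, y) signature that *SearchCV expects, plus a single trivial CV "split" that uses all rows for both fitting and scoring, so the square distance matrix is never partitioned into rectangular pieces. The toy distances matrix here is only a stand-in for my real one.

```python
import numpy as np
from scipy import stats
from sklearn import cluster, metrics, model_selection

# toy stand-in for the precomputed distance matrix (illustrative only)
rng = np.random.RandomState(42)
points = rng.rand(50, 2)
distances = metrics.pairwise_distances(points)
n = distances.shape[0]

# the parameter name changed across sklearn versions:
# `metric="precomputed"` (newer) vs `affinity="precomputed"` (older)
try:
    base_model = cluster.AgglomerativeClustering(metric="precomputed")
except TypeError:
    base_model = cluster.AgglomerativeClustering(affinity="precomputed")

def silhouette_scorer(estimator, X, y=None):
    # *SearchCV scorers take (estimator, X, y); the estimator has already
    # been fitted by the time the scorer is called, so labels_ is available
    return metrics.silhouette_score(X, estimator.labels_, metric="precomputed")

search = model_selection.RandomizedSearchCV(
    base_model,
    param_distributions={"n_clusters": stats.randint(low=2, high=n),
                         "linkage": ["complete", "average"]},
    scoring=silhouette_scorer,
    n_iter=20,
    # one "split" using every row for both fitting and scoring, so the
    # square distance matrix stays square on both sides of the split
    cv=[(np.arange(n), np.arange(n))],
    random_state=0,
)
search.fit(distances)
print(search.best_params_, search.best_score_)
```

The cv=[(train, test)] trick disables cross-validation entirely, which seems appropriate here since a clustering model has no meaningful held-out prediction anyway.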