
I would like to find the optimal number of clusters for a clustering algorithm using silhouette scoring and a pre-computed distance matrix. In the example below I am using AgglomerativeClustering (but I might want to use a different clustering algorithm in the future).

import numpy as np
from scipy import stats
from sklearn import cluster, metrics, model_selection


# define some clustering model (newer scikit-learn releases rename `affinity` to `metric`)
agglomerative_clustering = cluster.AgglomerativeClustering(affinity="precomputed")

def _silhouette_scoring(clustering_model, distances):
    return metrics.silhouette_score(distances, clustering_model.labels_, metric="precomputed")

# define distributions over parameters to optimize
# (silhouette_score needs between 2 and n - 1 clusters; randint's `high` is exclusive)
n, _ = distances.shape
param_distributions = {'n_clusters': stats.randint(low=2, high=n),
                       'linkage': ["complete", "average"]}

prng = np.random.RandomState(42)
parameter_sampler = model_selection.ParameterSampler(param_distributions, n_iter=100, random_state=prng)

optimal_params = None
optimal_params_score = -np.inf

for sampled_params in parameter_sampler:
    agglomerative_clustering = cluster.AgglomerativeClustering(affinity="precomputed", **sampled_params)
    agglomerative_clustering.fit(distances)
    sampled_params_score = _silhouette_scoring(agglomerative_clustering, distances)

    if sampled_params_score > optimal_params_score:
        optimal_params, optimal_params_score = sampled_params, sampled_params_score

Running the code above works, but choosing the optimal number of clusters seems like a common enough task that there should be a way to do it within sklearn.model_selection, using RandomizedSearchCV, GridSearchCV, or similar. How can this be done?
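For reference, here is a minimal self-contained sketch of the same search done exhaustively with model_selection.ParameterGrid instead of random sampling, which is feasible because n_clusters can only range from 2 to n - 1. The toy blobs, the helper name best_params_by_silhouette, and the check for whether the installed scikit-learn expects `metric` or `affinity` are assumptions made for illustration; this is still a manual loop, not the built-in search I am asking about.

```python
import inspect

import numpy as np
from sklearn import cluster, metrics, model_selection


def best_params_by_silhouette(distances, param_grid):
    """Return (params, score) with the highest silhouette over an exhaustive grid."""
    # Assumption: newer scikit-learn releases call this argument `metric`,
    # older ones call it `affinity`; pick whichever the installed version has.
    init_params = inspect.signature(cluster.AgglomerativeClustering).parameters
    distance_kwarg = "metric" if "metric" in init_params else "affinity"

    best_params, best_score = None, -np.inf
    for params in model_selection.ParameterGrid(param_grid):
        model = cluster.AgglomerativeClustering(
            **{distance_kwarg: "precomputed"}, **params
        )
        labels = model.fit_predict(distances)
        score = metrics.silhouette_score(distances, labels, metric="precomputed")
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical toy data: three well-separated 2-D blobs, converted to distances.
rng = np.random.RandomState(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([center + 0.1 * rng.randn(10, 2) for center in centers])
distances = metrics.pairwise_distances(points)

n = distances.shape[0]
param_grid = {
    "n_clusters": list(range(2, n)),  # silhouette is defined for 2..n-1 clusters
    "linkage": ["complete", "average"],
}
best_params, best_score = best_params_by_silhouette(distances, param_grid)
print(best_params, best_score)
```

Swapping in a different clustering algorithm should only require changing the constructor call, as long as the estimator accepts a precomputed distance matrix and provides fit_predict.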

davidrpugh
  • Not implemented yet. There are a couple of open issues and a PR on GitHub for that. See [this issue](https://github.com/scikit-learn/scikit-learn/issues/6154), and [this PR](https://github.com/scikit-learn/scikit-learn/pull/6160). – Qusai Alothman Sep 09 '18 at 00:34

0 Answers