I am using grid search with the silhouette score as the metric, but for some algorithms (e.g. DBSCAN) it selects parameters that produce a single cluster, because that configuration has the highest score. For example, when clustering images with scikit-learn's DBSCAN at its default parameters, I got a silhouette score of -0.03 and 30+ well-defined clusters. After running the grid search, the silhouette score rose to about 0.123, but there was only 1 cluster. How can I best tune the hyperparameters of my clustering algorithms using grid search?
Update: Here is a snippet of my code. I based it on "Scikit Learn GridSearchCV without cross validation (unsupervised learning)".
This is the score function:
from sklearn import metrics

def cv_silhouette_scorer(estimator, X):
    estimator.fit(X)
    try:
        # Estimators such as DBSCAN expose labels_ after fitting
        cluster_labels = estimator.labels_
    except Exception as e:
        # print(e, estimator)
        cluster_labels = estimator.predict(X)
    num_labels = len(set(cluster_labels))
    num_samples = len(X.index)  # X is a pandas DataFrame
    if num_labels == 1 or num_labels == num_samples:
        # silhouette_score is undefined in these two cases
        return -1
    else:
        return metrics.silhouette_score(X, cluster_labels)
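One thing worth checking in the scorer above: DBSCAN marks noise points with the label -1, and `len(set(cluster_labels))` counts that label as if it were a cluster. A result with one real cluster plus noise therefore passes the `num_labels == 1` guard. Below is a minimal sketch of a variant that scores only the non-noise points, assuming this is part of the cause. The names `noise_aware_silhouette` and `min_clusters` are hypothetical, and the blob data is synthetic, for illustration only.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs


def noise_aware_silhouette(estimator, X, y=None, min_clusters=2):
    """Silhouette over non-noise points; -1.0 when too few real clusters."""
    X = np.asarray(X)
    labels = estimator.fit_predict(X)
    mask = labels != -1  # DBSCAN uses -1 for noise points
    n_clusters = len(set(labels[mask]))
    n_kept = int(mask.sum())
    # silhouette_score needs 2 <= n_clusters <= n_kept - 1
    if n_clusters < min_clusters or n_clusters >= n_kept:
        return -1.0
    return float(metrics.silhouette_score(X[mask], labels[mask]))


# Synthetic data: three tight, well-separated blobs (illustration only)
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# A reasonable eps separates the blobs; a huge eps merges everything
separated = noise_aware_silhouette(DBSCAN(eps=1.5, min_samples=5), X_demo)
merged = noise_aware_silhouette(DBSCAN(eps=100.0, min_samples=5), X_demo)
print(separated, merged)
```

Dropping noise before scoring has a cost of its own: a setting that labels most points as noise can still score well on the few points it keeps, so it is probably worth also checking the fraction of points kept.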
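This is the grid search function:
from sklearn.model_selection import GridSearchCV

def runGridSearch(estimator, params_dict, train_data):
    # Single "fold" that uses the full data set for both fitting and scoring
    cv = [(slice(None), slice(None))]
    gs = GridSearchCV(estimator=estimator, param_grid=params_dict, scoring=cv_silhouette_scorer, cv=cv, n_jobs=-1)
    gs.fit(train_data)
    try:
        predicted_labels = gs.best_estimator_.labels_
    except Exception:
        # Fall back to predict() for estimators without labels_
        predicted_labels = gs.predict(train_data)
    return predicted_labels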
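To see why a particular setting wins, it may help to record the number of clusters and the noise fraction next to the silhouette score for every parameter combination, rather than looking only at the best one. The following is a rough sketch using a plain loop over `ParameterGrid` instead of `GridSearchCV`. The helper name `sweep_clustering` and the synthetic blob data are assumptions for illustration, not part of my actual pipeline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid


def sweep_clustering(estimator, param_grid, X):
    """Fit one clone per parameter setting and record simple diagnostics."""
    X = np.asarray(X)
    rows = []
    for params in ParameterGrid(param_grid):
        labels = clone(estimator).set_params(**params).fit_predict(X)
        n_labels = len(set(labels))            # counts the noise label too
        n_clusters = len(set(labels) - {-1})   # real clusters only
        noise_fraction = float(np.mean(labels == -1))
        # silhouette_score is only defined for 2 <= n_labels <= n_samples - 1
        if 1 < n_labels < len(X):
            score = float(silhouette_score(X, labels))
        else:
            score = float("nan")
        rows.append({
            **params,
            "n_clusters": n_clusters,
            "noise_fraction": noise_fraction,
            "silhouette": score,
        })
    return rows


# Synthetic data: three tight, well-separated blobs (illustration only)
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

rows = sweep_clustering(
    DBSCAN(), {"eps": [0.3, 1.5, 100.0], "min_samples": [5]}, X_demo
)
for row in rows:
    print(row)
```

With a table like this, a setting that scores well only because it merges almost everything into one cluster, or discards most points as noise, is easy to spot and filter out before picking the best parameters.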