
I'm clustering a sample of about 100 unlabelled records and trying to use grid search to evaluate the clustering algorithm with various hyperparameters. I'm scoring with silhouette_score, which works fine.

My problem is that I don't need the cross-validation aspect of GridSearchCV/RandomizedSearchCV, but I can't find a simple GridSearch/RandomizedSearch. I could write my own, but the ParameterSampler and ParameterGrid objects are very useful.

My next step will be to subclass BaseSearchCV and implement my own _fit() method, but I thought it was worth asking whether there is a simpler way to do this, for example by passing something to the cv parameter?

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

def silhouette_score(estimator, X):
    clusters = estimator.fit_predict(X)
    score = metrics.silhouette_score(distance_matrix, clusters, metric='precomputed')
    return score

ca = KMeans()
param_grid = {"n_clusters": range(2, 11)}

# run grid search
search = GridSearchCV(
    ca,
    param_grid=param_grid,
    scoring=silhouette_score,
    cv= # can I pass something here to only use a single fold?
    )
search.fit(distance_matrix)
Jamie Bull
  • You don't do cross-validation (or grid-search) in *unsupervised* data mining. Just compute the 10 runs of k-means, and use the best. – Has QUIT--Anony-Mousse Jan 05 '16 at 12:16
  • Obviously you don't do cross-validation, but why not do grid search given an appropriate scoring metric such as silhouette score? – Jamie Bull Jan 05 '16 at 12:21
  • Also, kmeans is just an example here. I'd like to test a number of different algorithms and their hyperparameters. – Jamie Bull Jan 05 '16 at 12:22
  • You might as well optimize silhouette directly then. Don't expect the clustering result to really improve this way. In the end, you just look at which parameters agree best with Silhouette. It's just another criterion than SSE. – Has QUIT--Anony-Mousse Jan 05 '16 at 13:44
  • What would I use to do that without using one of the `BaseSearchCV` subclasses? Have I missed some feature for optimising hyperparameters, or do you mean write something specific for each algorithm? – Jamie Bull Jan 05 '16 at 13:48
  • I'm suggesting to directly search for the optimum silhouette solution, without using any clustering method. Naive enumeration won't work, but, say, evolutionary optimization or something like this may work. k-means does not optimize the silhouette, but that doesn't say there isn't an algorithm which does. – Has QUIT--Anony-Mousse Jan 05 '16 at 16:25
  • Ah, I see. I may want to add extra things to the scoring method though (preferred size of clusters, similarity of cluster sizes, etc.), so I'm really looking for a way of doing something a lot like grid search. Thanks for the suggestions though. – Jamie Bull Jan 05 '16 at 17:10
  • Please see if [this](https://stackoverflow.com/questions/44636370/scikit-learn-gridsearchcv-without-cross-validation-unsupervised-learning) answers your question. – Pushkar Nimkar Dec 18 '18 at 06:17

3 Answers


The clusteval library will help you to evaluate the data and find the optimal number of clusters. This library contains five methods that can be used to evaluate clusterings: silhouette, dbindex, derivative, dbscan and hdbscan.

pip install clusteval

Depending on your data, you can choose the appropriate evaluation method.

# Import library
from clusteval import clusteval

# Set parameters, as an example dbscan
ce = clusteval(method='dbscan')

# Fit to find optimal number of clusters using dbscan
results = ce.fit(X)

# Make plot of the cluster evaluation
ce.plot()

# Make scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(X)

# results is a dict with various output statistics. One of them is the cluster labels.
cluster_labels = results['labx']
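
If you want the silhouette criterion mentioned in the question, the same pattern should presumably work with the other evaluation names listed above; a minimal sketch, assuming the method keyword accepts 'silhouette' in the same way as 'dbscan':

# Hypothetical variation: evaluate with the silhouette method instead of dbscan
ce = clusteval(method='silhouette')
results = ce.fit(X)
cluster_labels = results['labx']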
erdogant
  • this is very cool - any idea how to fit this into a pipeline to optimise earlier stages, such as TFIDF etc? – dendog Jul 17 '20 at 14:27

OK, this might be an old question, but I use this kind of code:

First, we want to generate all the possible combinations of parameters:

def make_generator(parameters):
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p : parameters[p]
                    for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = pars
                temp_res[key_to_iterate] = val
                yield temp_res

Then create a loop out of this:

# add fixed parameters - here it's just a random one
fixed_params = {"max_iter": 300}

param_grid = {"n_clusters": range(2, 11)}

for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # Evaluate the clustering labels here and
    # decide whether to keep or discard this result!

Of course, this can be wrapped up in a tidy function; the code above is mostly an example.
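
For reference, scikit-learn's ParameterGrid (mentioned in the question) generates the same combinations, so the loop above could also be written roughly like this, reusing _data, fixed_params and param_grid from above and scoring each combination with, for example, the silhouette coefficient:

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.model_selection import ParameterGrid

for params in ParameterGrid(param_grid):
    params.update(fixed_params)           # merge in the fixed parameters
    ca = KMeans(**params)
    labels = ca.fit_predict(_data)
    # score this parameter combination, e.g. with the silhouette coefficient
    score = metrics.silhouette_score(_data, labels)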

Hope it helps someone!

Alexander B.

Recently I ran into a similar problem. I defined a custom iterable cv_custom which defines the splitting strategy and is the input for the cross-validation parameter cv. This iterable should contain one couple for each fold, with the samples identified by their indices, e.g. ([fold1_train_ids], [fold1_test_ids]), ([fold2_train_ids], [fold2_test_ids]), ... In our case, we need just one couple for one fold, with the indices of all examples in both the train and the test part: ([train_ids], [test_ids])

from sklearn.model_selection import cross_val_score

# a single split whose train and test parts both contain every sample
N = len(distance_matrix)
cv_custom = [(range(0, N), range(0, N))]
scores = cross_val_score(clf, X, y, cv=cv_custom)
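
The same single-split cv can also be passed straight to GridSearchCV, which is closer to what the question asks for; a rough sketch reusing the custom silhouette scorer and distance_matrix defined in the question:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

# one "fold" whose train and test parts both contain every sample
N = len(distance_matrix)
cv_custom = [(np.arange(N), np.arange(N))]

search = GridSearchCV(
    KMeans(),
    param_grid={"n_clusters": range(2, 11)},
    scoring=silhouette_score,  # the custom scorer defined in the question
    cv=cv_custom,
    )
search.fit(distance_matrix)
print(search.best_params_, search.best_score_)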
Jakub Macina