Evaluate clustering performance

Question

My raw data looks like:
```
df =

        long lat long lat long lat long lat 

    1   11   6   15   19  23   27  30   34
    2   12   7   16   20  24   28  31   35
    3   13   8   17   21  25   29  32   36
    ...
    96  14   9   18   22  26   30  33   37
```
Here, the first column (1, 2, 3, ..., 96) is the "taxi_id", which means we have 96 cars.

The other columns represent the locations of a car, read as consecutive (long, lat) pairs.

Example: the taxi with label 1 has the locations (11,6), (15,19), (23,27), (30,34).
So, I need to cluster them to see the most common trajectories used by these taxi drivers.

To do that, I calculated a distance matrix (using "some" distance measure), converted it to a similarity matrix, and passed the final matrix to Affinity Propagation.

Affinity Propagation code:

```
from sklearn.cluster import AffinityPropagation

af = AffinityPropagation(preference=-6).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

# Some code to calculate number of clusters (3 in this case)
# Some code to check which "taxi_id" related to clusters
```

And the final data looks like:

```
final_df =

               long    lat
        1      11      22
    0   2      33      44
        3      55      66
        ...    ...     ...
        45     12      13
    2   46     14      15
        47     16      17
```

I want to evaluate my clustering, but I do not know how. I did not predict anything, so how can I use the sklearn evaluation metrics? I can not even find a logic (what exactly to evaluate)? Maybe Distance between two clusters (CD)? Do you have any ideas or solution code how to proceed?

score 1 · Accepted Answer · answered Sep 11 '19 at 08:44

> I can not even find a logic (what exactly to evaluate)? Maybe Distance between two clusters (CD)?

You are on the right track. One approach is to measure the distances between the points within each cluster. The idea is to test this for different numbers of clusters; in your case you only have 3 clusters (0-2).

The silhouette score, for example, is one of these techniques.

https://en.wikipedia.org/wiki/Silhouette_(clustering)
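As a minimal sketch of how the silhouette score could be applied here: the question's distance measure is not stated, so the snippet below assumes plain Euclidean distance between flattened trajectories and uses randomly generated stand-in data (`X_raw` is hypothetical, not the asker's `df`). The negated squared distance serves as the similarity for Affinity Propagation, and the same distance matrix is passed to `silhouette_score` with `metric="precomputed"`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

# Hypothetical stand-in data: 96 taxis, 4 (long, lat) points each,
# flattened to 8 columns like the df in the question.
rng = np.random.default_rng(0)
centers = rng.uniform(0, 100, size=(3, 8))
X_raw = np.vstack([c + rng.normal(scale=2.0, size=(32, 8)) for c in centers])

# Assumption: Euclidean distance between flattened trajectories.
D = squareform(pdist(X_raw, metric="euclidean"))
S = -D ** 2  # similarity matrix: larger value means more alike

af = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
labels = af.labels_
n_clusters = len(set(labels))

# The silhouette score is only defined for 2 <= n_clusters <= n_samples - 1.
if 2 <= n_clusters < len(labels):
    score = silhouette_score(D, labels, metric="precomputed")
else:
    score = float("nan")

print("clusters:", n_clusters, "silhouette:", score)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping or poorly assigned points.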

> Do you have any ideas or solution code how to proceed?

There are a lot of solutions on Stack Overflow: How to use silhouette score in k-means clustering from sklearn library?

Another option for you could be the elbow method: Sklearn kmeans equivalent of elbow method

The question all these methods try to answer is: how many clusters should I pick? If you already know the number of clusters you want up front, they can still help you judge the risk and quality of the clusters.
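For illustration, the elbow method mentioned above might look roughly like this. Note the assumptions: it uses k-means rather than Affinity Propagation (the elbow curve relies on k-means' `inertia_`), and the data is a randomly generated stand-in for the flattened trajectories, not the asker's `df`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: 96 trajectories flattened to 8 columns.
rng = np.random.default_rng(0)
centers = rng.uniform(0, 100, size=(3, 8))
X_raw = np.vstack([c + rng.normal(scale=2.0, size=(32, 8)) for c in centers])

# Fit k-means for a range of k and record the within-cluster
# sum of squared distances (inertia) for each k.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_raw)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The "elbow" is the value of k after which the inertia stops dropping sharply; plotting `inertias` against k makes it easier to spot.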

Thanks for the idea. Actually, I was planning to apply `RandomizedSearchCV` from `sklearn`, but I found that it is meant for supervised learning. As for the number of clusters: once I have an evaluation method, I will write a loop over the main parameters that affect the number of clusters (preference and damping in the case of Affinity Propagation) and apply the evaluation inside the same loop. - Mamed, Sep 11 '19 at 08:51
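The loop described in this comment could be sketched as follows. This is only an illustration under stated assumptions: `sweep_affinity_propagation` is a hypothetical helper, the distance is assumed to be Euclidean, the similarity is the negated squared distance, the silhouette score is the chosen evaluation, and the data is randomly generated.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score


def sweep_affinity_propagation(D, preferences, dampings):
    """Score every (preference, damping) pair; hypothetical helper.

    D is a square distance matrix with a zero diagonal.
    Returns (silhouette, preference, damping, n_clusters) tuples, best first.
    """
    S = -D ** 2  # assumed similarity: negated squared distance
    results = []
    for pref in preferences:
        for damp in dampings:
            af = AffinityPropagation(
                affinity="precomputed",
                preference=pref,
                damping=damp,
                random_state=0,
            ).fit(S)
            labels = af.labels_
            n_clusters = len(set(labels))
            # Skip runs where the silhouette score is undefined
            # (a single cluster, or every point its own cluster).
            if n_clusters < 2 or n_clusters >= len(labels):
                continue
            score = silhouette_score(D, labels, metric="precomputed")
            results.append((float(score), float(pref), damp, n_clusters))
    return sorted(results, reverse=True)


# Hypothetical stand-in data: 96 trajectories flattened to 8 columns.
rng = np.random.default_rng(0)
centers = rng.uniform(0, 100, size=(3, 8))
X_raw = np.vstack([c + rng.normal(scale=2.0, size=(32, 8)) for c in centers])
D = squareform(pdist(X_raw))

# Candidate preferences scaled from the median similarity (an arbitrary choice).
median_similarity = np.median(-D ** 2)
preferences = median_similarity * np.array([0.5, 1.0, 2.0, 5.0])
results = sweep_affinity_propagation(D, preferences, dampings=[0.5, 0.7, 0.9])

for row in results[:3]:
    print(row)
```

Unlike `RandomizedSearchCV`, this needs no target labels: each parameter pair is ranked purely by an internal metric computed from the distance matrix.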

score 1 · Answer 2 · answered Jun 18 '20 at 19:23

The clusteval library can be of use. This library contains five methods that can be used to evaluate clusterings: silhouette, dbindex, derivative, dbscan and hdbscan.

```
pip install clusteval
```

I would suggest dbscan for your case:

```
# Import library
from clusteval import clusteval

# Set parameters
ce = clusteval(method='dbscan')

# Fit to find optimal number of clusters using dbscan
out = ce.fit(df.values)

# Make plot of the cluster evaluation
ce.plot()

# Make scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(df.values)
```
