  1. My raw data looks like:

    df =
    
            long lat long lat long lat long lat 
    
        1   11   6   15   19  23   27  30   34
        2   12   7   16   20  24   28  31   35
        3   13   8   17   21  25   29  32   36
        ...
        96  14   9   18   22  26   30  33   37
    

    Where: the index column 1, 2, 3, ..., 96 is "taxi_id", so we have 96 cars.

    The other columns give the locations of each car, read as (long, lat) pairs.

    Example: the taxi with label 1 has the locations (11,6), (15,19), (23,27), (30,34).

  2. I need to cluster them to see the most common trajectories used by these taxi drivers.

    To do that, I calculated a distance matrix (using some distance measure), converted it to a similarity matrix, and passed that final matrix to Affinity Propagation.

  3. Affinity Propagation code:

    from sklearn.cluster import AffinityPropagation
    
    af = AffinityPropagation(preference=-6).fit(X)
    cluster_centers_indices = af.cluster_centers_indices_
    labels = af.labels_ 
    
    # Some code to calculate number of clusters (3 in this case)
    # Some code to check which "taxi_id" related to clusters
    
  4. And final data looks like:

    final_df = 
    
                   long    lat
            1      11      22
        0   2      33      44
            3      55      66
            ...    ...     ...
            45     12      13
        2   46     14      15
            47     16      17
    

I want to evaluate my clustering, but I do not know how. I did not predict anything, so how can I use the sklearn evaluation metrics? I can not even find a logic (what exactly to evaluate)? Maybe Distance between two clusters (CD)? Do you have any ideas or solution code how to proceed?

jonrsharpe
Mamed

2 Answers


I can not even find a logic (what exactly to evaluate)? Maybe Distance between two clusters (CD)?

You are on the right track. One approach is to measure the distances between the points within each cluster, and to repeat that test for different numbers of clusters; in your case you currently have only 3 clusters (0-2).

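The silhouette score, for example, is one of these techniques: it compares each point's mean distance to its own cluster with its mean distance to the nearest other cluster.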

https://en.wikipedia.org/wiki/Silhouette_(clustering)
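As a minimal sketch of how this could look with Affinity Propagation: the snippet below assumes the 96 trajectories are flattened to 8 numbers each, uses plain Euclidean distance and the negative squared distance as similarity, and runs on synthetic data built around three hypothetical routes. None of these choices come from the question, so swap in your own distance matrix. The point to note is that `silhouette_score` accepts a precomputed distance matrix, so no feature vectors or predictions are needed.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances, silhouette_score

# Synthetic stand-in for the taxi data (an assumption, not the real data):
# 96 trajectories, each flattened to 8 numbers (4 long/lat pairs),
# scattered around 3 hypothetical routes.
rng = np.random.default_rng(0)
routes = rng.uniform(0, 40, size=(3, 8))
X = np.vstack([r + rng.normal(scale=0.5, size=(32, 8)) for r in routes])

# Distance matrix; Euclidean is only a placeholder for a trajectory distance.
D = pairwise_distances(X, metric="euclidean")
np.fill_diagonal(D, 0.0)  # precomputed silhouette expects a zero diagonal

# Assumed conversion from distance to similarity: negative squared distance.
S = -D ** 2

# affinity="precomputed" tells sklearn that S already holds similarities.
af = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
labels = af.labels_
n_clusters = len(set(labels))

# The silhouette is only defined for 2 <= n_clusters <= n_samples - 1.
if 2 <= n_clusters < len(labels):
    score = silhouette_score(D, labels, metric="precomputed")
else:
    score = None

print("clusters:", n_clusters, "silhouette:", score)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping or misassigned points.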

Do you have any ideas or solution code how to proceed?

There are a lot of solutions on Stack Overflow: How to use silhouette score in k-means clustering from sklearn library?

Another option for you could be the elbow method: Sklearn kmeans equivalent of elbow method

The question all of these methods try to answer is: how many clusters should I pick? If you already know up front how many clusters you want, they can still help you judge the risk and quality of the clusters.
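Affinity Propagation has no explicit cluster-count parameter, so one way to apply this idea is to vary `preference` (and possibly `damping`) and score each result. The sketch below is one possible setup, not a definitive recipe: the helper `sweep_preference`, the synthetic data, the Euclidean distance, and the choice of preference values as multiples of the median similarity are all assumptions made for illustration.

```python
import warnings

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import pairwise_distances, silhouette_score


def sweep_preference(D, preferences, damping=0.9):
    """Fit Affinity Propagation once per preference value and score each fit.

    D is a square distance matrix with a zero diagonal. Returns a list of
    (preference, n_clusters, silhouette) tuples for the usable fits.
    """
    S = -D ** 2  # assumed conversion from distance to similarity
    results = []
    for pref in preferences:
        with warnings.catch_warnings():
            # Silence convergence warnings; unusable labellings are skipped below.
            warnings.simplefilter("ignore", ConvergenceWarning)
            af = AffinityPropagation(
                affinity="precomputed",
                preference=float(pref),
                damping=damping,
                random_state=0,
            ).fit(S)
        labels = af.labels_
        k = len(set(labels))
        # The silhouette needs at least 2 clusters and fewer clusters than points.
        if 2 <= k < len(labels):
            score = silhouette_score(D, labels, metric="precomputed")
            results.append((float(pref), k, float(score)))
    return results


# Synthetic stand-in for the taxi data (an assumption, not the real data).
rng = np.random.default_rng(0)
routes = rng.uniform(0, 40, size=(3, 8))
X = np.vstack([r + rng.normal(scale=0.5, size=(32, 8)) for r in routes])
D = pairwise_distances(X, metric="euclidean")
np.fill_diagonal(D, 0.0)

# Candidate preferences: multiples of the median similarity (more negative
# values generally lead to fewer clusters).
median_sim = np.median(-D ** 2)
results = sweep_preference(D, median_sim * np.array([0.5, 1.0, 2.0, 5.0]))

for pref, k, score in results:
    print(f"preference={pref:.1f}  clusters={k}  silhouette={score:.3f}")

if results:
    best = max(results, key=lambda r: r[2])
    print("best setting by silhouette:", best)
```

Picking the setting with the highest silhouette is only one heuristic; it is worth checking that the resulting clusters also make sense as trajectories.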

PV8
    Thanks for the idea. Actually, I was planning to apply `RandomizedSearchCV` from `sklearn`, but I found that it is for supervised learning. As for the number of clusters: once I have an evaluation method, I will write a loop that varies the main parameters affecting the number of clusters (preference and damping in the case of Affinity Propagation) and apply the evaluation inside the same loop. – Mamed Sep 11 '19 at 08:51

The clusteval library can be of use. This library contains five methods that can be used to evaluate clusterings: silhouette, dbindex, derivative, dbscan, and hdbscan.

pip install clusteval

I would suggest dbscan for your case:

# Import library
from clusteval import clusteval

# Set parameters
ce = clusteval(method='dbscan')

# Fit to find optimal number of clusters using dbscan
out = ce.fit(df.values)

# Make plot of the cluster evaluation
ce.plot()

# Make scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(df.values)
erdogant