I am using DBSCAN for clustering. Now I want to pick a point from each cluster that represents it, but I realized that DBSCAN does not have centroids the way k-means does.

However, I observed that DBSCAN has something called core points. I am wondering whether it is possible to use these core points, or some other alternative, to obtain a representative point from each cluster.

The code I have used is shown below.

import numpy as np
from math import pi
from sklearn.cluster import DBSCAN

#points containing time value in minutes
points = [100, 200, 600, 659, 700]

def convert_to_radian(x):
    return((x / (24 * 60)) * 2 * pi)

rad_function = np.vectorize(convert_to_radian)
points_rad = rad_function(points)

#pairwise signed differences between the points (in radians)
dist = points_rad[None, :] - points_rad[:, None]

#wrap differences larger than half a day so each pair uses the
#shorter arc around the 24-hour circle
dist[(dist > pi) & (dist <= 2 * pi)] -= 2 * pi
dist[(dist > -2 * pi) & (dist <= -pi)] += 2 * pi
dist = abs(dist)

#check dist
print(dist)

#eps is a 100-minute tolerance expressed in radians; metric='precomputed'
#because we pass the distance matrix directly
db = DBSCAN(eps=((100 / (24 * 60)) * 2 * pi), min_samples=2, metric='precomputed')

#check db
print(db)

db.fit(dist)

#get labels
labels = db.labels_

#get number of clusters
no_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print('No of clusters:', no_clusters)
print('Cluster 0 : ', np.nonzero(labels == 0)[0])
print('Cluster 1 : ', np.nonzero(labels == 1)[0])

print(db.core_sample_indices_)

I am happy to provide more details if needed.

  • Just in case you don't know: Kmeans is a centroid-based method (each cluster is just a centroid and all points belong to the nearest centroid). DBSCAN is density-based, so the resulting clusters can have any shape, as long as there are points close enough to each other. So DBSCAN could also result in a "ball"-cluster in the center with a "circle"-cluster around it. Both clusters would have the same "centroid" in that case, which is the reason why computing centroids for DBSCAN results can be highly misleading. So take care when working with those centroids (or use a centroid-based method). – Niklas Mertsch Jun 06 '20 at 08:39
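
To see that point concretely, here is a toy example (parameter values are chosen only for illustration): DBSCAN finds two clusters, and both have roughly the same mean, even though only one of them actually surrounds it.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN

#toy data: a small circle in the centre with a larger ring around it
X, _ = make_circles(n_samples=500, factor=0.2, noise=0.02, random_state=0)

labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

#both clusters have a centroid near the origin, so the ring's centroid
#is not a useful representative point for the ring
for k in set(labels) - {-1}:
    print(k, X[labels == k].mean(axis=0))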

2 Answers


Why don't you estimate the centroids of the resulting clusters?

#rows of the precomputed distance matrix that belong to cluster 0
points_of_cluster_0 = dist[labels == 0, :]
#mean distance profile of cluster 0 (its "centroid" in the precomputed-distance representation)
centroid_of_cluster_0 = np.mean(points_of_cluster_0, axis=0)
print(centroid_of_cluster_0)

points_of_cluster_1 = dist[labels == 1, :]
centroid_of_cluster_1 = np.mean(points_of_cluster_1, axis=0)
print(centroid_of_cluster_1)
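
If you would rather report an actual data point than a mean of distance rows, one option (just a sketch, reusing the dist, labels and points variables from the question) is to take, per cluster, the member whose distance profile is closest to that mean:

#medoid-like representative: the cluster member whose distance profile
#is closest to the cluster's mean profile
for k in set(labels) - {-1}:                       #skip noise points (-1)
    members = np.flatnonzero(labels == k)
    mean_profile = dist[members].mean(axis=0)
    rep = members[np.argmin(np.linalg.norm(dist[members] - mean_profile, axis=1))]
    print('representative of cluster', k, ':', points[rep], 'minutes')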
seralouk

Maybe do a KDE-style density estimate row by row, e.g. density_i = np.where(cdist(x[i:i+1], x[inds]) - cut_off < 0, 1, 0).sum(1) for each point i in a cluster (where inds = np.argwhere(cluster_results == cluster_index)), and take the point with the highest density in each cluster; that is the most representative centroid. This may still be slow if the dataset is massive.
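
A runnable version of that idea (the function and variable names here are only illustrative, and the cut-off value is up to you) might look like this:

import numpy as np
from scipy.spatial.distance import cdist

def densest_point_per_cluster(x, labels, cut_off):
    #x: (n_samples, n_features) Euclidean array, labels: DBSCAN labels,
    #cut_off: neighbourhood radius for the crude density estimate
    representatives = {}
    for k in set(labels) - {-1}:                      #skip noise (-1)
        inds = np.flatnonzero(labels == k)            #members of cluster k
        #for each member, count how many members lie within cut_off
        density = (cdist(x[inds], x[inds]) < cut_off).sum(axis=1)
        representatives[k] = inds[np.argmax(density)] #densest member
    return representatives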

Chonk
  • NB: as mentioned in the comment above, a non-Euclidean dataset q first needs to be represented/featurized/mapped to a Euclidean coordinate system x := map(q), even before going into DBSCAN. [In terms of the two GPS coordinates, one of them (the around-the-equator one) is mapped to a 2D circle by [sin, cos](q[:, 0]), and the other one (north to south) probably to a semi-circle by [cos](q[:, 1]), so x is 3D.] – Chonk Jul 27 '22 at 01:44
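
For the time-of-day data in the question, that mapping could look like the sketch below (converting the 100-minute eps to a chord length is my own choice, made to keep roughly the same tolerance as the precomputed version):

import numpy as np
from math import pi
from sklearn.cluster import DBSCAN

#map minutes-of-day onto the unit circle so plain Euclidean DBSCAN
#respects the wrap-around at midnight
points = np.array([100, 200, 600, 659, 700])
theta = points / (24 * 60) * 2 * pi
x = np.column_stack([np.sin(theta), np.cos(theta)])   #2-D Euclidean features

#chord length corresponding to a 100-minute arc (same tolerance as the question)
eps_chord = 2 * np.sin((100 / (24 * 60)) * pi)

labels = DBSCAN(eps=eps_chord, min_samples=2).fit_predict(x)
print(labels)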