
Reading around, I found that it is possible to pass a precomputed distance matrix to scikit-learn's DBSCAN. Unfortunately, I don't know how to pass it in for the calculation.

Say I have a 1D array with 100 elements containing just the names of the nodes. Then I have a 100x100 2D matrix containing the distance between each pair of elements (in the same order).

I know I have to call it:

db = DBSCAN(eps=2, min_samples=5, metric="precomputed")

That is for a distance between nodes of 2 and a minimum of 5 nodes per cluster, with "precomputed" indicating that the 2D matrix should be used. But how do I pass the matrix in for the calculation?

The same question applies when using the RAPIDS cuML DBSCAN class (GPU accelerated).

Jaime Nebrera

1 Answer


documentation:

class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', 
metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
[...]
[...]
metric : string, or callable, default=’euclidean’
The metric to use when calculating distance between instances in a feature array. If 
metric is a string or callable, it must be one of the options allowed by 
sklearn.metrics.pairwise_distances for its metric parameter. If metric is 
“precomputed”, X is assumed to be a distance matrix and must be square. X may be a 
sparse graph, in which case only “nonzero” elements may be considered neighbors for 
DBSCAN.
[...]

So, the way you normally call this is:

from sklearn.cluster import DBSCAN

clustering = DBSCAN()
clustering.fit(X)

If you have a distance matrix, you pass it to fit in place of the feature array:

from sklearn.cluster import DBSCAN

clustering = DBSCAN(metric='precomputed')
clustering.fit(distance_matrix)
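
To tie this back to the question's setup (a 1D array of node names plus a square distance matrix in the same order), here is a minimal sketch. The names, positions and parameter values are made up for illustration; the only real requirement is that row i and column i of the matrix refer to names[i].

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: 6 node names and 1D positions, used only to build
# a symmetric distance matrix whose rows/columns follow the order of `names`.
names = np.array(["a", "b", "c", "d", "e", "f"])
positions = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 50.0])
distance_matrix = np.abs(positions[:, None] - positions[None, :])  # shape (6, 6)

# metric="precomputed" tells DBSCAN that the input is already a distance matrix.
db = DBSCAN(eps=2, min_samples=2, metric="precomputed")
labels = db.fit_predict(distance_matrix)  # one label per row; -1 means noise

# Map the labels back to node names; labels[i] belongs to names[i].
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(int(label), []).append(str(name))

print(clusters)
```

The labels come back in the same order as the rows of the matrix, so the names array is never passed to DBSCAN itself; it is only used afterwards to interpret the result. Note that min_samples counts the point itself, and points that belong to no cluster get the label -1.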
warped