
I am using scikit-learn to cluster a large amount of data. I have a large sparse matrix (44104 by 755144 elements where most are 0). I want to use DBSCAN for the clustering since it makes sense for my problem, and since it allows me not to specify the number of clusters.

It seems possible to use DBSCAN with sparse data; this is also discussed in the question "In scikit-learn, can DBSCAN use sparse matrix?".

However, there seem to be some issues when I use the Dice metric. In dbscan_.py, where DBSCAN is implemented, I can remove the line:

X = np.asarray(X)

which will not work with my sparse matrix, but then I run into problems in:

neighbors_model = NearestNeighbors(radius=eps, algorithm=algorithm,
                                   leaf_size=leaf_size,
                                   metric=metric, p=p)
neighbors_model.fit(X)

where I get the error:

ValueError: metric 'dice' not valid for sparse input

Is there a good reason that the Dice metric is not valid for sparse input, or is it something that I can implement myself?
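
For reference, here is a minimal sketch of what "implementing it myself" could look like. It assumes the matrix is binary (every stored nonzero is treated as True) and that a dense distance matrix for the rows being clustered fits in memory. The function name `sparse_dice_distances` is hypothetical and not part of scikit-learn; the treatment of two all-zero rows as distance 0 is also an assumption.

```python
import numpy as np
from scipy import sparse


def sparse_dice_distances(X):
    """Pairwise Dice dissimilarity between the rows of a sparse binary matrix.

    Hypothetical helper, not part of scikit-learn.
    """
    # Treat every stored nonzero as True, then count with integers.
    B = sparse.csr_matrix(X, dtype=bool).astype(np.int64)
    # |a AND b| for every pair of rows, computed as a sparse product.
    intersections = (B @ B.T).toarray()
    # Number of True entries in each row.
    sizes = np.asarray(B.sum(axis=1)).ravel()
    denominators = sizes[:, None] + sizes[None, :]
    # Dice similarity 2|a AND b| / (|a| + |b|).
    # Assumption: two all-zero rows get similarity 1 (distance 0).
    similarity = np.divide(2.0 * intersections, denominators,
                           out=np.ones(denominators.shape, dtype=float),
                           where=denominators > 0)
    return 1.0 - similarity
```

The resulting array could then be passed to DBSCAN with metric='precomputed'. Note that the output is dense: for 44104 rows it would hold about 1.9 billion float64 values (roughly 15 GB), so on data of this size the computation would probably have to be done in row blocks or on a subsample.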

utdiscant
  • I am not entirely sure but I think this is just not implemented (yet). Which version of sklearn are you using? It seems strange that you need to remove np.asarray. – Andreas Mueller Jul 16 '14 at 14:53
  • I guess DBSCAN can support sparse matrices for some metrics; it's just that it's primarily a geo method, so Haversine or Euclidean distance in 2-d or 3-d space is the primary use case and high-d use cases were not considered. (The curse of dimensionality will also make it hard to guess the right DBSCAN parameters...) – Fred Foo Jul 16 '14 at 14:55
  • I am using version 0.15.0b. Currently I am going for making my own implementation of DBSCAN using MinHash and Jaccard similarity. – utdiscant Jul 17 '14 at 10:54
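
To illustrate the MinHash idea mentioned in the last comment, here is a rough sketch of estimating Jaccard similarity from MinHash signatures. It is not the commenter's implementation: the function names are hypothetical, each row is assumed to be represented as a non-empty set of its nonzero column indices, and salted BLAKE2b from the standard library stands in for a family of hash functions.

```python
import hashlib


def _seeded_hash(item, seed):
    """64-bit hash of `item` under one member of a salted hash family."""
    digest = hashlib.blake2b(repr(item).encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """MinHash signature of a non-empty set: the minimum hash per seed."""
    return [min(_seeded_hash(item, seed) for item in items)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

The estimate approaches the exact Jaccard similarity |a ∩ b| / |a ∪ b| as num_hashes grows, and the fixed-length signatures are much cheaper to compare than rows with 755144 columns.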

0 Answers