I am using scikit-learn to cluster a large amount of data. I have a large sparse matrix (44104 by 755144 elements where most are 0). I want to use DBSCAN for the clustering since it makes sense for my problem, and since it allows me not to specify the number of clusters.
It seems possible to use DBSCAN with sparse data, and this is also discussed in the question "In scikit-learn, can DBSCAN use sparse matrix?".
However, there seem to be some issues when I use the dice metric. In dbscan_.py, where DBSCAN is implemented, I can remove the line:
X = np.asarray(X)
which will not work with my sparse matrix, but then I run into problems in:
neighbors_model = NearestNeighbors(radius=eps, algorithm=algorithm,
                                   leaf_size=leaf_size,
                                   metric=metric, p=p)
neighbors_model.fit(X)
where I get the error:
ValueError: metric 'dice' not valid for sparse input
Is there a good reason that the dice metric is not valid for sparse input, or is it something that I can implement myself?
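
To make the "implement it myself" part concrete, here is a minimal sketch of what I have in mind. It is not scikit-learn's code: the function name `dice_dissimilarity_block` and the row-block arguments are hypothetical, it assumes every nonzero entry counts as True, and it assumes two all-zero rows should get a dissimilarity of 0. It uses the identity dice(u, v) = 1 - 2|u AND v| / (|u| + |v|), so only a sparse matrix product and the per-row nonzero counts are needed:

```python
import numpy as np
import scipy.sparse as sp


def dice_dissimilarity_block(X, start, stop):
    """Dice dissimilarity between rows X[start:stop] and all rows of X.

    Hypothetical helper, not part of scikit-learn. X is treated as
    binary: any nonzero entry counts as True.
    Returns a dense array of shape (stop - start, n_rows).
    """
    # Binarize and keep the data sparse.
    B = (sp.csr_matrix(X) != 0).astype(np.int64)

    # Number of True entries in each row.
    sizes = np.asarray(B.sum(axis=1)).ravel()

    # Shared True entries between the block rows and every row.
    inter = (B[start:stop] @ B.T).toarray()

    # |u| + |v| for every (block row, row) pair.
    denom = sizes[start:stop, None] + sizes[None, :]

    # Assumption: two all-zero rows are treated as identical (similarity 1).
    with np.errstate(divide="ignore", invalid="ignore"):
        sim = np.where(denom > 0, 2.0 * inter / denom, 1.0)
    return 1.0 - sim


# Tiny example: rows 0 and 1 share one of their two nonzero columns.
X = sp.csr_matrix(np.array([[1, 0, 1, 0],
                            [1, 1, 0, 0],
                            [0, 0, 0, 1]]))
D = dice_dissimilarity_block(X, 0, 3)
```

Only the block of rows is made dense here, since the full pairwise matrix for 44104 rows would have roughly 1.9 billion entries. Rows with no shared nonzero column always get a dissimilarity of 1, so for eps below 1 only the overlapping pairs would matter for the neighborhood queries.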