10

I got Memory Error when I was running dbscan algorithm of scikit. My data is about 20000*10000, it's a binary matrix.

(Maybe it's not suitable to use DBSCAN with such a matrix. I'm a beginner of machine learning. I just want to find a cluster method which don't need an initial cluster number)

Anyway I found sparse matrix and feature extraction of scikit.

http://scikit-learn.org/dev/modules/feature_extraction.html http://docs.scipy.org/doc/scipy/reference/sparse.html

But I still have no idea how to use it. In DBSCAN's specification, there is no indication about using sparse matrix. Is it not allowed?

If anyone knows how to use sparse matrix in DBSCAN, please tell me. Or you can tell me a more suitable cluster method.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
user2147650
  • 183
  • 2
  • 7
  • Possible duplicate of [scikit-learn DBSCAN memory usage](http://stackoverflow.com/questions/16381577/scikit-learn-dbscan-memory-usage) – Nikana Reklawyks Jan 17 '17 at 03:56

4 Answers4

8

The scikit implementation of DBSCAN is, unfortunately, very naive. It needs to be rewritten to take indexing (ball trees etc.) into account.

As of now, it will apparently insist of computing a complete distance matrix, which wastes a lot of memory.

May I suggest that you just reimplement DBSCAN yourself. It's fairly easy, there exists good pseudocode e.g. on Wikipedia and in the original publication. It should be just a few lines, and you can then easily take benefit of your data representation. E.g. if you already have a similarity graph in a sparse representation, it's usually fairly trivial to do a "range query" (i.e. use only the edges that satisfy your distance threshold)

Here is a issue in scikit-learn github where they talk about improving the implementation. A user reports his version using the ball-tree is 50x faster (which doesn't surprise me, I've seen similar speedups with indexes before - it will likely become more pronounced when further increasing the data set size).

Update: the DBSCAN version in scikit-learn has received substantial improvements since this answer was written.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
1

You can pass a distance matrix to DBSCAN, so assuming X is your sample matrix, the following should work:

from sklearn.metrics.pairwise import euclidean_distances

D = euclidean_distances(X, X)
db = DBSCAN(metric="precomputed").fit(D)

However, the matrix D will be even larger than X: n_samples² entries. With sparse matrices, k-means is probably the best option.

(DBSCAN may seem attractive because it doesn't need a pre-determined number of clusters, but it trades that for two parameters that you have to tune. It's mostly applicable in settings where the samples are points in space and you know how close you want those points to be to be in the same cluster, or when you have a black box distance metric that scikit-learn doesn't support.)

Fred Foo
  • 355,277
  • 75
  • 744
  • 836
  • 1
    DBSCAN doesn't require the distance matrix, that is a limitation of the current sklearn implementation, not of the algorithm. Plus, in many cases, both the epsion and the minpts parameter of DBSCAN can be chosen *much* easier than `k`. When using geographic data for example, a user may well be able to say that a radius of "1 km" is a good epsilon, and that there should be at least 10 events within this radius. – Has QUIT--Anony-Mousse May 25 '13 at 12:38
  • @Anony-Mousse: I'm aware that the problems are in the implementation, not the algorithm. As for picking eps and minpts, yes, for some problems that may be easy, but for others, extensive tuning may be required. Not all problems live in Euclidean space or even on the surface of the earth. – Fred Foo May 25 '13 at 12:49
1

Yes, since version 0.16.1. Here's a commit for a test:

https://github.com/scikit-learn/scikit-learn/commit/494b8e574337e510bcb6fd0c941e390371ef1879

K.-Michael Aye
  • 5,465
  • 6
  • 44
  • 56
0

Sklearn's DBSCAN algorithm doesn't take sparse arrays. However, KMeans and Spectral clustering do, you can try these. More on sklearns clustering methods: http://scikit-learn.org/stable/modules/clustering.html

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
Ando Saabas
  • 1,967
  • 14
  • 12