I am running DBSCAN on a dataset of 400K data points. Here is the error I get:

Traceback (most recent call last):
  File "/myproject/DBSCAN_section.py", line 498, in perform_dbscan_on_data
    db = DBSCAN(eps=2, min_samples=5).fit(data)
  File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/cluster/dbscan_.py", line 266, in fit
    **self.get_params())
  File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/cluster/dbscan_.py", line 138, in dbscan
    return_distance=False)
  File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/neighbors/base.py", line 621, in radius_neighbors
    return_distance=return_distance)
  File "sklearn/neighbors/binary_tree.pxi", line 1491, in sklearn.neighbors.kd_tree.BinaryTree.query_radius (sklearn/neighbors/kd_tree.c:13013)
MemoryError

How can I fix this? Is there any limit on the number of data points DBSCAN can process?

My example is based on: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

My data is in X, Y coordinate format:

11.342276,11.163416
11.050597,10.745579
10.798838,10.559784
11.249279,11.445535
11.385767,10.989214
10.825875,10.530120
10.598493,11.236947
10.571042,10.830799
11.454966,11.295484
11.431454,11.200208
10.774908,11.102601
10.602692,11.395169
11.324441,11.088243
10.731538,10.695864
10.537385,10.923226
11.215886,11.391537

Should I convert my data to sparse CSR? How?

passion
  • Could you not sample your data at, say, a 5-10-25% rate to more easily fit it into memory? Also, I'm surprised 800k floats is running you out of RAM, how much RAM do you have?! – Henry May 23 '17 at 12:20
  • @Henry my RAM is around 64GB!!! What do you mean by sampling? I have to consider all my data points one by one... I can't exclude any. – passion May 23 '17 at 13:10
  • You could, for example, take 10% of your data points, cluster over them to generate your centroids, and then see how those centroids fit your full data set? – Henry May 23 '17 at 13:37
  • @Henry I see now. Could you give an example of this way of doing it? Thanks. – passion May 23 '17 at 13:50
  • Not really. I'm not familiar with the data set, nor with the DBSCAN technique. But you could take a random 10% of your input data, run DBSCAN over it as you do now, take the resulting centroids(?) from DBSCAN, compare them against your remaining 90% of the data, and see how those distances compare against the distances within the sample. – Henry May 23 '17 at 13:52
  • @Henry DBSCAN doesn't use centroids. – Has QUIT--Anony-Mousse May 25 '17 at 00:26
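
For anyone who wants to try the subsampling idea from the comments, here is a rough, untested sketch. The file name points.txt is hypothetical; it propagates labels via DBSCAN's core samples rather than centroids (since DBSCAN has no centroids), and it assumes the subsample yields at least one core point:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import NearestNeighbors

    # Hypothetical input file: one "x,y" pair per line, as in the question.
    data = np.loadtxt("points.txt", delimiter=",")

    # Cluster a random 10% subsample to keep the neighborhood structure small.
    rng = np.random.RandomState(42)
    sample_idx = rng.choice(len(data), size=len(data) // 10, replace=False)
    sample = data[sample_idx]

    db = DBSCAN(eps=2, min_samples=5).fit(sample)

    # Propagate labels from the core samples: every remaining point within eps
    # of a core sample inherits that cluster, everything else becomes noise (-1).
    core_points = sample[db.core_sample_indices_]
    core_labels = db.labels_[db.core_sample_indices_]

    nn = NearestNeighbors(n_neighbors=1).fit(core_points)
    dist, idx = nn.kneighbors(data)
    labels = np.where(dist.ravel() <= 2, core_labels[idx.ravel()], -1)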

1 Answer


sklearn's DBSCAN needs O(n·k) memory, where k is the number of neighbors within epsilon, because it computes and stores the neighborhood of every point up front (that is the radius_neighbors call in your traceback). For a large data set and a large epsilon, this becomes a problem. For a small data set it is faster this way, because more of the work happens in Cython, outside of the slow interpreter; that is the trade-off the sklearn authors chose. For now, also consider using a smaller epsilon.
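
To see whether this is what is biting you, you can estimate k on a small sample before running the full clustering. A rough, untested sketch (the file name points.txt is hypothetical, and the 8-bytes-per-index memory figure is only a ballpark):

    import numpy as np
    from sklearn.neighbors import KDTree

    # Hypothetical input file: one "x,y" pair per line, as in the question.
    data = np.loadtxt("points.txt", delimiter=",")

    # Estimate k = average number of neighbors within eps on a 1000-point sample,
    # to gauge the size of the O(n*k) neighborhood structure sklearn will build.
    tree = KDTree(data)
    sample = data[np.random.choice(len(data), 1000, replace=False)]
    counts = tree.query_radius(sample, r=2, count_only=True)
    avg_k = counts.mean()
    print("average neighbors within eps: %.0f" % avg_k)
    print("rough neighborhood memory: ~%.1f GB" % (len(data) * avg_k * 8 / 1e9))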

But this is not how the original DBSCAN algorithm was designed, and other implementations such as ELKI's are known to scale to millions of points. ELKI queries one point at a time, so it needs only O(n+k) memory. It also includes OPTICS, which is reported to work very well on coordinate data.
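
For illustration only, here is a rough sketch of the textbook query-one-point-at-a-time DBSCAN (this is not ELKI's code, just an outline of why that approach stays at O(n+k) memory):

    import numpy as np
    from sklearn.neighbors import KDTree

    def dbscan_low_memory(X, eps, min_samples):
        # Textbook DBSCAN: one radius query at a time, never the full neighborhood matrix.
        tree = KDTree(X)
        labels = np.full(len(X), -2, dtype=int)   # -2 = unvisited, -1 = noise
        cluster = -1
        for i in range(len(X)):
            if labels[i] != -2:
                continue
            neighbors = tree.query_radius(X[i:i + 1], r=eps)[0]
            if len(neighbors) < min_samples:
                labels[i] = -1                    # not a core point, tentatively noise
                continue
            cluster += 1
            labels[i] = cluster
            seeds = list(neighbors)
            while seeds:
                j = seeds.pop()
                if labels[j] == -1:
                    labels[j] = cluster           # border point reached from a core point
                if labels[j] != -2:
                    continue
                labels[j] = cluster
                j_neighbors = tree.query_radius(X[j:j + 1], r=eps)[0]
                if len(j_neighbors) >= min_samples:
                    seeds.extend(j_neighbors)     # j is also a core point, expand from it
        return labels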

Has QUIT--Anony-Mousse