I am running DBSCAN on a dataset of 400K data points. Here is the error I get:

Traceback (most recent call last):
  File "/myproject/DBSCAN_section.py", line 498, in perform_dbscan_on_data
    db = DBSCAN(eps=2, min_samples=5).fit(data)
  File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/cluster/dbscan_.py", line 266, in fit
    **self.get_params())
  File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/cluster/dbscan_.py", line 138, in dbscan
    return_distance=False)
  File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/neighbors/base.py", line 621, in radius_neighbors
    return_distance=return_distance)
  File "sklearn/neighbors/binary_tree.pxi", line 1491, in sklearn.neighbors.kd_tree.BinaryTree.query_radius (sklearn/neighbors/kd_tree.c:13013)
MemoryError

How can I fix this? Is there any limit on the number of data points DBSCAN can process?

My example is based on: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

My data is in X, Y coordinate format:

11.342276,11.163416
11.050597,10.745579
10.798838,10.559784
11.249279,11.445535
11.385767,10.989214
10.825875,10.530120
10.598493,11.236947
10.571042,10.830799
11.454966,11.295484
11.431454,11.200208
10.774908,11.102601
10.602692,11.395169
11.324441,11.088243
10.731538,10.695864
10.537385,10.923226
11.215886,11.391537

Should I convert my data to sparse CSR? How?

passion
  • Could you not sample your data at, say, a 5-10-25% rate to more easily fit it into memory? Also, I'm surprised 800k floats is running you out of RAM, how much RAM do you have?! – Henry May 23 '17 at 12:20
  • @Henry my RAM is around 64GB!!! What do you mean by sampling? I have to consider all my data points one by one... I can't exclude any. – passion May 23 '17 at 13:10
  • You could, for example, take 10% of your data points, cluster over them to generate your centroids, and then see how those centroids fit your full data set? – Henry May 23 '17 at 13:37
  • @Henry I see now. Could you give an example of this way of doing it? Thanks. – passion May 23 '17 at 13:50
  • Not really. I'm not familiar with the data set, nor with the DBSCAN technique. But you could take a random 10% of your input data, run DBSCAN over it as you do now, take the resulting centroids(?) from DBSCAN, compare them against your remaining 90% of the data, and see how those distances compare against the distances within the sample. – Henry May 23 '17 at 13:52
  • @Henry DBSCAN doesn't use centroids. – Has QUIT--Anony-Mousse May 25 '17 at 00:26
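
For anyone who wants to try the subsampling idea from the comments, here is a rough, untested sketch. The file name points.txt is hypothetical; it propagates labels via DBSCAN's core samples rather than centroids (since DBSCAN has no centroids), and it assumes the subsample yields at least one core point:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import NearestNeighbors

    # Hypothetical input file: one "x,y" pair per line, as in the question.
    data = np.loadtxt("points.txt", delimiter=",")

    # Cluster a random 10% subsample to keep the neighborhood structure small.
    rng = np.random.RandomState(42)
    sample_idx = rng.choice(len(data), size=len(data) // 10, replace=False)
    sample = data[sample_idx]

    db = DBSCAN(eps=2, min_samples=5).fit(sample)

    # Propagate labels from the core samples: every remaining point within eps
    # of a core sample inherits that cluster, everything else becomes noise (-1).
    core_points = sample[db.core_sample_indices_]
    core_labels = db.labels_[db.core_sample_indices_]

    nn = NearestNeighbors(n_neighbors=1).fit(core_points)
    dist, idx = nn.kneighbors(data)
    labels = np.where(dist.ravel() <= 2, core_labels[idx.ravel()], -1)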

1 Answer


sklearn's DBSCAN needs O(n·k) memory, where k is the number of neighbors within epsilon, because it computes and stores the neighborhood of every point up front (that is the radius_neighbors call in your traceback). For a large data set and a large epsilon, this becomes a problem. For a small data set it is faster this way, because more of the work happens in Cython, outside of the slow interpreter; that is the trade-off the sklearn authors chose. For now, also consider using a smaller epsilon.
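
To see whether this is what is biting you, you can estimate k on a small sample before running the full clustering. A rough, untested sketch (the file name points.txt is hypothetical, and the 8-bytes-per-index memory figure is only a ballpark):

    import numpy as np
    from sklearn.neighbors import KDTree

    # Hypothetical input file: one "x,y" pair per line, as in the question.
    data = np.loadtxt("points.txt", delimiter=",")

    # Estimate k = average number of neighbors within eps on a 1000-point sample,
    # to gauge the size of the O(n*k) neighborhood structure sklearn will build.
    tree = KDTree(data)
    sample = data[np.random.choice(len(data), 1000, replace=False)]
    counts = tree.query_radius(sample, r=2, count_only=True)
    avg_k = counts.mean()
    print("average neighbors within eps: %.0f" % avg_k)
    print("rough neighborhood memory: ~%.1f GB" % (len(data) * avg_k * 8 / 1e9))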

But this is not how the original DBSCAN algorithm was designed, and other implementations such as ELKI's are known to scale to millions of points. ELKI queries one point at a time, so it needs only O(n+k) memory. It also includes OPTICS, which is reported to work very well on coordinate data.
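
For illustration only, here is a rough sketch of the textbook query-one-point-at-a-time DBSCAN (this is not ELKI's code, just an outline of why that approach stays at O(n+k) memory):

    import numpy as np
    from sklearn.neighbors import KDTree

    def dbscan_low_memory(X, eps, min_samples):
        # Textbook DBSCAN: one radius query at a time, never the full neighborhood matrix.
        tree = KDTree(X)
        labels = np.full(len(X), -2, dtype=int)   # -2 = unvisited, -1 = noise
        cluster = -1
        for i in range(len(X)):
            if labels[i] != -2:
                continue
            neighbors = tree.query_radius(X[i:i + 1], r=eps)[0]
            if len(neighbors) < min_samples:
                labels[i] = -1                    # not a core point, tentatively noise
                continue
            cluster += 1
            labels[i] = cluster
            seeds = list(neighbors)
            while seeds:
                j = seeds.pop()
                if labels[j] == -1:
                    labels[j] = cluster           # border point reached from a core point
                if labels[j] != -2:
                    continue
                labels[j] = cluster
                j_neighbors = tree.query_radius(X[j:j + 1], r=eps)[0]
                if len(j_neighbors) >= min_samples:
                    seeds.extend(j_neighbors)     # j is also a core point, expand from it
        return labels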

Has QUIT--Anony-Mousse