
For a project that I am currently working on, I need to cluster a relatively large number of pairs of GPS coordinates into different location clusters. After reading many posts and suggestions here on Stack Overflow and trying different approaches, I still run into a memory error every time.

Dataset size: a little over 200 thousand pairs of GPS coordinates

[[108.67235   22.38068 ]
 [110.579506  16.173908]
 [111.34595   23.1978  ]
 ...
 [118.50778   23.03158 ]
 [118.79726   23.83771 ]
 [123.088512  21.478443]]   

Methods tried:

1. HDBSCAN package

import hdbscan
import numpy as np

# haversine expects (lat, lon) in radians, so swap the columns and convert
coordinates = np.radians(df5.values[:, ::-1])
print(coordinates)
clusterer = hdbscan.HDBSCAN(metric='haversine', min_cluster_size=15)
clusterer.fit(coordinates)
  2. DBSCAN with min_samples=15, metric='haversine', algorithm='ball_tree' (a sketch of this call is shown after the list)

  3. Taking the advice of Anony-Mousse, I have tried ELKI as well. [ELKI UI settings screenshot]
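
For reference, a minimal sketch of the DBSCAN call from item 2, assuming the same df5 dataframe as above; the eps value here is a hypothetical placeholder that would need tuning (haversine distances are in radians on the unit sphere):

from sklearn.cluster import DBSCAN
import numpy as np

coords_rad = np.radians(df5.values[:, ::-1])  # haversine wants (lat, lon) in radians
eps_rad = 1.0 / 6371.0                        # hypothetical: ~1 km / Earth radius 6371 km

db = DBSCAN(eps=eps_rad, min_samples=15, metric='haversine',
            algorithm='ball_tree').fit(coords_rad)
labels = db.labels_  # -1 marks noise points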

All of these methods gave me the same MemoryError.

I have read these posts:

  • DBSCAN for clustering of geographic location data
  • Clustering 500,000 geospatial points in python

All these posts suggest that the size of the dataset should not be a problem, yet somehow I keep getting the error message. I am sorry if this turns out to have a simple answer. Is it because of the settings, or simply because I am running it on my laptop with 16 GB of memory?

  • A memory error does not make sense, because DBSCAN only needs O(n) memory, so you should be able to run this on a Raspberry Pi given enough time. Please provide a memory dump & other diagnostic data, not just a vague "out of memory" description. Maybe you have too much other stuff open? What data size does still work? – Has QUIT--Anony-Mousse Aug 22 '18 at 21:58
  • Just for the record: 200k points need about 3.5 MB RAM. Even with some overhead, DBSCAN should be able to process this easily with a few megabytes of RAM, not gigabytes. Show the real error message! – Has QUIT--Anony-Mousse Aug 22 '18 at 22:03
  • @Anony-Mousse Hi!! You are my lucky star! For whatever reason, I tried running it with DBSCAN this morning as a kick-start of my day, it went through! :) Thanks!!! – Timothy.L Aug 23 '18 at 08:42
  • You maybe should keep less stuff open & check the task manager from time to time to see what process is using up your memory... From time to time, shut down kernels you don't use! – Has QUIT--Anony-Mousse Aug 23 '18 at 12:08
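
For context on the ~3.5 MB figure quoted in the comments, a quick back-of-the-envelope check (assuming float64 coordinates):

import numpy as np

# 200,000 points x 2 coordinates x 8 bytes (float64)
coords = np.zeros((200_000, 2), dtype=np.float64)
print(coords.nbytes / 1e6)  # 3.2 MB, in line with the estimate above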

1 Answer


For sklearn: I faced the same problem when I was using the old version of sklearn, 0.19.1, because the complexity was O(n²).

But the problem has been resolved in the new version 0.20.2: there is no memory error anymore, and the complexity becomes O(n·d), where d is the average number of neighbors. It's not the ideal complexity, but it is much better than in the old versions.

Check the notes in the documentation to avoid high memory usage: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
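
Those notes suggest, among other options, precomputing sparse eps-neighborhoods with NearestNeighbors and then clustering with metric='precomputed'. A minimal sketch of that recipe, assuming the coordinates are (lat, lon) in radians and a hypothetical eps of about 1 km:

from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import numpy as np

coords_rad = np.radians(df5.values[:, ::-1])  # (lat, lon) in radians
eps_rad = 1.0 / 6371.0                        # hypothetical: ~1 km / Earth radius

# Precompute the sparse graph of pairwise distances within eps
nn = NearestNeighbors(radius=eps_rad, metric='haversine', algorithm='ball_tree')
nn.fit(coords_rad)
graph = nn.radius_neighbors_graph(coords_rad, mode='distance')

# Cluster on the precomputed sparse matrix instead of the raw coordinates
db = DBSCAN(eps=eps_rad, min_samples=15, metric='precomputed').fit(graph)
labels = db.labels_  # -1 marks noise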
