
We are facing memory issues while running scikit-learn's DBSCAN on 0.7 million data points with 2 columns (latitude and longitude).

We also tried reducing epsilon to small values and lowering the minimum number of points required per cluster.

import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
%matplotlib inline

kms_per_radian = 6371.0088

df = pd.read_csv('data/xxxxxxxxx.csv', encoding='utf-8')

# represent points consistently as (lat, lon).

coords = df[['lat', 'lon']].to_numpy()  # as_matrix() was removed in newer pandas

# define epsilon as 1.5 kilometers, converted to radians for use by haversine.

epsilon = 1.5 / kms_per_radian



db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

The kernel is killed automatically. Please suggest ways to optimize the code.

HK boy
    Possible duplicate of [scikit-learn DBSCAN memory usage](https://stackoverflow.com/questions/16381577/scikit-learn-dbscan-memory-usage) – PV8 Jun 06 '19 at 11:43

1 Answer


DBSCAN pre-computes the full O(n²) distance matrix, which exhausts memory at this scale. A possible solution is to decrease your eps value.
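Besides shrinking eps, another common way to cut memory for geographic data is to collapse near-duplicate coordinates into weighted points before clustering, since DBSCAN's `fit` accepts a `sample_weight` argument. This is a hedged sketch, not the asker's code: the synthetic `coords` array and the 4-decimal rounding precision (roughly 11 m at the equator) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

kms_per_radian = 6371.0088
epsilon = 1.5 / kms_per_radian  # 1.5 km expressed in radians for haversine

# Stand-in for the real data: 10,000 random (lat, lon) points.
rng = np.random.default_rng(0)
coords = rng.uniform(low=[22.1, 113.8], high=[22.6, 114.4], size=(10000, 2))

# Round to 4 decimal places (~11 m) so nearby points fall onto the same
# grid cell, then keep one representative per cell with a count.
rounded = np.round(coords, 4)
unique_pts, counts = np.unique(rounded, axis=0, return_counts=True)

# Cluster only the unique points; counts act as weights, so density is
# preserved while the ball tree sees far fewer samples.
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree',
            metric='haversine').fit(np.radians(unique_pts),
                                    sample_weight=counts)
labels = db.labels_  # one label per unique point
```

Each original point can then be mapped back to its cluster via its rounded cell. The memory saving depends entirely on how much the data actually overlaps at the chosen precision.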

  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/29971331) – m4n0 Oct 02 '21 at 18:01