
We are facing memory issues while running scikit-learn's DBSCAN on 0.7 million data points with 2 columns (latitude and longitude).

We also tried reducing epsilon to small values and lowering the minimum number of points required per cluster.

import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
%matplotlib inline

kms_per_radian = 6371.0088

df = pd.read_csv('data/xxxxxxxxx.csv', encoding='utf-8')

# represent points consistently as (lat, lon).

coords = df[['lat', 'lon']].to_numpy()  # as_matrix() was removed in newer pandas

# define epsilon as 1.5 kilometers, converted to radians for use by haversine.

epsilon = 1.5 / kms_per_radian



db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

The kernel is killed automatically. Please suggest ways to optimize the code.

HK boy
    Possible duplicate of [scikit-learn DBSCAN memory usage](https://stackoverflow.com/questions/16381577/scikit-learn-dbscan-memory-usage) – PV8 Jun 06 '19 at 11:43

1 Answer


DBSCAN pre-computes the full O(n²) distance matrix, which exhausts memory at this scale. A possible solution is to decrease your eps value.
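Besides shrinking eps, another common way to cut memory for geographic data is to collapse near-duplicate coordinates into weighted points before clustering, since DBSCAN's `fit` accepts a `sample_weight` argument. This is a hedged sketch, not the asker's code: the synthetic `coords` array and the 4-decimal rounding precision (roughly 11 m at the equator) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

kms_per_radian = 6371.0088
epsilon = 1.5 / kms_per_radian  # 1.5 km expressed in radians for haversine

# Stand-in for the real data: 10,000 random (lat, lon) points.
rng = np.random.default_rng(0)
coords = rng.uniform(low=[22.1, 113.8], high=[22.6, 114.4], size=(10000, 2))

# Round to 4 decimal places (~11 m) so nearby points fall onto the same
# grid cell, then keep one representative per cell with a count.
rounded = np.round(coords, 4)
unique_pts, counts = np.unique(rounded, axis=0, return_counts=True)

# Cluster only the unique points; counts act as weights, so density is
# preserved while the ball tree sees far fewer samples.
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree',
            metric='haversine').fit(np.radians(unique_pts),
                                    sample_weight=counts)
labels = db.labels_  # one label per unique point
```

Each original point can then be mapped back to its cluster via its rounded cell. The memory saving depends entirely on how much the data actually overlaps at the chosen precision.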

  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/29971331) – m4n0 Oct 02 '21 at 18:01