We are facing memory issues while running scikit-learn's DBSCAN on 0.7 million data points with two columns (latitude and longitude).
We have also tried smaller epsilon values and reducing the minimum number of points required per cluster.
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
%matplotlib inline
kms_per_radian = 6371.0088
df = pd.read_csv('data/xxxxxxxxx.csv', encoding='utf-8')
# Represent points consistently as (lat, lon).
coords = df[['lat', 'lon']].to_numpy()
# Define epsilon as 1.5 kilometers, converted to radians for use with the haversine metric.
epsilon = 1.5 / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
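For reference, a minimal sketch of the "smaller epsilon" variant mentioned above; the 0.5 km radius is only an illustrative placeholder, not the exact value we used, and min_samples is already at its minimum of 1:
# Illustrative variant with a smaller search radius (0.5 km is a placeholder, not our exact value).
epsilon_small = 0.5 / kms_per_radian
db_small = DBSCAN(eps=epsilon_small, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))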
The kernel gets killed automatically. Please suggest ways to optimize this code.