I'm clustering data with DBSCAN in order to remove outliers. The computation is very memory-intensive because the implementation of DBSCAN in scikit-learn can't handle almost 1 GB of data. The problem was already mentioned here
The bottleneck of the code below appears to be the calculation of the distance matrix, which is extremely memory-consuming: my full data set would give a roughly 10 million x 10 million matrix, and even the 1,000,000-point sample below implies a 1,000,000 x 1,000,000 float64 matrix, about 8 TB. Is there a way to optimize the computation of DBSCAN?
My brief research suggests that the matrix would have to be reduced to a sparse matrix in some way to make the computation feasible.
My ideas for solving this problem:
- create and compute a sparse distance matrix (see the first sketch below)
- compute the matrix in blocks, save them to files, and merge them later (second sketch below)
- perform DBSCAN on small subsets of the data and merge the results (third sketch below)
- switch to Java and use the ELKI tool
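For the sparse idea (first sketch), this is roughly what I have in mind: build only the neighborhoods within eps using radius_neighbors_graph, which returns a SciPy CSR matrix, and pass it to DBSCAN as a precomputed sparse distance matrix; as far as I can tell, scikit-learn then treats only the stored entries as candidate neighbors. The data and sizes here are stand-ins:

import numpy as np
from sklearn.neighbors import radius_neighbors_graph
from sklearn.cluster import DBSCAN
# stand-in for the scaled data from the code below
X = np.random.uniform(size=(100000, 2))
eps = 0.1
# CSR matrix storing only distances to neighbors within eps, so memory
# scales with the number of neighbor pairs instead of n^2
graph = radius_neighbors_graph(X, radius=eps, mode="distance",
                               include_self=False, n_jobs=-1)
# DBSCAN accepts a sparse precomputed matrix; only stored entries are
# considered as possible neighbors
db = DBSCAN(eps=eps, min_samples=60, metric="precomputed",
            n_jobs=-1).fit(graph)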
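For the block-wise idea (second sketch), a minimal version that streams stripes of the matrix into a single memory-mapped file instead of many separate files; the block size is a placeholder to tune against the available RAM. Note this only moves the n^2 cost from RAM to disk, so it defers the problem rather than solving it:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
X = np.random.uniform(size=(20000, 2))  # stand-in data
n = X.shape[0]
block = 5000  # placeholder block size
# on-disk matrix instead of an in-RAM one (still n x n float32, ~1.6 GB here)
dist = np.memmap("dist.dat", dtype="float32", mode="w+", shape=(n, n))
for start in range(0, n, block):
    stop = min(start + block, n)
    # compute one horizontal stripe of the distance matrix at a time
    dist[start:stop] = euclidean_distances(X[start:stop], X)
dist.flush()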
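For the subset idea (third sketch), the closest thing to a merge I can come up with is a heuristic: run DBSCAN on a random subsample, then give every remaining point the label of its nearest core sample if that core sample lies within eps, and mark it as noise otherwise. This is an approximation, not a true merge of DBSCAN runs, and min_samples has to be re-tuned because subsampling lowers the density (the values below are placeholders):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
X = np.random.uniform(size=(100000, 2))  # stand-in data
eps = 0.1
# cluster a manageable random subsample
rng = np.random.RandomState(0)
sample_idx = rng.choice(len(X), size=20000, replace=False)
db = DBSCAN(eps=eps, min_samples=15).fit(X[sample_idx])
# label each point by its nearest core sample, or -1 (noise) if even
# the nearest core sample is farther away than eps
core = X[sample_idx][db.core_sample_indices_]
core_labels = db.labels_[db.core_sample_indices_]
nn = NearestNeighbors(n_neighbors=1).fit(core)
dist, ind = nn.kneighbors(X)
labels = np.where(dist.ravel() <= eps, core_labels[ind.ravel()], -1)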
Code:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances
# sample data
speed = np.random.uniform(0, 25, 1000000)
power = np.random.uniform(0, 3000, 1000000)
# create a dataframe
data_dict = {'speed': speed,
             'power': power}
df = pd.DataFrame(data_dict)
# convert to a float64 array (.values instead of the deprecated as_matrix())
X = df.values.astype("float64", copy=False)
# normalize the data to zero mean and unit variance
X = StandardScaler().fit_transform(X)
# precompute the matrix of pairwise distances -- this is the bottleneck:
# 1,000,000 x 1,000,000 float64 entries, about 8 TB
dist_matrix = euclidean_distances(X, X)
# perform DBSCAN clustering on the precomputed distances
db = DBSCAN(eps=0.1, min_samples=60, metric="precomputed", n_jobs=-1).fit(dist_matrix)
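One more variant I plan to test, which skips the precomputed matrix entirely and lets DBSCAN answer the neighborhood queries through its internal spatial index (the ball-tree backend; with a small eps the neighborhoods stay small, so this should avoid the n x n allocation):

# same X as above; no distance matrix is materialized
db = DBSCAN(eps=0.1, min_samples=60, algorithm="ball_tree", n_jobs=-1).fit(X)
labels = db.labels_  # -1 marks the outliers/noise points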