I'm clustering data with DBSCAN in order to remove outliers. The computation is very memory-intensive because the implementation of DBSCAN in scikit-learn can't handle almost 1 GB of data. The problem was already mentioned here
The bottleneck of the code below appears to be the calculation of the distance matrix, which is extremely memory-consuming: my full data set would give a roughly 10 million x 10 million matrix, and even the 1,000,000-point sample below implies a 1,000,000 x 1,000,000 float64 matrix, about 8 TB. Is there a way to optimize the computation of DBSCAN?
My brief research suggests that the matrix would have to be reduced to a sparse matrix in some way to make the computation feasible.
My ideas for solving this problem:
- create and compute a sparse distance matrix (see the first sketch below)
- compute the matrix in blocks, save them to files, and merge them later (second sketch below)
- perform DBSCAN on small subsets of the data and merge the results (third sketch below)
- switch to Java and use the ELKI tool
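For the sparse idea (first sketch), this is roughly what I have in mind: build only the neighborhoods within eps using radius_neighbors_graph, which returns a SciPy CSR matrix, and pass it to DBSCAN as a precomputed sparse distance matrix; as far as I can tell, scikit-learn then treats only the stored entries as candidate neighbors. The data and sizes here are stand-ins:

import numpy as np
from sklearn.neighbors import radius_neighbors_graph
from sklearn.cluster import DBSCAN
# stand-in for the scaled data from the code below
X = np.random.uniform(size=(100000, 2))
eps = 0.1
# CSR matrix storing only distances to neighbors within eps, so memory
# scales with the number of neighbor pairs instead of n^2
graph = radius_neighbors_graph(X, radius=eps, mode="distance",
                               include_self=False, n_jobs=-1)
# DBSCAN accepts a sparse precomputed matrix; only stored entries are
# considered as possible neighbors
db = DBSCAN(eps=eps, min_samples=60, metric="precomputed",
            n_jobs=-1).fit(graph)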
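For the block-wise idea (second sketch), a minimal version that streams stripes of the matrix into a single memory-mapped file instead of many separate files; the block size is a placeholder to tune against the available RAM. Note this only moves the n^2 cost from RAM to disk, so it defers the problem rather than solving it:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
X = np.random.uniform(size=(20000, 2))  # stand-in data
n = X.shape[0]
block = 5000  # placeholder block size
# on-disk matrix instead of an in-RAM one (still n x n float32, ~1.6 GB here)
dist = np.memmap("dist.dat", dtype="float32", mode="w+", shape=(n, n))
for start in range(0, n, block):
    stop = min(start + block, n)
    # compute one horizontal stripe of the distance matrix at a time
    dist[start:stop] = euclidean_distances(X[start:stop], X)
dist.flush()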
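For the subset idea (third sketch), the closest thing to a merge I can come up with is a heuristic: run DBSCAN on a random subsample, then give every remaining point the label of its nearest core sample if that core sample lies within eps, and mark it as noise otherwise. This is an approximation, not a true merge of DBSCAN runs, and min_samples has to be re-tuned because subsampling lowers the density (the values below are placeholders):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
X = np.random.uniform(size=(100000, 2))  # stand-in data
eps = 0.1
# cluster a manageable random subsample
rng = np.random.RandomState(0)
sample_idx = rng.choice(len(X), size=20000, replace=False)
db = DBSCAN(eps=eps, min_samples=15).fit(X[sample_idx])
# label each point by its nearest core sample, or -1 (noise) if even
# the nearest core sample is farther away than eps
core = X[sample_idx][db.core_sample_indices_]
core_labels = db.labels_[db.core_sample_indices_]
nn = NearestNeighbors(n_neighbors=1).fit(core)
dist, ind = nn.kneighbors(X)
labels = np.where(dist.ravel() <= eps, core_labels[ind.ravel()], -1)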
Code:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances
# sample data
speed = np.random.uniform(0, 25, 1000000)
power = np.random.uniform(0, 3000, 1000000)
# create a dataframe
data_dict = {'speed': speed,
             'power': power}
df = pd.DataFrame(data_dict)
# convert to a float64 array (.values instead of the deprecated as_matrix())
X = df.values.astype("float64", copy=False)
# normalize the data to zero mean and unit variance
X = StandardScaler().fit_transform(X)
# precompute the matrix of pairwise distances -- this is the bottleneck:
# 1,000,000 x 1,000,000 float64 entries, about 8 TB
dist_matrix = euclidean_distances(X, X)
# perform DBSCAN clustering on the precomputed distances
db = DBSCAN(eps=0.1, min_samples=60, metric="precomputed", n_jobs=-1).fit(dist_matrix)
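One more variant I plan to test, which skips the precomputed matrix entirely and lets DBSCAN answer the neighborhood queries through its internal spatial index (the ball-tree backend; with a small eps the neighborhoods stay small, so this should avoid the n x n allocation):

# same X as above; no distance matrix is materialized
db = DBSCAN(eps=0.1, min_samples=60, algorithm="ball_tree", n_jobs=-1).fit(X)
labels = db.labels_  # -1 marks the outliers/noise points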