
I am trying to use the sklearn DBSCAN implementation for anomaly detection. It works fine for a small dataset (500 x 6), but it runs into memory issues when I try to use a large dataset (180000 x 24). Is there something I can do to overcome this issue?

from sklearn.cluster import DBSCAN
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

data = pd.read_csv("dataset.csv")
# Drop non-continuous variables
data.drop(["x1", "x2"], axis = 1, inplace = True)
df = data

data = df.to_numpy().astype("float32", copy = False)

stscaler = StandardScaler().fit(data)
data = stscaler.transform(data)

print "Dataset size:", df.shape

dbsc = DBSCAN(eps = 3, min_samples = 30).fit(data)

labels = dbsc.labels_
core_samples = np.zeros_like(labels, dtype = bool)
core_samples[dbsc.core_sample_indices_] = True

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)

df['Labels'] = labels.tolist()

# print(df.head(10))

print "Number of anomalies:", -1 * (df[df.Labels < 0]['Labels'].sum())
Nira
  • Possible duplicate of [scikit-learn DBSCAN memory usage](http://stackoverflow.com/questions/16381577/scikit-learn-dbscan-memory-usage) – Has QUIT--Anony-Mousse Sep 06 '16 at 19:31
  • Unfortunately, the sklearn implementation is worst-case O(n^2) (this is *not* standard DBSCAN but due to vectorization for sklearn; e.g. ELKI only uses O(n) memory). You can either use a low-memory implementation, add more memory, or **try using a smaller eps**. 3 on standardized data looks much too large! – Has QUIT--Anony-Mousse Sep 06 '16 at 19:34
  • Okay. Let me try different parameters. Thanks for the response. I am hoping that there is some python implementation which is efficient before I try ELKI or R. – Nira Sep 07 '16 at 03:54
  • I changed the parameters to dbsc = DBSCAN(eps = 1, min_samples = 15).fit(data). It takes 10 GB of memory and 25 min, but works fine. Thanks again. – Nira Sep 08 '16 at 02:40
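
(For reference, a minimal sketch of the adjusted call described in the comment above, with the rest of the question's script unchanged; data is the standardized array from the question.)

# Smaller eps/min_samples keep each point's stored neighborhood small,
# which is what limits memory in the sklearn implementation.
dbsc = DBSCAN(eps = 1, min_samples = 15).fit(data)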

1 Answer


Depending on the type of problem you are tackling, you could play around with this parameter in the DBSCAN constructor:

leaf_size : int, optional (default = 30) Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
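
(As an illustration, not part of the original answer: leaf_size is passed directly to the DBSCAN constructor alongside the tree algorithm; the values below are placeholders to experiment with, and data is the standardized array from the question.)

from sklearn.cluster import DBSCAN

# Force a ball tree and try a larger leaf size; fewer, larger leaves change
# the memory/speed trade-off of building and querying the tree.
dbsc = DBSCAN(eps = 1, min_samples = 15, algorithm = "ball_tree", leaf_size = 100).fit(data)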

If that does not suit your needs, this question has already been addressed in the linked duplicate: you can try to use ELKI's DBSCAN implementation.

Guiem Bosch
  • Thanks. I did see the other Stack Overflow question, but I was hoping that the issue had been fixed in the latest scikit-learn. – Nira Sep 07 '16 at 03:56
  • leaf_size probably has little effect on the memory cost. The sklearn problem is because it first computes and *stores* all neighbors of all points, then runs DBSCAN. That is why a large eps is problematic. – Has QUIT--Anony-Mousse Sep 07 '16 at 05:19
  • ELKI needs improvement in documentation. – StatguyUser Jan 27 '18 at 08:02
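
(A memory-conscious variant along the lines of the comment about stored neighbors; this is a sketch, not from the thread. It precomputes a sparse radius-neighbors graph so that only distances within eps are kept, then hands it to DBSCAN with metric = "precomputed". The eps/min_samples values are simply the ones reported to work above, and data is the standardized array from the question.)

from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import numpy as np

eps, min_samples = 1, 15

# Keep only pairwise distances <= eps in a sparse matrix instead of
# materialising a dense n x n distance matrix.
nn = NearestNeighbors(radius = eps).fit(data)
dist_graph = nn.radius_neighbors_graph(data, mode = "distance")

# DBSCAN accepts a precomputed sparse neighborhood graph.
dbsc = DBSCAN(eps = eps, min_samples = min_samples, metric = "precomputed").fit(dist_graph)
print("Number of anomalies:", int(np.sum(dbsc.labels_ == -1)))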