
My task is to read data out of a .csv file and form clusters. My code works fine on a small .csv file, but when I try to read the actual file I have to work on (about 24k lines), my computer hangs, disk usage shoots to 100%, and I have to restart the system. I am at a dead end and have no idea what is happening. The DBSCAN code is the same as the demo on the sklearn site; the code for reading the data I wrote myself.

import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn import metrics
#from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler


def dbFun(_x, _original_vals):
    db = DBSCAN(eps=0.3, min_samples=20).fit(_x)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True

    labels = db.labels_
    # Noise points are labelled -1; exclude them from the cluster count.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    print('Estimated number of clusters: %d' % n_clusters_)
    print("Wait, plotting clusters...")
    plotCluster(_x, labels, core_samples_mask, n_clusters_)
    return


def plotCluster(_x, labels, core_samples_mask, n_clusters_):
    unique_labels = set(labels)
    colors = [plt.cm.Spectral(each)
              for each in np.linspace(0, 1, len(unique_labels))]
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = [0, 0, 0, 1]

        class_member_mask = (labels == k)

        # Core samples of this cluster: large markers.
        xy = _x[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=14)

        # Non-core (border) samples: small markers.
        xy = _x[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=6)

    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.show()
    return

_val = []

# Read three columns from the CSV; values come in as strings and are
# converted to float32 below.
with open('C:/Users/hp 5th/Desktop/new1.csv', 'r', newline='') as inp:
    rd = csv.reader(inp)
    for row in rd:
        _val.append([row[1], row[2], row[0]])

_val = np.asarray(_val)
_val_original = _val.astype('float32')
_val = StandardScaler().fit_transform(_val_original)

dbFun(_val, _val_original)

1 Answer


This is called "swapping": you have too little memory, so the OS pages to disk, which is why your disk usage hits 100%.

The sklearn DBSCAN implementation has worst-case O(n²) memory use, because it can end up materializing all pairwise neighborhoods.

Use ELKI with an index instead. It needs much less memory than sklearn.
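For reference, a minimal sketch of what an ELKI command-line run might look like (the jar name and output directory are assumptions, and the exact flags depend on your ELKI version, so check them against the ELKI documentation):

    java -jar elki-bundle.jar KDDCLIApplication \
        -dbc.in new1.csv \
        -algorithm clustering.DBSCAN \
        -dbscan.epsilon 0.3 \
        -dbscan.minpts 20 \
        -resulthandler ResultWriter -out dbscan_output/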

– Has QUIT--Anony-Mousse
  • Yes, I guess that is it, because the rest of the code is pretty straightforward. I will definitely look into ELKI. – Bahtawar Z Jun 23 '17 at 11:37
  • However, I am not familiar with Java. Is there any way I could do this in Python? Moreover, this task is just part of a bigger project of mine, and I don't know how to integrate ELKI with the rest of it. – Bahtawar Z Jun 23 '17 at 11:43
  • You can just use the "subprocess" module, and use CSV files. – Has QUIT--Anony-Mousse Jun 23 '17 at 20:38
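A minimal sketch of that subprocess approach, assuming the ELKI command shown in the answer and a hypothetical elki-bundle.jar in the working directory (both are assumptions to verify for your ELKI version):

    import subprocess

    # Run ELKI's DBSCAN from Python; the jar name and flags below are
    # assumptions and must match your ELKI version.
    cmd = [
        "java", "-jar", "elki-bundle.jar", "KDDCLIApplication",
        "-dbc.in", "new1.csv",             # numeric input CSV
        "-algorithm", "clustering.DBSCAN",
        "-dbscan.epsilon", "0.3",
        "-dbscan.minpts", "20",
        "-resulthandler", "ResultWriter",
        "-out", "dbscan_output",           # ELKI writes per-cluster text files here
    ]
    subprocess.run(cmd, check=True)        # raises CalledProcessError on failure

    # The cluster files in dbscan_output/ can then be parsed back into
    # Python and joined with the rest of the project.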