I'm working on a classification problem where I know the labels. I'm comparing two algorithms, K-Means and DBSCAN. However, DBSCAN has the well-known memory problem when computing the distances between samples. If my dataset contains many duplicated samples, can I delete them, count their occurrences, and then pass those counts to the algorithm as weights? The goal is simply to save memory.
I do not know how to do this. This is my code:
import numpy as np

df = dimensionality_reduction(dataframe=df_balanced_train)
# Use every column except the first one as features
train = np.array(df.iloc[:, 1:])
### DBSCAN
# DBSCAN has no centroids, so `centroidi` is only a placeholder here
y_dbscan, centroidi = Cluster(data=train, algo="DBSCAN")
err, colori = error_Cluster(y_dbscan, df)
# This is the DBSCAN part of the Cluster function:
from sklearn.cluster import DBSCAN

# DBSCAN algorithm
# Commented-out check of the neighbour distances, used to pick eps:
# nbrs = NearestNeighbors(n_neighbors=1500).fit(data)
# distances, indices = nbrs.kneighbors(data)
# print("The mean distance is about : " + str(np.mean(distances)))
# np.median(distances)
dbscan = DBSCAN(eps=0.9, min_samples=1000, metric="euclidean", n_jobs=1)
y_result = dbscan.fit_predict(data)
centroidi = "In DBSCAN there are not Centroids"
For a sample of 30k elements everything works, but with 800k elements I always run into the memory problem. Could deleting the duplicates and counting their occurrences solve it?
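The idea described above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: it relies on the `sample_weight` argument that scikit-learn's `DBSCAN.fit_predict` accepts, it only helps if rows are exactly identical (after a dimensionality reduction the floats may differ slightly, so deduplicating before that step or rounding may be needed), and the helper name `dbscan_on_unique_rows` and the toy data are hypothetical, not part of my real code.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def dbscan_on_unique_rows(data, eps, min_samples):
    """Run DBSCAN on deduplicated rows, weighting each row by its count.

    Hypothetical helper for illustration; not part of the original code.
    """
    # Collapse identical rows; `inverse` maps each original row to its unique row.
    unique_rows, inverse, counts = np.unique(
        data, axis=0, return_inverse=True, return_counts=True
    )
    # Some NumPy versions return a 2-D inverse when axis=0, so flatten it.
    inverse = np.ravel(inverse)

    dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric="euclidean", n_jobs=1)
    # A row that occurred k times counts as k samples when density is checked
    # against min_samples.
    unique_labels = dbscan.fit_predict(unique_rows, sample_weight=counts)

    # Expand the labels back to the original row order.
    return unique_labels[inverse], unique_rows.shape[0]


# Toy data (assumption): two dense groups of repeated rows and one isolated point.
toy = np.array([[0.0, 0.0]] * 5 + [[0.1, 0.0]] * 3 + [[5.0, 5.0]] * 4 + [[9.0, 9.0]])
labels, n_unique = dbscan_on_unique_rows(toy, eps=0.5, min_samples=3)
print(n_unique, labels)
```

On the toy data only 4 unique rows are clustered instead of 13, and the expanded labels match what plain DBSCAN gives on the full array. How much memory this saves on the 800k samples depends on how many rows are truly duplicated and on how many neighbours fall within `eps` of each remaining row.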