
I'm working on a classification problem where I know the labels. I'm comparing two different algorithms, K-Means and DBSCAN. However, the latter has the well-known memory problem when computing the pairwise distances. But if my dataset contains a lot of duplicated samples, can I delete them, count their occurrences, and then use this count as a weight in the algorithm? All of this to save memory.

I do not know how to do it. This is my code:


import numpy as np

df = dimensionality_reduction(dataframe = df_balanced_train)
train = np.array(df.iloc[:, 1:])

### DBSCAN

# DBSCAN does not return centroids
y_dbscan, centroidi = Cluster(data = train, algo = "DBSCAN")
err, colori = error_Cluster(y_dbscan, df)

# These are the functions:

        # DBSCAN algorithm

        #nbrs = NearestNeighbors(n_neighbors= 1500).fit(data)
        #distances, indices = nbrs.kneighbors(data)
        #print("The mean distance is about: " + str(np.mean(distances)))
        #np.median(distances)

        dbscan = DBSCAN(eps=0.9, min_samples=1000, metric="euclidean",
                        n_jobs=1)

        y_result = dbscan.fit_predict(data)
        centroidi = "DBSCAN does not produce centroids"

For a sample of 30k elements everything works fine, but with 800k I always run into memory problems. Could deleting duplicates and counting their occurrences solve my problem?
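
The deduplication and counting part would be roughly the following (untested sketch; it assumes the feature columns are everything after the first column of df, as in the code above). What I am missing is how to feed the resulting counts into DBSCAN as weights:

# Collapse duplicated rows and keep how often each unique row occurred
feature_cols = list(df.columns[1:])
dedup = df.groupby(feature_cols).size().reset_index(name="weight")

train_unique = np.array(dedup[feature_cols])   # unique samples only
weights = np.array(dedup["weight"])            # occurrence count per unique sample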

  • Why not, go for it. It's called preprocessing: as long as you can retain the variables accounting for maximum variance and remove redundant variables, or reduce dimensionality, or do feature selection. These steps will help in giving concise clusters as well as reduce the memory footprint. – mnm Apr 29 '19 at 10:25
  • But how can I drop all the identical rows of a pandas DataFrame and add a new column with the number of occurrences of each row, so that I can use this column as a weight for DBSCAN? – Davide Aureli Apr 29 '19 at 10:51
  • See this [post](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas) on dropping duplicates in Python, and [this one](https://stackoverflow.com/questions/37077898/pandas-dataframe-how-to-add-column-with-number-of-occurrences-in-other-column) for finding the number of occurrences. Google is your best friend. Try to nurture a habit of finding answers on your own, and if the problem persists then ask a question here. And in your post, state all the steps you took to solve the problem, including references like the ones I have given. You're lucky this question is NOT downvoted. – mnm Apr 29 '19 at 14:34
  • Also, do not throw in the code erratically. If you want to add code, ensure the code and its output are reproducible. See this [post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and learn how to create a minimal reproducible example. – mnm Apr 29 '19 at 14:37

1 Answer


DBSCAN should take only O(n) memory - just as k-means.

But apparently the sklearn implementation uses a variant that first computes all neighbors, and thus needs O(n²) memory, which makes it much less scalable. I'd consider this a bug in sklearn, but apparently they are well aware of this limitation, which seems to be mostly a problem when you choose bad parameters. To guarantee O(n) memory, it may be enough to just implement the standard DBSCAN yourself.

Merging duplicates is certainly an option, but (A) that usually means you are using inappropriate data for these algorithms, or at least for this distance function, and (B) you will also need to implement the algorithms yourself to add support for weights, because in DBSCAN you need to use weight sums instead of point counts.
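
To make the "weight sums instead of point counts" part concrete, here is a minimal sketch of a weighted DBSCAN that queries neighbors one point at a time, so the full pairwise neighborhood structure of the sklearn variant is never materialised. All names (weighted_dbscan, X, w) are illustrative, and eps/min_samples are just the values from the question:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_dbscan(X, w, eps=0.9, min_samples=1000):
    # Standard DBSCAN, except core points are decided by the SUM of the
    # weights (occurrence counts) in the eps-neighborhood, not the row count.
    X, w = np.asarray(X), np.asarray(w)
    nn = NearestNeighbors(radius=eps).fit(X)
    labels = np.full(len(X), -1)            # -1 = noise
    visited = np.zeros(len(X), dtype=bool)
    cluster = 0
    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neigh = nn.radius_neighbors(X[i:i + 1], return_distance=False)[0]
        if w[neigh].sum() < min_samples:    # weight sum instead of point count
            continue                        # i is not a core point
        labels[i] = cluster
        seeds = list(neigh)
        k = 0
        while k < len(seeds):               # expand the cluster
            j = seeds[k]
            k += 1
            if labels[j] == -1:
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                j_neigh = nn.radius_neighbors(X[j:j + 1], return_distance=False)[0]
                if w[j_neigh].sum() >= min_samples:
                    seeds.extend(j_neigh)   # j is also a core point: keep growing
        cluster += 1
    return labels

This is much slower than a vectorised implementation because of the per-point radius queries, but it avoids precomputing and storing the neighborhoods of all points at once, which is what blows up the memory in the question.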

Last but not least: if you have labels and a classification problem, these seem to be the wrong choice. They are clustering algorithms, not classifiers. Their job is not to recreate the labels you have, but to find new labels from the data.

Has QUIT--Anony-Mousse
  • Thank you for your answer; however, our idea is to rebuild the classification labels given by the dataset and test which one is better. – Davide Aureli May 08 '19 at 08:18
  • Most certainly the classification label is better, unless it was generated by a similar clustering process. It's pretty much impossible to beat supervised approaches in an unsupervised fashion because you don't know what structures you are looking for and there are usually multiple or none. Don't expect magic. – Has QUIT--Anony-Mousse May 16 '19 at 05:39