kmeans on a large dataset by running multiple subsamples

Question

I have a dataset with more than 50 million rows and 2 columns, which I want to split into 4 clusters with kmeans. I keep running into memory issues (unexplained RStudio and PC crashes) when using kmeans. I tried bigkmeans instead, but I get a std::bad_alloc error.

So next I would like to draw, say, 5 or 10 random samples of about 2 million rows each, run kmeans on each sample, and put the results into a single data frame.

There is probably an elegant way to do this with apply or something similar, but I am not familiar with it, so I am looking for some help.

Here is how I would do this once.

df_sample <- df[sample(nrow(df), 2000000), ]

k4_s1 <- kmeans(df_sample, centers = 4, iter.max = 50, nstart = 50)

I could put it in a for loop but there is probably something more efficient and any help is appreciated.
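For illustration, the repeated version might look roughly like the sketch below. It is only one possible shape, not a definitive solution. The small simulated `df`, the helper name `run_one`, and the values of `n_samples`, `sample_size` and `nstart` are assumptions chosen so the example runs quickly; with the real data the sample size would be 2000000 as above. Note that `lapply` is mainly tidier than a for loop here, not faster, since almost all of the time is spent inside `kmeans`.

```r
# Hypothetical stand-in for the real 50-million-row data frame `df`.
set.seed(42)
df <- data.frame(x = rnorm(20000), y = rnorm(20000))

n_samples   <- 5     # number of subsamples (assumption)
sample_size <- 2000  # 2000000 in the question; kept small for the sketch

# Run kmeans on one random subsample and return its centroids as a data frame.
run_one <- function(i) {
  idx <- sample(nrow(df), sample_size)
  fit <- kmeans(df[idx, ], centers = 4, iter.max = 50, nstart = 5)
  centers <- as.data.frame(fit$centers)
  cbind(sample = i, cluster = seq_len(nrow(centers)), centers, size = fit$size)
}

# One row per (sample, cluster): the centroid coordinates and the cluster size.
results <- do.call(rbind, lapply(seq_len(n_samples), run_one))
rownames(results) <- NULL
```

The cluster numbers in `results` are arbitrary within each sample, so cluster 1 of one sample does not necessarily correspond to cluster 1 of another.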

You will get better help if you format your code using the ` symbol and provide a reproducible example; see [how-to-make-a-great-r-reproducible-example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — cbo, Oct 29 '19 at 16:30
If you have so many points maybe quantile estimation for your groups is a better approach. — cbo, Oct 29 '19 at 16:58
In each sample, the 4 groups will be defined differently, although they will probably be similar. You cannot combine the groups directly, since sample 1 group 1 may be most similar to sample 2 group 3. You need the centroids of each group in each sample to see whether they are similar. What are you planning to do with the four groups? Is this just a convenient way to subdivide the data? — dcarlson, Oct 29 '19 at 21:51
I can sort the 4 centroids by the sum of their z-scores or something similar, and that should give me aligned centroids across the various samples. I was thinking of simply averaging at that point. — NoNameMLer, Nov 15 '19 at 15:12
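
A minimal sketch of the sort-then-average idea from the last comment, under stated assumptions: the simulated `df` with four well-separated groups is hypothetical, and `true_centers`, `mu`, `sdv`, `aligned` and `avg_centers` are illustrative names. The z-scores use the mean and standard deviation of the full data so that they are comparable across samples. Ordering by the z-score sum is a heuristic and will misalign clusters whenever two centroids have nearly equal sums.

```r
set.seed(1)
# Hypothetical data: four well-separated groups in two columns.
true_centers <- rbind(c(0, 0), c(5, 5), c(10, 0), c(15, 5))
grp <- sample(4, 20000, replace = TRUE)
df <- data.frame(x = rnorm(20000, mean = true_centers[grp, 1]),
                 y = rnorm(20000, mean = true_centers[grp, 2]))

# Column means and standard deviations of the full data, used for z-scores.
mu  <- colMeans(df)
sdv <- sapply(df, sd)

# Fit kmeans on each subsample, then order its centroids by z-score sum.
aligned <- lapply(1:5, function(i) {
  idx <- sample(nrow(df), 2000)
  fit <- kmeans(df[idx, ], centers = 4, iter.max = 50, nstart = 10)
  z <- scale(fit$centers, center = mu, scale = sdv)
  fit$centers[order(rowSums(z)), , drop = FALSE]
})

# Element-wise average of the aligned centroid matrices.
avg_centers <- Reduce(`+`, aligned) / length(aligned)
```

Before averaging, it is worth checking that the aligned centroids really are close to each other across samples, as dcarlson's comment suggests, since a mismatched pair would pull the average toward a point that represents neither cluster.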
