How do you cluster a subset of a big dataset? I have a large dataset of ~200,000 high-dimensional points. There are roughly ~25,000 meaningful combinations (subsets) of these points, each containing about 10-200 points, and I would like to assess the clustering properties of each combination. I have used UMAP to reduce the data to 2D, so analyzing the UMAP embedding would be acceptable, but working directly on the original high-dimensional data would be better.
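For reference, my setup looks roughly like this (`X`, `combo_indices`, the file names, and the UMAP parameters are just placeholders for my actual data and settings):

```python
import numpy as np
import umap  # umap-learn

# X: (~200000, n_features) array of high-dimensional points (placeholder)
X = np.load("points.npy")

# Embed the full dataset once, so every combination can be inspected
# within the same 2D embedding.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X)

# combo_indices: list of ~25000 index arrays, each selecting the 10-200
# points that form one meaningful combination (placeholder).
combo_indices = np.load("combinations.npy", allow_pickle=True)

subset_2d = embedding[combo_indices[0]]  # one combination in the 2D embedding
subset_hd = X[combo_indices[0]]          # the same combination in the original space
```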
Traditional clustering methods (k-means, hierarchical clustering, DBSCAN) do not capture what I consider a cluster here: each combination occupies only a small region of the space rather than the whole space, even in 2D, and with so few points per combination these methods tend to cluster poorly, often reporting multiple clusters where the extra points are really just outliers. I have made some progress with the level-set tree method in that regard, but its behavior is not always controllable (it only works well for fairly typical cases). Are there any methods you would suggest? A sketch of what I have tried per combination is below.
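To make the failure mode concrete, this is roughly how I run DBSCAN on each combination (the `eps` and `min_samples` values here are arbitrary placeholders): an `eps` chosen for the density of the overall dataset tends to either merge a 10-200 point subset into one blob or fragment it into several tiny "clusters" that are really noise.

```python
from sklearn.cluster import DBSCAN

def cluster_combination(points_2d, eps=0.5, min_samples=5):
    """Cluster one combination's points in the 2D embedding.

    Returns DBSCAN labels (-1 = noise). With only 10-200 points, a
    globally chosen eps often produces spurious micro-clusters instead
    of flagging the stray points as outliers.
    """
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_2d)

labels = cluster_combination(embedding[combo_indices[0]])
```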