How to Systematically Tune UMAP Hyperparameters for Supervised Learning

Question

I have a question about using Uniform Manifold Approximation and Projection (UMAP) for feature extraction.

In my project I am using two tabular datasets, each containing around 10,000 samples. One has 20 features and the other has 550 features.

My goal is to apply UMAP to each of these datasets and extract predictive features for a binary classification task in which I have a label for every sample. The features extracted by UMAP will be used as input to classifiers such as random forest, XGBoost, and elastic net. The hyperparameters of these classifiers are tuned based on the AUROC score on the validation data.

What I want to know is which UMAP hyperparameter values would be optimal for each of my datasets separately, so that the UMAP outputs are as predictive as possible for my task.

Is there an approach or a metric I can check while doing a grid search over the UMAP hyperparameters? I also considered combining UMAP with the classifiers and tuning them together based on the AUROC score calculated on the validation data, but that increases the tuning time a lot because the number of combinations grows.
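To make the cost concern concrete, here is a minimal sketch of how quickly a joint grid grows, and of random sampling as one possible way to cap the budget. The grids are hypothetical and the values are illustrative, not recommendations; `n_neighbors`, `min_dist`, and `n_components` are UMAP parameters, and the classifier keys are random forest parameters from scikit-learn. The names `expand`, `umap_grid`, `clf_grid`, and `budget` are made up for this example.

```python
import itertools
import random

# Hypothetical search grids (illustrative values only).
umap_grid = {
    "n_neighbors": [5, 15, 50, 200],
    "min_dist": [0.0, 0.1, 0.5],
    "n_components": [2, 5, 10, 20],
}
clf_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", 0.5],
}


def expand(grid):
    """Return every combination in a grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in itertools.product(*grid.values())]


umap_configs = expand(umap_grid)  # 4 * 3 * 4 = 48 UMAP settings
clf_configs = expand(clf_grid)    # 2 * 3 * 2 = 12 classifier settings

# Tuning both together multiplies the two counts: 48 * 12 = 576 fits per classifier type.
joint = [(u, c) for u in umap_configs for c in clf_configs]

# One way to bound the cost: evaluate only a fixed-size random sample of the joint grid.
rng = random.Random(0)
budget = 40
sampled = rng.sample(joint, budget)

print(len(umap_configs), len(clf_configs), len(joint), len(sampled))
# prints: 48 12 576 40
```

Since a UMAP embedding does not depend on the classifier settings, each of the 48 embeddings would only need to be computed once and could then be reused across all classifier configurations, which may recover part of the cost.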

It would be great to hear from anyone who has found an approach for a similar use of UMAP.

Thanks in advance for your help!

To get an idea of the UMAP output, I applied hierarchical clustering to it and checked the silhouette score to see whether the embedding is clusterable. However, even in clusterings with high silhouette scores, the clusters did not separate the labels clearly.
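The mismatch described above can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the setup from the question: `embedding` is a stand-in for a UMAP output (three synthetic blobs from scikit-learn's `make_blobs`), and the binary labels `y` are drawn at random, so they are unrelated to the cluster structure by construction. The silhouette score is computed without the labels, so it can be high while agreement with the labels (measured here with the adjusted Rand index) stays near zero.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Assumption: a 2-D array standing in for a UMAP embedding (three well-separated blobs).
embedding, _ = make_blobs(
    n_samples=600,
    centers=[[-10, 0], [0, 10], [10, 0]],
    cluster_std=1.0,
    random_state=0,
)

# Assumption: binary labels assigned at random, independent of the blobs.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=embedding.shape[0])

# Hierarchical (agglomerative, Ward linkage by default) clustering of the embedding.
clusters = AgglomerativeClustering(n_clusters=3).fit_predict(embedding)

# Silhouette uses only the geometry of the embedding and the cluster assignment.
sil = silhouette_score(embedding, clusters)

# Adjusted Rand index compares the clusters with the class labels.
ari = adjusted_rand_score(y, clusters)

print(f"silhouette: {sil:.2f}, ARI vs labels: {ari:.2f}")
```

A label-aware score computed on the embedding, such as the validation AUROC of a simple classifier, would be one possible alternative to the silhouette score when comparing UMAP settings for a supervised task.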

