I have two dataframes that I need to cluster, and I am trying to do the following:

  1. Apply PCA to remove outliers, then use PCA with 3 components to visualize the data. For the outlier removal step I keep enough components to explain 97.5% of the variance.
  2. Inverse transform and compute the MSE between the inverse-transformed dataframes and the original ones.
  3. Use the upper IQR limit on the calculated MSE to remove the outliers.
  4. Apply PCA with 3 components to the new dataframe to visualize it and determine the number of clusters.
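For readers who want to try steps 1 to 3, here is a minimal sketch. It assumes plain NumPy (PCA via SVD) rather than whichever PCA implementation was actually used, and the helper names `pca_reconstruction_mse` and `upper_iqr_mask`, the 1.5 IQR factor, and the synthetic data are all illustrative assumptions, not details from the question.

```python
import numpy as np

def pca_reconstruction_mse(X, var_threshold=0.975):
    """Per-row MSE between X and its reconstruction from the leading PCs."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data: the rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    # Smallest number of components whose cumulative variance reaches the threshold.
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    k = min(k, len(S))
    components = Vt[:k]
    scores = Xc @ components.T             # forward transform
    X_hat = scores @ components + mean     # inverse transform
    return np.mean((X - X_hat) ** 2, axis=1), k

def upper_iqr_mask(values, factor=1.5):
    """True where a value lies above Q3 + factor * IQR (factor is an assumption)."""
    q1, q3 = np.percentile(values, [25, 75])
    return values > q3 + factor * (q3 - q1)

# Synthetic stand-in for one dataframe: 3 latent factors spread over 9 features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = np.tile(np.eye(3), (1, 3))        # shape (3, 9)
X = latent @ mixing + 0.05 * rng.normal(size=(300, 9))
X[:5] += rng.normal(scale=1.0, size=(5, 9))  # five rows pushed off the main subspace

mse, k = pca_reconstruction_mse(X)
is_outlier = upper_iqr_mask(mse)
X_clean = X[~is_outlier]
print(k, int(is_outlier.sum()), X_clean.shape)
```

Note that this criterion only measures distance from the retained subspace: a row that is extreme along a retained component reconstructs well and is not flagged.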

My main issues are:

Is the IQR on the MSE a good criterion for removal?

I have limited it to the upper bracket since we are working with absolute values. If it is not a good criterion and I am mixing concepts, what would be a good criterion for this type of transformation?

Or should I drop PCA and go for other methods of outlier detection? If so, which?

Ultimately, I still see points very far from the clusters in the x, y, z plot. Does this mean they aren't outliers, just a few scattered far-away points that represent a small cluster? Or is the outlier detection not effective?

Finally, on the second dataframe the 3D visualization has roughly 40% explained variance. Is it fair to apply the same decision-making process?

Ricardo

1 Answer

The pca library provides functionality that can be of use for visualization, outlier detection, and exploring explained variance. In general, the Hotelling T2 test and SPE/DmodX are techniques used to detect outliers when using PCA. A previous post on outlier detection can be found here: https://stackoverflow.com/a/63043840/13730780
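To make the two statistics concrete, here is a rough sketch in plain NumPy. It does not use the pca library's API. The function name `t2_and_spe`, the toy data, and the empirical 99% quantile cutoffs are assumptions for illustration; in practice the cutoffs are usually derived from distributional approximations (for example the F-distribution for T2) rather than from a quantile of the data itself.

```python
import numpy as np

def t2_and_spe(X, n_components):
    """Hotelling T2 (distance inside the PCA subspace) and SPE (distance from it)."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                               # loadings, (features, k)
    scores = Xc @ P                                       # (samples, k)
    score_var = S[:n_components] ** 2 / (X.shape[0] - 1)  # variance of each score
    t2 = np.sum(scores ** 2 / score_var, axis=1)
    residual = Xc - scores @ P.T
    spe = np.sum(residual ** 2, axis=1)                   # squared prediction error
    return t2, spe

# Toy data: 2 latent factors spread over 6 features, plus small noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
latent[0] = [10.0, -10.0]                 # extreme, but inside the subspace
mixing = np.array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))
X[1] += 3.0 * np.array([1.0, 0.0, -1.0, 0.0, 0.0, 0.0])  # orthogonal to the subspace

t2, spe = t2_and_spe(X, n_components=2)
t2_flag = t2 > np.quantile(t2, 0.99)      # hypothetical empirical cutoff
spe_flag = spe > np.quantile(spe, 0.99)   # hypothetical empirical cutoff
print(np.flatnonzero(t2_flag), np.flatnonzero(spe_flag))
```

The two statistics are complementary: row 0 is extreme along the retained components and stands out in T2, while row 1 departs from the subspace and stands out in SPE. A reconstruction-error criterion on its own corresponds only to the SPE side.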

More generally, how best to detect outliers depends on the type of data you have (continuous, categorical, one-hot, mixed) and on whether you want or need to include context. If your approach is clustering-based, you can try the clusteval library, which includes methods such as DBSCAN.
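As an illustration of the clustering route, the sketch below uses scikit-learn's `DBSCAN` directly instead of clusteval, since it shows the idea with fewer moving parts. The synthetic blobs and the `eps` and `min_samples` values are assumptions chosen for this toy data and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus two isolated points (synthetic, for illustration only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(100, 2))
isolated = np.array([[10.0, -10.0], [-10.0, 10.0]])
X = np.vstack([blob_a, blob_b, isolated])

# DBSCAN assigns the label -1 to points that belong to no dense region.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
noise_mask = labels == -1
n_clusters = len(set(labels) - {-1})
print(n_clusters, int(noise_mask.sum()))
```

Points labelled -1 are candidates for outliers, while a small but dense group far from the rest comes out as its own cluster, which speaks to the question of whether distant points are outliers or a small cluster.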

erdogant
    I ended up migrating to UMAP for outlier detection and concluded that neither of my datasets had meaningful outliers. However, since this question arose from a clustering problem and I migrated from K-means to K-medoids, outliers lost relevance as well. – Ricardo Jul 24 '20 at 13:00