I have two dataframes that I need to cluster, and I am trying to do the following:
- Apply PCA to remove outliers, then use PCA with 3 components to visualize the result. For the outlier removal step I keep enough components to explain 97.5% of the variance.
- Inverse transform and compute the MSE between the inverse-transformed dataframes and the original ones.
- Use the upper IQR limit of the calculated MSE to remove the outliers.
- Apply PCA with 3 components to the new dataframe to visualize it and determine the number of clusters.
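
To make the steps above concrete, here is a minimal sketch of the process, assuming scikit-learn and NumPy. The function name `pca_reconstruction_outliers`, the standardization step, the 1.5 multiplier on the IQR, and the synthetic data are all assumptions for illustration, not necessarily what my actual code does.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_reconstruction_outliers(X, variance=0.975, k=1.5):
    """Flag rows whose PCA reconstruction error exceeds the upper IQR limit.

    `variance` is the explained-variance target (97.5% here).
    `k=1.5` is the conventional Tukey multiplier (an assumption).
    """
    # Standardizing before PCA is an assumption; skip it if the
    # features are already on comparable scales.
    X_scaled = StandardScaler().fit_transform(X)

    # A float n_components in (0, 1) keeps just enough components to
    # reach that fraction of explained variance.
    pca = PCA(n_components=variance, svd_solver="full")
    X_rebuilt = pca.inverse_transform(pca.fit_transform(X_scaled))

    # One reconstruction error (MSE) per row.
    mse = np.mean((X_scaled - X_rebuilt) ** 2, axis=1)

    # Upper IQR limit only, since the MSE cannot be negative.
    q1, q3 = np.percentile(mse, [25, 75])
    upper_fence = q3 + k * (q3 - q1)
    return mse > upper_fence, mse, upper_fence


# Hypothetical synthetic data standing in for one of the dataframes.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 10))
X = latent @ loadings + 0.05 * rng.normal(size=(500, 10))
X[:5] += rng.normal(scale=5.0, size=(5, 10))  # a few perturbed rows

is_outlier, mse, upper_fence = pca_reconstruction_outliers(X)
X_clean = X[~is_outlier]

# Second PCA, 3 components, only for the x, y, z visualization.
coords_3d = PCA(n_components=3).fit_transform(
    StandardScaler().fit_transform(X_clean)
)
```

With a pandas dataframe `df`, the same mask can be applied as `df.loc[~is_outlier]`.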
My main issues are:
Is the IQR on the MSE a good criterion for removal?
I have limited it to the upper limit since we are working with absolute (non-negative) values. If it is not a good criterion and I am mixing concepts, what would be a good criterion for this type of transformation?
Or should I drop PCA and go for other outlier detection methods? If so, which?
Ultimately, I still see points very far from the clusters in the x, y, z plot. Does this mean they aren't outliers, just a few scattered, far-away points that form a small cluster? Or is the outlier detection not being effective?
Finally, on the second dataframe the 3D visualization captures only roughly 40% of the variance. Is it fair to apply the same decision-making process?
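
For reference, the share of variance shown in a 3-component plot can be read off the fitted PCA. This is a minimal sketch assuming scikit-learn; `X2` is hypothetical random data standing in for the second dataframe, so the printed value will not match my 40%.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the second dataframe.
rng = np.random.default_rng(1)
X2 = rng.normal(size=(300, 12))

# Fit the 3-component PCA used for the x, y, z plot
# (standardizing first is an assumption).
pca3 = PCA(n_components=3).fit(StandardScaler().fit_transform(X2))

# Fraction of the total variance carried by the three plotted axes.
explained_3d = pca3.explained_variance_ratio_.sum()
print(f"Variance shown in the 3D plot: {explained_3d:.1%}")
```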