
Terminology:

Component: principal component (PC)

loading-score[i,j]: the loading of feature j in PC[i]

Question:

I know questions about feature selection have been asked several times here on StackOverflow (SO) and on other tech pages, with different answers and discussions proposed. That is why I want to open a discussion about the different solutions rather than post it as a general question, since that has already been done.

Different methods are proposed for feature selection using PCA. For instance, one approach uses the dot product between the original features and the components (here) to get their correlation; a discussion at SO here suggests that you can only talk about feature importance as loading scores within a component (and not carry that importance back to the input space); and another discussion at SO (which I cannot find at the moment) suggests that the importance of feature[j] would be sum(abs(loading_score[:,j])), i.e. the sum of the absolute values of loading_score[i,j] over all components i.

I personally would think that a way to get the importance of a feature would be an absolute sum where each loading_score[i,j] is weighted by the explained variance of component i, i.e.

imp_feature[j] = sum_i(abs(loading_score[i,j]) * explained_variance[i])
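
As a concrete sketch (assuming scikit-learn/NumPy; the data and the name `imp_feature` are purely illustrative), the plain absolute sum and the variance-weighted variant could be computed like this:

```python
# Minimal sketch, assuming scikit-learn and NumPy; placeholder data.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 5)                       # placeholder data, 5 features

pca = PCA(n_components=5).fit(X)
loading_score = pca.components_                  # shape (n_components, n_features)
explained_variance = pca.explained_variance_ratio_

# Plain absolute sum over components (one of the suggestions above)
imp_abs_sum = np.abs(loading_score).sum(axis=0)

# The weighted variant proposed here: |loading_score[i, j]| weighted by the
# explained variance of component i, summed over i
imp_feature = (np.abs(loading_score) * explained_variance[:, None]).sum(axis=0)
```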


1 Answer


Well, there is no universal way to select features; it depends entirely on the dataset and on whatever insight is available about it. I will provide some examples which might be helpful.

Since you asked about PCA: it extracts components one after another, each orthogonal to the previous ones and chosen to capture as much of the remaining variance as possible. On the other hand, ICA (Independent Component Analysis) is able to extract multiple statistically independent features simultaneously. Look at this example:

[Figure: three mixed signals separated with ICA and with PCA]

In this example, we mix three independent signals and try to separate them using ICA and PCA. In this case, ICA does a better job than PCA. In general, if you search for Blind Source Separation (BSS) you will find more information about this. Note that in this example we know the number of independent components, so separation is easy; in general we do not know it, and you may have to guess based on some prior information about the dataset. You may also use LDA (Linear Discriminant Analysis) to reduce the number of features.
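
For illustration, here is a minimal sketch of this kind of mixed-signal (BSS) experiment, assuming scikit-learn; the three source signals and the mixing matrix are made up for the example:

```python
# Minimal sketch, assuming scikit-learn; sources and mixing matrix are illustrative.
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                          # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                 # source 2: square wave
s3 = 2 * (t % 1) - 1                        # source 3: sawtooth
S = np.c_[s1, s2, s3] + 0.1 * rng.normal(size=(2000, 3))  # noisy sources

A = np.array([[1.0, 1.0, 1.0],
              [0.5, 2.0, 1.0],
              [1.5, 1.0, 2.0]])             # mixing matrix
X = S @ A.T                                 # observed mixed signals

S_ica = FastICA(n_components=3, random_state=0).fit_transform(X)  # ICA estimate of the sources
S_pca = PCA(n_components=3).fit_transform(X)                      # PCA: orthogonal variance directions
```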

Once you have extracted the components using any of these techniques, you can visualize them in the following way, treating the extracted components as random variables, i.e., x, y, z:

[Two figures: visualizations of the extracted components x, y, z]

For more information you may refer to the original source from which I took the two figures above.
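
A minimal sketch of that kind of visualization, assuming scikit-learn and matplotlib (the data here is just a placeholder, and the 3D scatter is one possible way to look at the components as random variables):

```python
# Minimal sketch, assuming scikit-learn and matplotlib; placeholder data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.normal(size=(500, 5))                    # placeholder dataset, 5 features

scores = PCA(n_components=3).fit_transform(data)    # each sample's score on PC 1-3
x, y, z = scores[:, 0], scores[:, 1], scores[:, 2]  # treat the components as random variables

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(x, y, z, s=5)
ax.set_xlabel("x = PC 1")
ax.set_ylabel("y = PC 2")
ax.set_zlabel("z = PC 3")
plt.show()
```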

Coming back to your proposition,

imp_feature[j] = sum_i(abs(loading_score[i,j]) * explained_variance[i])

I would not recommend this for the following reasons: by taking abs(loading_score[i,j]) you may lose the positive or negative correlations of the considered features, and while explained_variance[i] may be used to find the correlation between features, multiplying the two does not make sense to me.

Edit: In PCA, each component has its own explained variance. The explained variance is the ratio between an individual component's variance and the total variance (the sum of all individual component variances). Feature significance can be measured by the magnitude of the explained variance.
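
A minimal sketch of that definition, assuming scikit-learn (note it matches `explained_variance_ratio_` exactly only when all components are kept, as here):

```python
# Minimal sketch, assuming scikit-learn; placeholder data.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(300, 4)                       # placeholder data
pca = PCA().fit(X)                               # keep all components

component_var = pca.explained_variance_          # variance of each component
ratio_manual = component_var / component_var.sum()

# Matches scikit-learn's ratio because all components are kept here
assert np.allclose(ratio_manual, pca.explained_variance_ratio_)
```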

All in all, what I want to say is that feature selection depends entirely on the dataset and the significance of the features. PCA is just one technique. First understand the properties of the features and the dataset, then try to extract features. Hope this helps. If you can provide us with a concrete example, we may be able to provide more insight.

  • The idea of using `abs(loading_score)` would just be to get the importance. I know we are ignoring the correlation, but that doesn't matter (since I just look at the loading scores with the greatest impact). The multiplying would be a weighted importance, since having a `loading_score` of 0.8 in a PC that by itself explains 0.8 of the variance would contribute more to the explained variance than a `loading_score` of 0.99 in a PC describing 0.0002 of the variance. That was the thinking behind the calculation – CutePoison May 30 '20 at 08:45
  • Yep, it does make sense if the direction of correlation does not matter. But variance means deviation around its expectation. I would think the loading score and its variance are inversely proportional. Isn't it? – GPrathap May 30 '20 at 09:27
  • I would think it is the other way round: a higher (absolute) loading contributes more to the component? Please correct me if I'm wrong – CutePoison May 30 '20 at 09:40
  • I've looked into the reference that was mentioned. In a nutshell, PCA tries to separate components such that each component is orthogonal to the previous ones, based on the explained variance, and keeps doing this up to the given number of components. Yeah, in that sense a higher variance means the main directions of the features. – GPrathap May 30 '20 at 13:18
  • I added more clarification to the answer. The topic talks about feature selection, but actually you are looking for feature significance, which depends on the feature selection technique. – GPrathap May 30 '20 at 13:31