
In my understanding, PCA can be performed only for continuous features. But while trying to understand the difference between one-hot encoding and label encoding, I came across a post at the following link:

When to use One Hot Encoding vs LabelEncoder vs DictVectorizor?

It states that one-hot encoding followed by PCA is a very good method, which basically means PCA is applied to categorical features. This confuses me, so please advise.

data_person
  • I would like to ask whether the following article's approach of converting categorical variables to numeric values by summing their ASCII byte representations is a good idea: http://blog.davidvassallo.me/2015/10/28/data-mining-firewall-logs-principal-component-analysis/ – javid Nov 18 '19 at 23:23

7 Answers


I disagree with the others.

While you can use PCA on binary data (e.g. one-hot encoded data), that does not mean it is a good thing, or that it will work very well.

PCA is designed for continuous variables. It tries to minimize variance (=squared deviations). The concept of squared deviations breaks down when you have binary variables.

So yes, you can use PCA. And yes, you get an output. It even is a least-squares output: it's not as if PCA would segfault on such data. It works, but it is just much less meaningful than you'd want it to be, and supposedly less meaningful than, e.g., frequent pattern mining.
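
To see the point with a minimal sketch (made-up data, not part of the argument above): the "variance" PCA works with on a 0/1 indicator column is just p(1-p), so the components mainly reflect how frequent each category is rather than any notion of spread within a category.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
labels = rng.choice(["a", "b", "c"], size=1000, p=[0.7, 0.2, 0.1])
onehot = (labels[:, None] == np.array(["a", "b", "c"])).astype(float)

# per-column variance of a 0/1 indicator is p*(1-p): here ~0.21, ~0.16, ~0.09
print(onehot.var(axis=0))

pca = PCA().fit(onehot)
# the three columns sum to 1, so one component carries (numerically) zero variance
print(pca.explained_variance_ratio_)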

Has QUIT--Anony-Mousse
  • Could someone explain _why_ the concept of variance breaks down with binary variables? (I understand that it is redundant with the expected value, but it still conveys some sense of spread, does it not?) Further, since PCA is based on the decomposition of the variance-covariance matrix, does the fact that **variance** breaks down with binary variables also mean that the **covariance** between a binary variable and any other kind of variable is meaningless? – Arthur Jul 01 '20 at 18:33
  • Any alternatives to PCA more suitable for one-hot encoded categorical data? – a06e Dec 03 '21 at 09:39
  • @becko You could consider [multiple correspondence analysis](https://en.wikipedia.org/wiki/Multiple_correspondence_analysis). – Galen Aug 17 '22 at 22:52

MCA is a known technique for categorical data dimensionality reduction. In R there are many packages for MCA, and even for mixing it with PCA in mixed-data (continuous plus categorical) contexts. In Python an mca library exists too. MCA applies similar mathematics to PCA; indeed, the French statisticians used to say that "data analysis is finding the correct matrix to diagonalize".

http://gastonsanchez.com/visually-enforced/how-to/2012/10/13/MCA-in-R/
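
For a sense of the mathematics, here is a minimal from-scratch sketch (toy data and column names are made up) of MCA as correspondence analysis of the one-hot indicator matrix; in practice you would use one of the packages mentioned above.

import numpy as np
import pandas as pd

# hypothetical categorical data
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "size":  ["S",   "M",    "L",   "S",     "L",    "M"],
})

Z = pd.get_dummies(df).to_numpy(dtype=float)   # indicator (one-hot) matrix
P = Z / Z.sum()                                # correspondence matrix
r = P.sum(axis=1)                              # row masses
c = P.sum(axis=0)                              # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)

row_coords = (U * s) / np.sqrt(r)[:, None]     # principal coordinates of the rows
print(row_coords[:, :2])                       # first two MCA dimensions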

joscani

Basically, PCA finds and eliminates less informative (duplicate) information in the feature set and reduces the dimension of the feature space. In other words, imagine an N-dimensional hyperspace; PCA finds the M (M < N) directions along which the data varies most. In this way the data can be represented as M-dimensional feature vectors. Mathematically, it amounts to an eigenvalue and eigenvector computation on the covariance matrix of the feature space.

So, it is not important whether the features are continuous or not.

PCA is used widely in many applications, mostly for eliminating noisy, less informative data that comes from sensors or hardware before classification/recognition.

Edit:

Statistically speaking, categorical features can be seen as discrete random variables taking values in [0, 1]. The computations of the expectation E{X} and variance E{(X-E{X})^2} are still valid and meaningful for discrete random variables. I still stand by the applicability of PCA in the case of categorical features.

Consider a case where you would like to predict whether "it is going to rain on a given day or not". You have a categorical feature X, "Do I have to go to work on the given day", 1 for yes and 0 for no. Clearly the weather conditions do not depend on our work schedule, so P(R|X)=P(R). Assuming 5 work days every week, we have more 1s than 0s for X in our randomly collected dataset. PCA would probably lead to dropping this low-variance dimension in your feature representation.
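
As a rough sketch of that intuition (data made up): the imbalanced "work day" indicator has variance p(1-p) ≈ 0.20, below the 0.25 of a balanced binary feature, so a single retained principal component leans toward the higher-variance column.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
work = (rng.random(1000) < 5 / 7).astype(float)    # "do I go to work today?", ~5 days out of 7
other = (rng.random(1000) < 0.5).astype(float)     # an unrelated, balanced binary feature

X = np.column_stack([work, other])
print(X.var(axis=0))                               # ~[0.20, 0.25]

pca = PCA(n_components=1).fit(X)
print(pca.components_)                             # typically dominated by the higher-variance column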

At the end of the day, PCA is for dimension reduction with minimal loss of information. Intuitively, we rely on the variance of the data along a given axis to measure its usefulness for the task. I don't think there is any theoretical limitation to applying it to categorical features. The practical value depends on the application and data, which is also the case for continuous variables.

Ockhius
  • Well, it kinda boils down to calculating the eigenvectors of the covariance matrix, so with binary data (e.g. one-hot), how would you interpret the distance to the mean from a binary point? – CutePoison Feb 12 '19 at 07:40

The following publication shows great and meaningful results when computing PCA on categorical variables treated as simplex vertices:

Niitsuma H., Okada T. (2005) Covariance and PCA for Categorical Variables. In: Ho T.B., Cheung D., Liu H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg

https://doi.org/10.1007/11430919_61

It is available via https://arxiv.org/abs/0711.4452 (including as a PDF).

Oleg Melnikov
  • Why did you roll back that edit? This paper was published in 2005, in spite of the 2018 date at the top. See https://arxiv.org/abs/0711.4452, which is the source of the PDF you link to (*submitted 2007* certainly means it can't have been published after that point). And, much more importantly, the [citation reference for this paper](https://doi.org/10.1007/11430919_61) is 100% clear this is a work from 2005. – Martijn Pieters Mar 14 '20 at 12:34

In this paper, the authors use PCA to combine categorical features of high cardinality. If I understood correctly, they first calculate conditional probabilities for each target class. Then they choose a threshold hyperparameter and create a new binary variable for each conditional class probability, for each categorical feature to be combined. PCA is performed to combine the new binary variables, with the number of components retained specified as a hyperparameter.
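
As I understand that description, a rough hypothetical sketch could look like the following (column names, toy data, and the binary-target simplification are all made up here, not taken from the paper):

import pandas as pd
from sklearn.decomposition import PCA

# hypothetical data: two high-cardinality categorical features and a binary target
df = pd.DataFrame({
    "city":   ["A", "B", "A", "C", "B", "C", "A", "D"],
    "device": ["x", "y", "x", "z", "z", "y", "y", "x"],
    "target": [ 1,   0,   1,   0,   1,   0,   0,   1 ],
})

threshold = 0.5   # the threshold hyperparameter from the description above
binary_cols = {}
for col in ["city", "device"]:
    # conditional probability of target=1 given each category of this feature
    p = df.groupby(col)["target"].mean()
    # thresholded into a new binary variable
    binary_cols[col + "_gt_thresh"] = (df[col].map(p) > threshold).astype(float)

B = pd.DataFrame(binary_cols)

# PCA then combines the new binary variables; n_components is a hyperparameter
combined = PCA(n_components=1).fit_transform(B)
print(combined)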

michen00

PCA is a dimensionality reduction method that can be applied to any set of features. Here is an example using one-hot encoded (i.e. categorical) data:

from sklearn.preprocessing import OneHotEncoder
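# toy data: 3 categorical columns with 2, 3 and 4 distinct values -> 9 indicator columns after encoding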
enc = OneHotEncoder()
X = enc.fit_transform([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]).toarray()

print(X)

> array([[ 1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.]])


from sklearn.decomposition import PCA
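# keep the 3 directions of largest variance among the 9 indicator columns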
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

print(X_pca)

> array([[-0.70710678,  0.79056942,  0.70710678],
       [ 1.14412281, -0.79056942,  0.43701602],
       [-1.14412281, -0.79056942, -0.43701602],
       [ 0.70710678,  0.79056942, -0.70710678]])
Alex
  • Thanks for the detailed explanation. Can you please suggest how to interpret the results of the one-hot encoder in your code? – data_person Nov 24 '16 at 23:55
  • If I recall correctly, the PCA algorithm projects the features onto a different space by solving for the eigenvectors and eigenvalues. Then it looks at the top N (3 in this case) largest eigenvalues and takes those eigenvector components. The idea is to encode the most useful data in fewer features. – Alex Nov 25 '16 at 00:12
  • Oh, you were asking about the one-hot encoder... There are two options for feature 1 (0 and 1), three options for feature 2 (0, 1, and 2) and four options for feature 3 (0, 1, 2, and 3). That totals up to 9 options, and hence why we have 9 one-hot encoded features. Hopefully that gets you thinking along the right lines to understand what is happening. – Alex Nov 25 '16 at 00:16
  • Actually, I meant what you answered me at first :) – data_person Nov 25 '16 at 00:40
  • You are hiding under the carpet the fact that what you call "encoding" of a categorical variable is essentially still a binary representation thereof; therefore, even if you can apply PCA to it, this doesn't necessarily mean that it makes sense. – gented Dec 21 '18 at 15:52
  • You CAN apply PCA when one-hot encoding - the question is whether it makes sense. – CutePoison Feb 12 '19 at 07:42

I think PCA reduces variables by leveraging the linear relations between them. If there is only one categorical variable coded as one-hot, there is no linear relation between the one-hot columns, so it cannot be reduced by PCA.

But if other variables exist, the one-hot columns may be representable as linear combinations of those other variables.

So maybe it can be reduced by PCA; it depends on the relations between the variables.

NicolasLi