Dimension Reduction for Clustering in R (PCA and other methods)

Question

Let me preface this:

I have looked into this matter extensively and found several intriguing possibilities (such as this and this). I've also looked into principal component analysis, and I've seen some sources that claim it's a poor method for dimension reduction. However, I feel it may be a good method, but I am unsure how to implement it. All the sources I've found give a good explanation, but they rarely offer any advice on how to actually apply one of these methods in R.

So, my question is: is there a clear-cut way to go about dimension reduction in R? My dataset contains both numeric and categorical variables (with multiple levels) and is quite large (~40k observations, 18 variables (but 37 if I transform categorical variables into dummies)).

A few points:

If we want to use PCA, then I would have to somehow convert my categorical variables into numeric. Would it be okay to simply use a dummy variable approach for this?
For any sort of dimension reduction for unsupervised learning, how do I treat ordinal variables? Does the concept of ordinal variables even make sense in unsupervised learning?
My real issue with PCA is that once I have performed it and have my principal components, I have no idea what to actually do with them. As far as I know, each principal component is a combination of the variables, so I'm not sure how this helps us pick the best variables.

I'm not really sure if this belongs here. It seems like this question is more about how to do a proper dimension reduction analysis, which is really more of a statistical question that should go on [stats.se] or [datascience.se]. If the problem is really coding this in R, then the question should include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data (it should not be your entire data set). Try to separate the parts that are directly related to programming from those that are not. - MrFlick, Apr 05 '17 at 15:50

ABCD · Answer 1 · 2017-05-11T04:30:44.630

I don't think this is an R question. This is more like a statistics question.

PCA doesn't work directly on categorical variables. It relies on decomposing the covariance matrix, which is not meaningful for unordered categories.
Ordinal variables make a lot of sense in both supervised and unsupervised learning. What exactly are you looking for? You should only apply PCA to ordinal variables if they are not skewed and have many levels.
PCA by itself only gives you a new representation of the data in terms of principal components and their eigenvalues. That step alone has nothing to do with dimension reduction. I repeat, it has nothing to do with dimension reduction. You reduce your data set only if you keep a subset of the principal components. PCA is useful for regression, data visualisation, exploratory analysis etc.
A common way is to apply optimal scaling to transform your categorical variables for PCA:

Read this:

http://www.sicotests.com/psyarticle.asp?id=159
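To make the earlier point about keeping a subset of principal components concrete, here is a minimal sketch in base R. It is not taken from the question's data: the toy data frame `X`, the 80% variance cut-off and the choice of three clusters are all assumptions made for illustration. It uses numeric columns only, in line with the caveat above about categorical variables.

```r
# Toy data: six numeric variables driven by two hidden factors (all made up).
set.seed(42)
n  <- 200
z1 <- rnorm(n)
z2 <- rnorm(n)
X <- data.frame(
  x1 = z1 + 0.3 * rnorm(n),
  x2 = z1 + 0.3 * rnorm(n),
  x3 = z1 + 0.3 * rnorm(n),
  x4 = z2 + 0.3 * rnorm(n),
  x5 = z2 + 0.3 * rnorm(n),
  x6 = z2 + 0.3 * rnorm(n)
)

# PCA on centred, standardised variables.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component (eigenvalue = sdev^2).
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the smallest number of components that reaches 80% cumulative variance.
# The 80% cut-off is an arbitrary illustrative choice.
k <- which(cumsum(var_explained) >= 0.80)[1]

# The reduction happens here: keep only the first k columns of the scores.
reduced <- pca$x[, seq_len(k), drop = FALSE]

# Loadings show how each kept component combines the original variables.
loadings <- pca$rotation[, seq_len(k), drop = FALSE]
print(round(loadings, 2))

# Cluster on the reduced scores (3 centres is an assumption, not a recommendation).
clusters <- kmeans(reduced, centers = 3, nstart = 10)
print(table(clusters$cluster))
```

The components in `reduced` are new axes and not a selection of the original variables. The `loadings` matrix is where you can see which variables dominate each retained component.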

You may also want to consider correspondence analysis for categorical variables and multiple factor analysis for both categorical and continuous.
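As a rough illustration of the correspondence-analysis route for the categorical part only, the sketch below uses `mca()` from the MASS package, which ships with standard R installations. The data frame `cats` and its column names are invented for the example. Mixed categorical and continuous data would need a dedicated package (FactoMineR, for instance, provides `MFA()` and `FAMD()`), which is not shown here.

```r
library(MASS)  # provides mca(), multiple correspondence analysis

# Toy categorical data (hypothetical variables, all columns must be factors).
set.seed(1)
n <- 300
cats <- data.frame(
  region = factor(sample(c("north", "south", "west"), n, replace = TRUE)),
  plan   = factor(sample(c("basic", "plus", "pro"), n, replace = TRUE)),
  churn  = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Fit MCA and keep two dimensions (nf = 2 is an illustrative choice).
fit <- mca(cats, nf = 2)

# Row scores: one numeric coordinate pair per observation.
scores <- fit$rs
print(dim(scores))

# Singular values of the retained dimensions.
print(fit$d)
```

The row scores in `scores` are numeric, so they can be passed to a clustering routine in the same way as principal component scores.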
