Doing PCA in Weka

Question

I am trying to use PCA for dimensionality reduction in WEKA (for a classification problem).

I have 200 attributes in my data and close to 2100 rows.

Here are the steps that I follow:

Import the CSV file in the WEKA Explorer.
In the Preprocess tab, apply Normalize (to bring all of the data into the range [0,1]).
Then apply PCA.
- In the options for PCA there is a centerData option. If it is set to False, PCA is computed from the correlation matrix after standardizing the data (correct me if I am wrong); if it is set to True, PCA is computed from the covariance matrix.

My questions are:

Should I normalize the data before applying PCA or not? I tried PCA both with and without normalizing first and I get different results, so I am confused.
Should I standardize the data (bring the mean to 0) and then apply PCA?

Which setting should I choose for the centerData option of WEKA's PCA in either case?

score 7 · Accepted Answer · edited May 23 '17 at 12:22

This question has been answered in part here: PCA first or normalization first?

To answer your questions directly:

Whether to normalize is a judgment call that depends on your data. If you set centerData=TRUE and do not normalize or standardize your data, attributes with large values will have a greater influence on the PCA. If you set centerData=FALSE, Weka standardizes the data for you.
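To illustrate the point about large-valued attributes, here is a minimal sketch in plain Python (not Weka code). The two attributes and their values are made up, and first_component_2x2 is a hypothetical helper that uses the closed-form eigenvector of a symmetric 2x2 matrix. It compares the first principal component computed from the covariance matrix (the centerData=TRUE case) with the one computed from the correlation matrix (the centerData=FALSE case).

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    # Sample covariance (n - 1 denominator); covariance(xs, xs) is the variance.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def first_component_2x2(a, b, d):
    """Unit eigenvector for the largest eigenvalue of [[a, b], [b, d]]."""
    if b == 0:
        # Already diagonal: the component is the axis with the larger variance.
        return (1.0, 0.0) if a >= d else (0.0, 1.0)
    lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

# Made-up data: one attribute already in [0, 1], one with large raw values.
small = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
large = [1200.0, 3100.0, 2500.0, 5200.0, 4100.0, 6000.0]

var_s = covariance(small, small)
var_l = covariance(large, large)
cov_sl = covariance(small, large)

# Covariance-based PCA: the data is only centered.
pc_cov = first_component_2x2(var_s, cov_sl, var_l)

# Correlation-based PCA: equivalent to standardizing each attribute first.
r = cov_sl / math.sqrt(var_s * var_l)
pc_corr = first_component_2x2(1.0, r, 1.0)

print("covariance-based loadings: ", pc_cov)
print("correlation-based loadings:", pc_corr)
```

With this toy data the covariance-based component points almost entirely along the large-valued attribute, while the correlation-based component weights both attributes equally.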

And just to confirm your suspicions, in Weka, centerData does the following:

centerData=TRUE

Centers your data (it does not normalize or standardize, so if you decide to do that, you need to do it beforehand)
PCA is performed with the covariance matrix

centerData=FALSE

PCA is performed with the correlation matrix (data is standardized by the method)
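The relationship between the two settings can be checked numerically. The following is a small sketch in plain Python (not Weka code) with made-up attribute names and values. It shows that the covariance of standardized data equals the correlation of the raw data, and that min-max normalization to [0,1] changes the covariance but leaves the correlation unchanged. This suggests that normalizing beforehand should matter for centerData=TRUE but not for centerData=FALSE, although Weka's filter may differ in details not covered here.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    # Sample covariance (n - 1 denominator); covariance(xs, xs) is the variance.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def standardize(xs):
    # Zero mean, unit variance.
    m, s = mean(xs), math.sqrt(covariance(xs, xs))
    return [(x - m) / s for x in xs]

def min_max(xs):
    # Rescale to the range [0, 1].
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Made-up attributes on very different scales.
age = [23.0, 35.0, 41.0, 52.0, 60.0, 29.0]
income = [1800.0, 3200.0, 3900.0, 6100.0, 5400.0, 2500.0]

corr_raw = correlation(age, income)
cov_std = covariance(standardize(age), standardize(income))
corr_scaled = correlation(min_max(age), min_max(income))
cov_raw = covariance(age, income)
cov_scaled = covariance(min_max(age), min_max(income))

print("correlation of raw data:         ", corr_raw)
print("covariance of standardized data: ", cov_std)
print("correlation after min-max:       ", corr_scaled)
print("covariance raw vs. after min-max:", cov_raw, cov_scaled)
```

The first three printed values agree up to floating-point rounding, while the two covariances differ by several orders of magnitude.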

answered Oct 16 '13 at 16:12

Walter

Thanks @Walter. I am still trying to figure out which one suits my dataset best, as I see a deviation of a few percent (2-3%) in accuracy when trying the above options. P.S. Out of 200 attributes, around 180-185 are already in the [0,1] range; the problem is caused by the remaining attributes. – Neil Oct 17 '13 at 04:07
That is understandable. You have to do what makes the most sense for your data! However, keep in mind that the 2-3% deviation in accuracy could simply be an artifact of your testing method (possible overfitting). – Walter Oct 17 '13 at 04:17
