I am trying to use PCA for dimensionality reduction in WEKA (for a classification problem).

I have 200 attributes in my data and close to 2100 rows.

Here are the steps that I follow:

  • Import the CSV file in the WEKA Explorer.

  • In the Preprocess tab, apply Normalize (to bring all of the data into the range [0, 1]).

  • Then apply PCA.

    • Among the PCA options is centerData. As I understand it, setting it to False computes PCA from the correlation matrix after standardizing the data, and setting it to True uses the covariance matrix (correct me if I am wrong).

My questions are:

  1. Should I normalize the data before applying PCA or not? I tried PCA both with and without normalizing first and got different results, so I am confused.
  2. Should I standardize the data (bring the mean to 0 and the variance to 1) and then apply PCA?

Which setting should I choose for the centerData option of PCA in WEKA in each case?

Neil

1 Answer

This question has been answered in part here: PCA first or normalization first?

To answer your questions directly:

Whether to normalize is a judgment call that depends on your data. If you set centerData=TRUE and do not normalize or standardize your data, attributes with a larger numeric range (and hence larger variance) will have greater influence on the principal components. If you set centerData=FALSE, Weka standardizes the data for you.
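To make the point about scale concrete, here is a minimal sketch in plain Python (not Weka code; the two toy attributes and all helper names are made up for illustration). It computes the first principal component of two attributes from the covariance matrix of the raw data, then again after standardizing, which amounts to using the correlation matrix.

```python
import math
from statistics import mean, stdev

def covariance_matrix(x, y):
    """Sample covariance entries (sxx, sxy, syy) of two attributes."""
    mx, my = mean(x), mean(y)
    n = len(x)
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return sxx, sxy, syy

def first_component(sxx, sxy, syy):
    """Unit eigenvector for the largest eigenvalue of [[sxx, sxy], [sxy, syy]]."""
    if sxy == 0:
        # Uncorrelated attributes: the axis with the larger variance wins.
        return (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    largest = (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)
    vx, vy = sxy, largest - sxx
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

def standardize(values):
    """Shift to zero mean and scale to unit sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical toy data: one attribute already in [0, 1], one in the thousands.
small = [0.1, 0.4, 0.5, 0.9, 0.3]
large = [1200.0, 3100.0, 2900.0, 5200.0, 1800.0]

pc1_covariance = first_component(*covariance_matrix(small, large))
pc1_correlation = first_component(
    *covariance_matrix(standardize(small), standardize(large))
)

print("covariance PCA loadings: ", pc1_covariance)
print("correlation PCA loadings:", pc1_correlation)
```

On the raw data the first component points almost entirely along the large-range attribute, while after standardizing both attributes load equally.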

And just to confirm your suspicions, in Weka, centerData does the following:

centerData=TRUE

  • Centers your data by subtracting each attribute's mean (it does not normalize or standardize, so if you want that, you need to do it beforehand)
  • PCA is performed with the covariance matrix

centerData=FALSE

  • PCA is performed with the correlation matrix (data is standardized by the method)
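
One consequence of the two modes, sketched below in plain Python (the toy data and helper names are hypothetical, and this checks the underlying statistics rather than Weka's implementation): min-max normalization shifts each attribute and rescales it by a positive constant, which changes covariances but leaves correlations untouched. So, mathematically, normalizing first should affect a covariance-based PCA (centerData=TRUE) but not a correlation-based one (centerData=FALSE).

```python
from statistics import mean, stdev

def min_max(values):
    """Rescale to [0, 1]; assumes the attribute is not constant."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def covariance(x, y):
    """Sample covariance of two attributes."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / (stdev(x) * stdev(y))

# Hypothetical toy data: one attribute already in [0, 1], one in the thousands.
small = [0.1, 0.4, 0.5, 0.9, 0.3]
large = [1200.0, 3100.0, 2900.0, 5200.0, 1800.0]

cov_raw = covariance(small, large)
cov_normalized = covariance(min_max(small), min_max(large))
corr_raw = correlation(small, large)
corr_normalized = correlation(min_max(small), min_max(large))

print("covariance  raw vs normalized:", cov_raw, cov_normalized)
print("correlation raw vs normalized:", corr_raw, corr_normalized)
```

The two correlations agree up to floating-point rounding, while the covariances differ by the product of the two attribute ranges.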
Walter
  • Thanks @Walter. I am still trying to figure out which option suits my dataset best, as I see a deviation of a few percent (2-3%) in accuracy when trying the options above. P.S. Out of 200 attributes, around 180-185 are already in the [0, 1] range; the problem comes from the remaining attributes. – Neil Oct 17 '13 at 04:07
  • That is understandable. You have to do what makes the most sense for your data! However, keep in mind that the 2-3% deviation in accuracy could simply be an artifact of your testing method (possible overfitting). – Walter Oct 17 '13 at 04:17