
I am getting different results when running RandomizedPCA on sparse and dense matrices:

import numpy as np
import scipy.sparse as scsp
from sklearn.decomposition import RandomizedPCA

x = np.matrix([[1,2,3,2,0,0,0,0],
               [2,3,1,0,0,0,0,3],
               [1,0,0,0,2,3,2,0],
               [3,0,0,0,4,5,6,0],
               [0,0,4,0,0,5,6,7],
               [0,6,4,5,6,0,0,0],
               [7,0,5,0,7,9,0,0]])

csr_x = scsp.csr_matrix(x)

# PCA on the sparse matrix
s_pca = RandomizedPCA(n_components=2)
s_pca_scores = s_pca.fit_transform(csr_x)
s_pca_weights = s_pca.explained_variance_ratio_

# PCA on the dense matrix
d_pca = RandomizedPCA(n_components=2)
d_pca_scores = d_pca.fit_transform(x)
d_pca_weights = d_pca.explained_variance_ratio_

print('sparse matrix scores {}'.format(s_pca_scores))
print('dense matrix scores {}'.format(d_pca_scores))
print('sparse matrix weights {}'.format(s_pca_weights))
print('dense matrix weights {}'.format(d_pca_weights))

Result:

sparse matrix scores [[  1.90912166   2.37266113]
 [  1.98826835   0.67329466]
 [  3.71153199  -1.00492408]
 [  7.76361811  -2.60901625]
 [  7.39263662  -5.8950472 ]
 [  5.58268666   7.97259172]
 [ 13.19312194   1.30282165]]
dense matrix scores [[-4.23432815  0.43110596]
 [-3.87576857 -1.36999888]
 [-0.05168291 -1.02612363]
 [ 3.66039297 -1.38544473]
 [ 1.48948352 -7.0723618 ]
 [-4.97601287  5.49128164]
 [ 7.98791603  4.93154146]]
sparse matrix weights [ 0.74988508  0.25011492]
dense matrix weights [ 0.55596761  0.44403239]

The dense version gives the same results as normal PCA, but what is going on when the matrix is sparse? Why are the results different?

Akavall

1 Answer


In the case of sparse data, RandomizedPCA does not center the data (no mean removal), because subtracting the column means would make the sparse matrix dense and could blow up memory usage. That probably explains what you observe.
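
You can reproduce both behaviours directly with NumPy's SVD (a minimal sketch; the name xa and the comparison are mine, and the results will only match the RandomizedPCA output up to small numerical differences and per-component sign flips):

import numpy as np

xa = np.asarray(x)  # x as defined in the question

# PCA scores are U * S from the SVD of the *centered* data.
# The sparse code path skips the centering step.
u, s, vt = np.linalg.svd(xa)
print(u[:, :2] * s[:2])        # ~ the sparse-path scores (no centering)

xc = xa - xa.mean(axis=0)      # column-wise mean removal (densifies a sparse matrix)
uc, sc, vtc = np.linalg.svd(xc)
print(uc[:, :2] * sc[:2])      # ~ the dense-path (true PCA) scores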

I agree this "feature" is poorly documented. Please feel free to open an issue on GitHub to track it and improve the docs.

Edit: we fixed that discrepancy in scikit-learn 0.15: RandomizedPCA is now deprecated for sparse data. Instead, use TruncatedSVD, which does the same thing as PCA but without trying to center the data.
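
With that API it looks like the following (a minimal sketch, reusing csr_x from the question):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
svd_scores = svd.fit_transform(csr_x)        # accepts sparse input; no centering, memory stays bounded
svd_weights = svd.explained_variance_ratio_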

ogrisel
  • I think it makes sense not to demean in order to save memory, thanks for that, but demeaning would not affect the weights (eigenvalues); demeaning only affects the scores. However, my eigenvalues are significantly different, so there must be something else going on. – Akavall May 21 '13 at 23:35
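
(A quick NumPy check of this point, using the x from the question; xa is mine. The leading singular values of the raw and the column-centered matrix differ, so centering changes the eigenvalues and variance ratios as well as the scores:)

import numpy as np

xa = np.asarray(x)
print(np.linalg.svd(xa, compute_uv=False)[:2])                    # raw, no centering
print(np.linalg.svd(xa - xa.mean(axis=0), compute_uv=False)[:2])  # column-centered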