
I am getting different results when running RandomizedPCA on sparse and dense matrices:

import numpy as np
import scipy.sparse as scsp
from sklearn.decomposition import RandomizedPCA

x = np.matrix([[1,2,3,2,0,0,0,0],
               [2,3,1,0,0,0,0,3],
               [1,0,0,0,2,3,2,0],
               [3,0,0,0,4,5,6,0],
               [0,0,4,0,0,5,6,7],
               [0,6,4,5,6,0,0,0],
               [7,0,5,0,7,9,0,0]])

csr_x = scsp.csr_matrix(x)

# PCA on the sparse matrix
s_pca = RandomizedPCA(n_components=2)
s_pca_scores = s_pca.fit_transform(csr_x)
s_pca_weights = s_pca.explained_variance_ratio_

# PCA on the dense matrix
d_pca = RandomizedPCA(n_components=2)
d_pca_scores = d_pca.fit_transform(x)
d_pca_weights = d_pca.explained_variance_ratio_

print('sparse matrix scores {}'.format(s_pca_scores))
print('dense matrix scores {}'.format(d_pca_scores))
print('sparse matrix weights {}'.format(s_pca_weights))
print('dense matrix weights {}'.format(d_pca_weights))

Result:

sparse matrix scores [[  1.90912166   2.37266113]
 [  1.98826835   0.67329466]
 [  3.71153199  -1.00492408]
 [  7.76361811  -2.60901625]
 [  7.39263662  -5.8950472 ]
 [  5.58268666   7.97259172]
 [ 13.19312194   1.30282165]]
dense matrix scores [[-4.23432815  0.43110596]
 [-3.87576857 -1.36999888]
 [-0.05168291 -1.02612363]
 [ 3.66039297 -1.38544473]
 [ 1.48948352 -7.0723618 ]
 [-4.97601287  5.49128164]
 [ 7.98791603  4.93154146]]
sparse matrix weights [ 0.74988508  0.25011492]
dense matrix weights [ 0.55596761  0.44403239]

The dense version gives the same results as normal PCA, but what is going on when the matrix is sparse? Why are the results different?

Akavall

1 Answer


In the case of sparse data, RandomizedPCA does not center the data (no mean removal), because subtracting the column means would make the sparse matrix dense and could blow up memory usage. That probably explains what you observe.
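
You can reproduce both behaviours directly with NumPy's SVD (a minimal sketch; the name xa and the comparison are mine, and the results will only match the RandomizedPCA output up to small numerical differences and per-component sign flips):

import numpy as np

xa = np.asarray(x)  # x as defined in the question

# PCA scores are U * S from the SVD of the *centered* data.
# The sparse code path skips the centering step.
u, s, vt = np.linalg.svd(xa)
print(u[:, :2] * s[:2])        # ~ the sparse-path scores (no centering)

xc = xa - xa.mean(axis=0)      # column-wise mean removal (densifies a sparse matrix)
uc, sc, vtc = np.linalg.svd(xc)
print(uc[:, :2] * sc[:2])      # ~ the dense-path (true PCA) scores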

I agree this "feature" is poorly documented. Please feel free to open an issue on GitHub to track it and improve the docs.

Edit: we fixed that discrepancy in scikit-learn 0.15: RandomizedPCA is now deprecated for sparse data. Instead, use TruncatedSVD, which does the same thing as PCA but without trying to center the data.
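
With that API it looks like the following (a minimal sketch, reusing csr_x from the question):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
svd_scores = svd.fit_transform(csr_x)        # accepts sparse input; no centering, memory stays bounded
svd_weights = svd.explained_variance_ratio_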

ogrisel
  • I think it makes sense not to demean in order to save memory, thanks for that, but demeaning would not affect the weights (eigenvalues); demeaning only affects the scores. However, my eigenvalues are significantly different, so there must be something else going on. – Akavall May 21 '13 at 23:35
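
(A quick NumPy check of this point, using the x from the question; xa is mine. The leading singular values of the raw and the column-centered matrix differ, so centering changes the eigenvalues and variance ratios as well as the scores:)

import numpy as np

xa = np.asarray(x)
print(np.linalg.svd(xa, compute_uv=False)[:2])                    # raw, no centering
print(np.linalg.svd(xa - xa.mean(axis=0), compute_uv=False)[:2])  # column-centered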