
When I run this code with scikit-learn 0.15.0 (`sklearn.__version__ == '0.15.0'`) I get a strange result:

import numpy as np
from scipy import sparse
from sklearn.decomposition import RandomizedPCA

a = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]])

s = sparse.csr_matrix(a)

pca = RandomizedPCA(n_components=20)
pca.fit_transform(s)

With 0.15.0 I get:

>>> pca.explained_variance_ratio_.sum()
2.1214285714285697

With 0.14.1 I get:

>>> pca.explained_variance_ratio_.sum()
0.99999999999999978

The sum should not be greater than 1. The documentation for `explained_variance_ratio_` says:

Percentage of variance explained by each of the selected components. If k is not set then all components are stored and the sum of explained variances is equal to 1.0
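
For comparison, the regular dense PCA keeping all the components does behave as the docs describe (a quick sanity check of my own, with the same array a):

>>> from sklearn.decomposition import PCA
>>> ratios = PCA().fit(a).explained_variance_ratio_  # all components kept
>>> round(float(ratios.sum()), 10)
1.0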

What is going on here?

Akavall
  • looks like a bug to me, you could post an [issue](https://github.com/scikit-learn/scikit-learn/issues) – EdChum Jul 22 '14 at 07:01
  • I added one here: https://github.com/scikit-learn/scikit-learn/issues/3469 . Can you post the version info for numpy, scipy, and your linear algebra (BLAS/ATLAS/LAPACK/etc.) libraries as well? – Kyle Kastner Jul 22 '14 at 07:43
  • The behavior in 0.14.1 was a bug as the sum was always 1.0 whatever the truncation. The fact that the explained variance is larger than 1.0 in 0.15.0 is also a bug. But a different one... – ogrisel Jul 22 '14 at 08:37

1 Answer


The behavior in 0.14.1 was a bug: `explained_variance_ratio_.sum()` always returned 1.0, irrespective of the number of components to extract (the truncation). In 0.15.0 this was fixed for dense arrays, as the following demonstrates:

>>> RandomizedPCA(n_components=3).fit(a).explained_variance_ratio_.sum()
0.86786547849848206
>>> RandomizedPCA(n_components=4).fit(a).explained_variance_ratio_.sum()
0.95868429631268515
>>> RandomizedPCA(n_components=5).fit(a).explained_variance_ratio_.sum()
1.0000000000000002

Your data has rank 5 (100% of the variance is explained by 5 components).
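
A quick way to verify this (my own check with NumPy; PCA operates on the mean-centered data, and it is that matrix whose rank matters here):

>>> import numpy as np
>>> np.linalg.matrix_rank(a - a.mean(axis=0))  # rank of the centered data
5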

If you try to call RandomizedPCA on a sparse matrix you will get:

DeprecationWarning: Sparse matrix support is deprecated and will be dropped in 0.16. Use TruncatedSVD instead.

Using RandomizedPCA on sparse data is incorrect: PCA requires centering the data, but we cannot center a sparse matrix without breaking its sparsity, which can blow up the memory on realistically sized sparse data.
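
Here is a minimal sketch of that conflict, using the tiny matrix s from the question (on realistically sized sparse data the densified result would simply not fit in memory): subtracting the column means fills in almost every zero entry.

>>> import numpy as np
>>> dense = s.toarray()
>>> centered = dense - dense.mean(axis=0)  # the centering step PCA needs
>>> np.count_nonzero(dense) < np.count_nonzero(centered)  # zeros get filled in
True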

TruncatedSVD will give you correct explained variance ratios on sparse data (but keep in mind that it does not do exactly the same thing as PCA on dense data):

>>> from sklearn.decomposition import TruncatedSVD
>>> TruncatedSVD(n_components=3).fit(s).explained_variance_ratio_.sum()
0.67711305361490826
>>> TruncatedSVD(n_components=4).fit(s).explained_variance_ratio_.sum()
0.8771350212934137
>>> TruncatedSVD(n_components=5).fit(s).explained_variance_ratio_.sum()
0.95954459082530097
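
To make the "not exactly the same thing" concrete: TruncatedSVD decomposes the data as-is, whereas PCA centers it first. A rough sketch (my own, using the standard PCA estimator): if you center the data yourself, TruncatedSVD's explained variance ratios line up with PCA's.

>>> import numpy as np
>>> from sklearn.decomposition import PCA, TruncatedSVD
>>> dense = s.toarray()
>>> centered = dense - dense.mean(axis=0)
>>> svd_sum = TruncatedSVD(n_components=5).fit(centered).explained_variance_ratio_.sum()
>>> pca_sum = PCA(n_components=5).fit(dense).explained_variance_ratio_.sum()
>>> np.allclose(svd_sum, pca_sum)  # same ratios once the data is centered
True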
ogrisel
  • I believe this answer answers the `explained_variance_ratio_` part of my other question http://stackoverflow.com/questions/16660771/different-results-when-using-sklearn-randomizedpca-with-sparse-and-dense-matrice, do you mind adding this info to that answer (that's your answer) so I can accept it? – Akavall Jul 22 '14 at 13:21