Why are sklearn.decomposition.TruncatedSVD's explained variance ratios not ordered by singular values?

My code is below:

import numpy as np
from sklearn.decomposition import TruncatedSVD

# 4 samples x 14 binary features
X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]])
svd = TruncatedSVD(n_components=4)
svd.fit(X)
print(svd.explained_variance_ratio_)
print(svd.singular_values_)

and the results:

[0.17693405 0.46600983 0.21738089 0.13967523]
[3.1918354  2.39740372 1.83127499 1.30808033]

I heard that a singular value indicates how much of the data its component can explain, so I expected the explained variance ratios to follow the same order as the singular values. But the ratios are not in descending order.

Can someone explain why this happens?

desertnaut
장동엽

1 Answer

I heard that a singular value means how much the component can explain data

This holds for PCA, but it is not exactly true for (truncated) SVD. Quoting from a relevant GitHub thread from 2014, back when an explained_variance_ratio_ attribute was not even available for TruncatedSVD (emphasis mine):

preserving the variance is not the exact objective function of truncated SVD without centering

So, the singular values themselves are indeed sorted in descending order, but this does not necessarily hold for the corresponding explained variance ratios if the data are not centered.
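To see where the mismatch comes from, here is a minimal NumPy sketch using the X from the question. It uses an exact SVD from np.linalg.svd as a stand-in for TruncatedSVD (an assumption made for illustration; with 4 components on a 4-row matrix nothing is truncated). For uncentered data, the squared singular value of a component measures the mean square of its projections, which is the variance plus the squared mean, so a component with a large mean can have a large singular value and still a small variance.

```python
import numpy as np

# X is the matrix from the question.
X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]], dtype=float)
n_samples = X.shape[0]

# Exact SVD; singular values are returned in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Project the uncentered data onto the right singular vectors.
Z = X @ Vt.T

mean_sq = (Z ** 2).mean(axis=0)   # equals s**2 / n_samples
comp_mean = Z.mean(axis=0)        # non-zero because X was not centered
comp_var = Z.var(axis=0)          # what an explained variance is based on

# Ratio of each component's variance to the total feature variance.
ratio = comp_var / X.var(axis=0).sum()

print("singular values:      ", s)
print("s^2 / n:              ", s ** 2 / n_samples)
print("variance + mean^2:    ", comp_var + comp_mean ** 2)
print("variance ratios:      ", ratio)
```

The second and third printed rows agree, while the last row is not in descending order: the first component carries most of the mean, so little of its singular value is left over as variance.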

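But if we center the data first, the explained variance ratios do come out sorted in descending order, in correspondence with the singular values. (StandardScaler, used below, also scales each feature to unit variance; it is the centering that matters for the ordering.)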

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

sc = StandardScaler()
Xs = sc.fit_transform(X) # X data from the question here

svd = TruncatedSVD(n_components=4)
svd.fit(Xs)

print(svd.explained_variance_ratio_)
print(svd.singular_values_)

Result:

[4.60479851e-01 3.77856541e-01 1.61663608e-01 8.13905807e-66]
[5.07807756e+00 4.59999633e+00 3.00884730e+00 8.21430014e-17]
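As a cross-check, the same ordering appears with centering alone, without the unit-variance scaling. This is a small NumPy sketch under the same assumption as before (exact SVD via np.linalg.svd instead of TruncatedSVD, X taken from the question): once the column means are removed, every projection has zero mean, so each component's variance is exactly its squared singular value divided by the number of samples.

```python
import numpy as np

# X is the matrix from the question.
X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]], dtype=float)
n_samples = X.shape[0]

# Center only: subtract each column's mean, no rescaling.
Xc = X - X.mean(axis=0)

U_c, s_c, Vt_c = np.linalg.svd(Xc, full_matrices=False)
Zc = Xc @ Vt_c.T

var_c = Zc.var(axis=0)                  # equals s_c**2 / n_samples
ratio_c = var_c / Xc.var(axis=0).sum()  # now follows the singular values

print("singular values:", s_c)
print("variance ratios:", ratio_c)
```

The last singular value is numerically zero here as well, since centering 4 samples leaves at most 3 independent directions.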

For the mathematical & computational differences between centered and non-centered data in PCA & SVD calculations, see How does centering make a difference in PCA (for SVD and eigen decomposition)?


Regarding the use of TruncatedSVD itself, here is user ogrisel again (scikit-learn contributor) in a relevant answer in Difference between scikit-learn implementations of PCA and TruncatedSVD:

In practice TruncatedSVD is useful on large sparse datasets which cannot be centered without making the memory usage explode.

So it's not clear why you chose TruncatedSVD here; unless your dataset is so large that centering it causes memory issues, you should probably use PCA instead.
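The memory argument in the quote can be illustrated on a synthetic sparse matrix. This is only a sketch: the shape, density, and number of components are arbitrary values chosen for the example, not anything from the question. It counts how many entries must be stored before and after centering, and then fits TruncatedSVD directly on the sparse input.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse data (assumed sizes): about 1% of entries are non-zero.
X_sparse = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)
stored_before = X_sparse.nnz

# Centering subtracts the column mean from every entry, so zeros become non-zero.
col_means = np.asarray(X_sparse.mean(axis=0)).ravel()
X_centered = X_sparse.toarray() - col_means
nonzero_after = np.count_nonzero(X_centered)

print("stored entries before centering:", stored_before)
print("non-zero entries after centering:", nonzero_after)

# TruncatedSVD works on the sparse matrix as-is, with no centering step.
svd_sparse = TruncatedSVD(n_components=5, random_state=0)
Z_sparse = svd_sparse.fit_transform(X_sparse)
print("reduced shape:", Z_sparse.shape)
```

On data of realistic size the dense centered copy is what would exhaust memory; for a small dense array like the one in the question that concern does not apply.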

desertnaut
  • Excellent answer! – TayTay May 12 '20 at 15:00
  • @장동엽 You are very welcome; notice that the question had been asked several times in the past (e.g. [here](https://stackoverflow.com/questions/35299061/scikit-learn-truncatedsvds-explained-variance-ratio-not-in-descending-order) and [here](https://stackoverflow.com/questions/54411576/sci-kit-learn-truncatedsvd-explained-variance-ratio-not-in-descending-order)), but without a satisfactory answer so far ;) – desertnaut May 14 '20 at 14:56