
I have a data set of n = 100,000 observations by p = 2 million variables. I cannot load all the data into memory at once, and the covariance matrix would not fit either (2 million x 2 million). Is there a way in R to get most of the relevant principal components (~5,000 to 10,000 I think, explaining 99% of the total variation)?

I am trying to find out whether there is a good implementation of an iterative algorithm. The packages I have found seem either discontinued or limited to approximating the first few principal components.

If there is no package with precompiled algorithms, which iterative algorithm would you suggest for getting most of the PCs (one that I could code myself)?
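
For concreteness, here is a minimal sketch of one such scheme, blocked subspace (orthogonal) iteration, which never forms the p x p covariance and touches the data one row chunk at a time. The chunk sizes and the in-memory `chunks` list are hypothetical placeholders; in practice each chunk would be read from disk.

```r
## Hedged sketch: blocked subspace iteration for the top-k principal axes.
## The in-memory "chunks" list stands in for reading row blocks from disk.
## Sizes are toy values so the example runs quickly.
set.seed(1)
n <- 500; p <- 2000; k <- 10; n_chunks <- 5
chunks <- lapply(split(seq_len(n), rep(seq_len(n_chunks), each = n / n_chunks)),
                 function(idx) matrix(rnorm(length(idx) * p), length(idx), p))

## First pass: column means, so each chunk can be centred on the fly.
col_means <- Reduce(`+`, lapply(chunks, colSums)) / n

## Subspace iteration: repeatedly apply X'X to a p x k block, re-orthonormalise.
Q <- qr.Q(qr(matrix(rnorm(p * k), p, k)))      # random orthonormal start
for (iter in 1:30) {
  Z <- matrix(0, p, k)
  for (Xc in chunks) {
    Xc <- sweep(Xc, 2, col_means)              # centre the chunk
    Z  <- Z + crossprod(Xc, Xc %*% Q)          # accumulate (X'X) Q without X'X
  }
  Q <- qr.Q(qr(Z))
}

## Rayleigh-Ritz step on the final subspace to extract components and variances.
Z <- matrix(0, p, k)
for (Xc in chunks) {
  Xc <- sweep(Xc, 2, col_means)
  Z  <- Z + crossprod(Xc, Xc %*% Q)
}
B  <- crossprod(Q, Z) / (n - 1)                # small k x k projected covariance
es <- eigen(B, symmetric = TRUE)
pc_axes      <- Q %*% es$vectors               # approximate loadings (p x k)
pc_variances <- es$values                      # approximate component variances
```

For thousands of components the p x k working block itself becomes very large and convergence is slow, so in that regime it is usually better to exploit the p >> n structure instead (see the sketch after the comments below).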

RemiDav
  • I had to switch to Python and use sklearn. It seems R doesn't have the necessary tools. – RemiDav Dec 04 '18 at 09:58
  • Have you looked at `rARPACK` for large scale eigenvalue decomposition? Also, this question seems already to have been asked here: https://stackoverflow.com/questions/12670972/doing-pca-on-very-large-data-set-in-r?rq=1 – Joe Dec 06 '18 at 13:21
  • Thanks for the suggestion. The question you linked is about n >> p and unfortunately doesn't fit my p >> n problem. – RemiDav Dec 07 '18 at 08:50
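
As the last comment points out, the structure here is p >> n, so the sample covariance has at most n - 1 non-zero eigenvalues, and all of them can in principle be recovered from the n x n Gram matrix XX' rather than the p x p covariance. Below is a hedged sketch of that route with toy sizes; the in-memory matrix stands in for column chunks read from disk, and at n = 100,000 the Gram matrix itself (~80 GB in double precision) would need to be file-backed.

```r
## Hedged sketch of the p >> n "Gram trick": the non-zero eigenvalues of
## X'X / (n - 1) equal those of X X' / (n - 1), so the n x n Gram matrix can be
## eigendecomposed instead of the p x p covariance. Sizes are toy values and X
## is held in memory here; in practice the column chunks would come from disk.
set.seed(1)
n <- 300; p <- 5000; k <- 10
X <- matrix(rnorm(n * p), n, p)
X <- sweep(X, 2, colMeans(X))                      # centre the variables

## Accumulate the Gram matrix over column (variable) chunks.
K <- matrix(0, n, n)
for (cols in split(seq_len(p), ceiling(seq_len(p) / 1000))) {
  K <- K + tcrossprod(X[, cols])                   # X[, cols] %*% t(X[, cols])
}

es <- eigen(K / (n - 1), symmetric = TRUE)         # eigenvalues = PC variances
d  <- sqrt(es$values[1:k] * (n - 1))               # singular values of X
scores   <- es$vectors[, 1:k] %*% diag(d)          # PC scores (n x k)
loadings <- crossprod(X, es$vectors[, 1:k]) %*% diag(1 / d)  # axes (p x k)
```

If only the leading components are needed, a truncated eigendecomposition of K (e.g. RSpectra::eigs_sym, the successor to the rARPACK package mentioned in the comments) avoids the full eigen() call.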

0 Answers