
I have a csv that is 100,000 rows x 27,000 columns, and I am trying to run PCA on it to produce a 100,000 rows x 300 columns matrix. The csv is 9 GB on disk. Here is what I'm currently doing:

from sklearn.decomposition import PCA as RandomizedPCA
import csv
import sys
import numpy as np
import pandas as pd

dataset = sys.argv[1]
X = pd.DataFrame.from_csv(dataset)
Y = X.pop("Y_Level")
X = (X - X.mean()) / (X.max() - X.min())
Y = list(Y)
dimensions = 300
sklearn_pca = RandomizedPCA(n_components=dimensions)
X_final = sklearn_pca.fit_transform(X)

When I run the above code, my program gets killed during the .from_csv step. I've been able to get around that by splitting the csv into sets of 10,000 rows, reading them in one by one, and then calling pd.concat. That lets me get as far as the normalization step (X - X.mean())... before getting killed. Is my data just too big for my MacBook Air, or is there a better way to do this? I would really love to use all the data I have for my machine learning application.
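
For reference, the chunked loading currently looks roughly like this (just a sketch: I'm using pandas' chunksize here, which amounts to the same thing as pre-splitting the file into sets of 10,000 rows):

import sys
import pandas as pd

dataset = sys.argv[1]

# Read the 9 GB csv 10,000 rows at a time instead of all at once,
# then stitch the pieces back together with pd.concat
chunks = [chunk for chunk in pd.read_csv(dataset, chunksize=10000)]
X = pd.concat(chunks)

Y = X.pop("Y_Level")
X = (X - X.mean()) / (X.max() - X.min())  # this is the step where it now gets killed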


If I wanted to use IncrementalPCA, as suggested by the answer below, is this how I would do it?

from sklearn.decomposition import IncrementalPCA
import csv
import sys
import numpy as np
import pandas as pd

dataset = sys.argv[1]
chunksize_ = 10000
#total_size is 100000
dimensions = 300

reader = pd.read_csv(dataset, sep = ',', chunksize = chunksize_)
sklearn_pca = IncrementalPCA(n_components=dimensions)
Y = []
for chunk in reader:
    y = chunk.pop("Y_Level")
    Y = Y + list(y)
    sklearn_pca.partial_fit(chunk)
X = ???
#This is where I'm stuck: how do I take my final PCA and output it to X?
#The normal transform method takes in an X, which I don't have because I
#couldn't fit it into memory.

I can't find any good examples online.

mt88

2 Answers


Try dividing your data or loading it into the script in batches, and fit your PCA with IncrementalPCA, using its partial_fit method on each batch.

from sklearn.decomposition import IncrementalPCA
import csv
import sys
import numpy as np
import pandas as pd

dataset = sys.argv[1]
chunksize_ = 10000  # rows per chunk; must fit in memory and be at least n_components
dimensions = 300

reader = pd.read_csv(dataset, sep = ',', chunksize = chunksize_)
sklearn_pca = IncrementalPCA(n_components=dimensions)
for chunk in reader:
    y = chunk.pop("Y")               # drop the label column before fitting
    sklearn_pca.partial_fit(chunk)   # update the PCA incrementally with this chunk

# Computed mean per feature
mean = sklearn_pca.mean_
# and stddev
stddev = np.sqrt(sklearn_pca.var_)

Xtransformed = None   # will become the final 100,000 x 300 matrix
for chunk in pd.read_csv(dataset, sep = ',', chunksize = chunksize_):
    y = chunk.pop("Y")
    Xchunk = sklearn_pca.transform(chunk)
    if Xtransformed is None:
        Xtransformed = Xchunk
    else:
        Xtransformed = np.vstack((Xtransformed, Xchunk))
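
If you also need the labels aligned with the transformed rows, or want the reduced matrix on disk instead of in RAM, the second pass could look like this instead (just a sketch, and the output file names are placeholders):

Xparts = []
labels = []
for chunk in pd.read_csv(dataset, sep = ',', chunksize = chunksize_):
    labels.extend(chunk.pop("Y"))           # keep labels in the same order as the rows
    Xparts.append(sklearn_pca.transform(chunk))

Xtransformed = np.vstack(Xparts)            # final 100,000 x 300 matrix
np.save("X_reduced.npy", Xtransformed)      # persist so later steps can reload it
np.save("Y_labels.npy", np.asarray(labels))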

Useful link

Ibraim Ganiev
  • Thanks for the response! Do you mind taking a quick look at my implementation above? I can't find good examples online; the one at the link you sent loads the whole dataset into memory. – mt88 Aug 24 '15 at 23:24
  • Thanks for the help. Do I need to call transform or some other function after the loop is done? Ultimately I need a 2-dimensional matrix of floats of dimensions 100,000 x 300. Will just calling fit give me this or do I need to call transform in some way? Before, I used fit_transform when my data was small. – mt88 Aug 25 '15 at 18:15
  • My script just finished and it doesn't look like an X matrix is returned. – mt88 Aug 25 '15 at 19:30
  • Added more clarity about where I'm stuck to the question. – mt88 Aug 25 '15 at 19:40
  • @mt88, after you have completed the PCA fitting with partial_fit on all chunks of data, you can call transform to actually reduce the dimensionality, again by chunks if needed (in a separate for loop, after fitting). After transforming you will get a 100k x 300 matrix. You have to call transform after fitting because the model must learn from all examples in the available dataset, otherwise it will not transform the data correctly. That's why you cannot use fit_transform with IncrementalPCA, only partial_fit, fit, and transform. – Ibraim Ganiev Aug 25 '15 at 19:51
  • Instead of using the ugly ifs to check whether Xtransformed is empty, you can just initialise it with `Xtransformed = np.empty((0, dimensions))`. – Tadej Magajna Apr 01 '18 at 14:40

PCA needs to compute a correlation matrix, which would be 100,000x100,000. If the data is stored in doubles, then that's 80 GB. I would be willing to bet your Macbook does not have 80 GB RAM.

The PCA transformation matrix is likely to be nearly the same for a reasonably sized random subset.
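
For example, something along these lines might work (only a sketch: I'm reusing the Y_Level column name and 10,000-row chunks from your question, sampling roughly 10% of the rows for the fit, and skipping your normalization step for brevity):

import sys
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

dataset = sys.argv[1]
dimensions = 300
sample_fraction = 0.1   # roughly 10,000 of the 100,000 rows

# Fit PCA on a random subset: the callable skips ~90% of data rows but keeps the header
sample = pd.read_csv(dataset,
                     skiprows=lambda i: i > 0 and np.random.rand() > sample_fraction)
sample.pop("Y_Level")
pca = PCA(n_components=dimensions, svd_solver='randomized')
pca.fit(sample)

# Project the full dataset chunk by chunk using the subset-fitted components
parts = []
for chunk in pd.read_csv(dataset, chunksize=10000):
    chunk.pop("Y_Level")
    parts.append(pca.transform(chunk))
X_final = np.vstack(parts)   # 100,000 x 300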

Don Reba
  • Thanks for the response! Is there a way to tell RandomizedPCA to use a subset of the data rather than all of X? Also, is there a way to tell what a reasonable size would be? Is 10,000 rows good enough? – mt88 Aug 24 '15 at 20:51
  • 2
    27k * 27k, he has only 27k features, correlation matrix means feature to feature correlation. – Ibraim Ganiev Aug 24 '15 at 20:55