Incremental PCA with large csv

Question

I intend to apply incremental PCA to a large file, so I used this SO thread as a guide: Python PCA on Matrix too large to fit into memory

Following that thread, I first tried to process my sample file with this code:

from sklearn.decomposition import IncrementalPCA
import csv
import sys
import numpy as np
import pandas as pd

chunksize_ = 3 * 100
dimensions = 300
cols = [i for i in range(1, 5502)]

reader = pd.read_csv(r"D:\PHD\obranking\demo.csv", usecols=cols, chunksize=chunksize_)
sklearn_pca = IncrementalPCA(n_components=dimensions)

for chunk in reader:
    #y = chunk.pop('y')
    sklearn_pca.partial_fit(chunk)

# Computed mean per feature
mean = sklearn_pca.mean_
# and stddev
stddev = np.sqrt(sklearn_pca.var_)

Xtransformed = None
for chunk in pd.read_csv(r"D:\PHD\obranking\demo.csv", usecols=cols, chunksize=chunksize_):
    #y = chunk.pop('y')
    Xchunk = sklearn_pca.transform(chunk)
    # Use "is None": comparing an ndarray to None with == is elementwise
    if Xtransformed is None:
        Xtransformed = Xchunk
    else:
        Xtransformed = np.vstack((Xtransformed, Xchunk))

But I get the following warnings:

    "C:\Users\Rahul Gupta\PycharmProjects\CSVLearn\venv\Scripts\python.exe" "C:/Users/Rahul Gupta/PycharmProjects/CSVLearn/PCALearn.py"
C:\Users\Rahul Gupta\PycharmProjects\CSVLearn\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py:309: RuntimeWarning: Mean of empty slice.
  explained_variance[self.n_components_:].mean()
C:\Users\Rahul Gupta\PycharmProjects\CSVLearn\venv\lib\site-packages\numpy\core\_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

Process finished with exit code 0

It seems to me that the 0th column, which holds the class label (one of the labels is BW), is being accessed. However, I used usecols to select only the columns at index positions range(1, 5502), although in the CSV those columns have string names. I also commented out y = chunk.pop('y') because I could not understand its purpose in the linked thread. Please explain what that line does and what is wrong with my code.
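
For reference, here is a minimal sketch of what `chunk.pop('y')` does in pandas: `DataFrame.pop` removes the named column from the frame and returns it as a Series. The column name `y` and the toy values are assumptions for illustration only; presumably the data in the linked thread had a target column with that name, which this CSV may not have.

```python
import pandas as pd

# Hypothetical chunk: a label column named "y" plus two feature columns
chunk = pd.DataFrame({"y": ["BW", "AB"], "f1": [0.1, 0.2], "f2": [1.0, 2.0]})

# pop() removes the column from the frame in place and returns it
y = chunk.pop("y")

remaining_columns = list(chunk.columns)
print(remaining_columns)  # only the feature columns are left
print(list(y))            # the labels that were split off
```

If the label column is already excluded through `usecols`, there is nothing left to pop, so leaving that line commented out should be harmless.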

Where exactly does the error pop up? Please update the question with the full error trace. Also, **remove** one of the 2 `for` loops from your code (the one not relevant to the error here). — desertnaut, Apr 18 '20 at 16:09
You mean, your first loop `for chunk in reader:` runs OK, and your second loop `for chunk in pd.read_csv():` produces this error? — desertnaut, Apr 18 '20 at 16:47
@desertnaut Following your lead, I modified my second loop to pass `usecols` to `pd.read_csv()`, which I had not done before. Now I only get warnings, so I am editing the question with the current code and output. Kindly suggest what to do next. — Rahul Gupta, Apr 19 '20 at 03:32
Try printing the length of each chunk...seems like the last chunk is perhaps too small? — ec2604, Apr 19 '20 at 09:22
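
The chunk-length check suggested above can be sketched as follows. The in-memory CSV, its column names, and the chunk size of 3 are hypothetical stand-ins for demo.csv; the point is only that the final chunk may hold fewer rows than `chunksize`.

```python
import io
import pandas as pd

# Hypothetical stand-in for demo.csv: a label column and three features, 7 rows
rows = [f"BW,{i},{i + 1},{i + 2}" for i in range(7)]
csv_text = "label,f1,f2,f3\n" + "\n".join(rows)

# Read only the feature columns (positions 1..3), three rows at a time
reader = pd.read_csv(io.StringIO(csv_text), usecols=[1, 2, 3], chunksize=3)

chunk_lengths = [len(chunk) for chunk in reader]
print(chunk_lengths)  # the last entry is the size of the final, possibly short, chunk
```

`IncrementalPCA.partial_fit` expects each batch to contain at least `n_components` rows, so a short final chunk is worth checking for.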
I had these warnings pop up too in my IncrementalPCA code with CSV data. They come up during calculation of the `noise_variance_` attribute, which is not the main part of PCA. At some point, during calls to `.partial_fit()`, there is an empty array created with the slice code snippet printed with the warning, which results in both of the warnings being generated. The second warning is generated when trying to calculate the mean of an empty array. — AlexK, Jun 06 '21 at 06:59
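
The empty-slice behaviour described above can be reproduced with NumPy alone. This sketch only mimics the slice expression printed with the warning; the array values and `n_components` are made up. As an inference that the thread does not confirm: with `chunksize_ = 300` and `n_components=300` in the question, a 300-row chunk yields at most 300 singular values, so slicing from index 300 onward would leave nothing to average.

```python
import warnings
import numpy as np

# Made-up values: as many variance entries as requested components
explained_variance = np.array([3.0, 2.0, 1.0])
n_components = 3

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Same shape of expression as in the warning: the slice is empty here
    noise_variance = explained_variance[n_components:].mean()

messages = [str(w.message) for w in caught]
print(noise_variance)
print(messages)
```

The result is `nan` for the noise variance only; the fitted components themselves do not come from this expression.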
