I want to apply incremental PCA to a large file, so I used this SO thread as a guide: "Python PCA on Matrix too large to fit into memory".
Following that thread, I first tried to process a sample file with this code:
from sklearn.decomposition import IncrementalPCA
import csv
import sys
import numpy as np
import pandas as pd

chunksize_ = 3 * 100
dimensions = 300
cols = [i for i in range(1, 5502)]
reader = pd.read_csv("D:\PHD\obranking\\demo.csv", usecols=cols, chunksize = chunksize_)
sklearn_pca = IncrementalPCA(n_components=dimensions)
for chunk in reader:
    #y = chunk.pop('y')
    sklearn_pca.partial_fit(chunk)

# Computed mean per feature
mean = sklearn_pca.mean_
# and stddev
stddev = np.sqrt(sklearn_pca.var_)

Xtransformed = None
for chunk in pd.read_csv("D:\PHD\obranking\\demo.csv",usecols=cols, chunksize = chunksize_):
    #y = chunk.pop('y')
    Xchunk = sklearn_pca.transform(chunk)
    if Xtransformed == None:
        Xtransformed = Xchunk
    else:
        Xtransformed = np.vstack((Xtransformed, Xchunk))
But running it produces the following warnings:
"C:\Users\Rahul Gupta\PycharmProjects\CSVLearn\venv\Scripts\python.exe" "C:/Users/Rahul Gupta/PycharmProjects/CSVLearn/PCALearn.py"
C:\Users\Rahul Gupta\PycharmProjects\CSVLearn\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py:309: RuntimeWarning: Mean of empty slice.
  explained_variance[self.n_components_:].mean()
C:\Users\Rahul Gupta\PycharmProjects\CSVLearn\venv\lib\site-packages\numpy\core\_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Process finished with exit code 0
It seems to me that the 0th column, which holds the class label, is being accessed; one of its class labels is BW. However, I used usecols specifically to select the columns at indices range(1, 5502), although in the CSV those columns have strings as their names.
I also commented out y = chunk.pop('y') because I could not understand its purpose in the linked thread. Please explain what this line of code does and what the problem with my code is.
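For reference, here is a minimal sketch of how `usecols` with integer positions and `DataFrame.pop` appear to behave. The tiny in-memory CSV and its column names (`y`, `f1`, `f2`, `f3`) are made up for illustration and are not taken from the real demo.csv.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: a string label column named 'y' followed by
# three numeric feature columns (a stand-in for the real file).
csv_text = "y,f1,f2,f3\nBW,1.0,2.0,3.0\nCL,4.0,5.0,6.0\n"

# Integer entries in usecols select columns by position, so positions 1..3
# skip the label column at position 0 even though the headers are strings.
features = pd.read_csv(io.StringIO(csv_text), usecols=[1, 2, 3])
print(list(features.columns))

# Reading every column and then popping 'y' removes the label column from
# the frame in place and returns it as a separate Series.
full = pd.read_csv(io.StringIO(csv_text))
labels = full.pop("y")
print(list(labels))
print(list(full.columns))
```

If this reading is right, the `pop` line in the linked thread would only be needed when the label column is still part of each chunk, which should not be the case once `usecols` starts at position 1.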