I recently got this dataset which is too large for my RAM. I have to read it in chunks using
pd.read_csv('filename.csv', chunksize=1024)
And all the labels in the dataset are contiguous, i.e. all the zeros come first, then all the ones, then all the twos. There are 12000 of each label, so each chunk contains only zeros, only ones, or only twos.
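A quick check along these lines (just a sketch, using the placeholder filename from above and the 'labels' column from my real data) shows a single label value per chunk:

import pandas as pd

for chunk in pd.read_csv('filename.csv', chunksize=1024):
    print(chunk['labels'].unique())  # prints one value per chunk: [0], [0], ..., [1], ..., [2]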
The problem I have is that even if I randomize (shuffle the chunk's rows) and use train_test_split, I still get all the same labels in my training data. As a result, my model learns to output a single value for any input, and which value it outputs depends on the random seed. I need to know how to fix this.
EDIT: Here is the code as requested
import pandas as pd
from sklearn.model_selection import train_test_split

data_in_chunks = pd.read_csv(data_file, chunksize=4096)

# take the first chunk for the train/validation split
data = next(iter(data_in_chunks))
X = data.drop(['labels'], axis=1)
Y = data.labels
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, stratify=Y, random_state=0)  # changing random_state has no effect

# train on the remaining chunks
for i in data_in_chunks:
    train(i)  # this is just simplified, I used optim in the actual code
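The shuffling I mentioned above looks roughly like this (a sketch, shuffling the chunk's rows before the split), and it obviously can't help, since the whole chunk holds a single label:

# shuffle the rows of the chunk before splitting
data = data.sample(frac=1, random_state=0).reset_index(drop=True)
X = data.drop(['labels'], axis=1)
Y = data.labels
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, stratify=Y, random_state=0)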
So, to explain the problem in other words: 4096 is the highest chunksize my 16 GB of RAM can handle, and because of the sequential order of the labels, my Y_train and Y_val each end up containing only 0s, only 1s, or only 2s (out of all the possible outputs).
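This is the kind of check that shows it (a sketch of what I see):

print(Y_train.unique())  # e.g. [0]
print(Y_val.unique())    # e.g. [0] -- the same single label every time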
Please help. Thanks in advance.