
I recently got a dataset that is too large for my RAM, so I have to read it in chunks using

pd.read_csv('filename.csv', chunksize=1024)

All the labels in the dataset are contiguous, i.e. all the zeros come first, then all the ones, then all the twos. There are 12000 of each label, so each chunk contains only zeros, only ones, or only twos.

The problem I have is that even if I shuffle and use train_test_split, I still get only one label in my train data. As a result, my model learns to output a single constant value for any input; which constant it outputs depends on the random seed. I need to know how to fix this.

EDIT: Here is the code as requested

import pandas as pd
from sklearn.model_selection import train_test_split

data_in_chunks = pd.read_csv(data_file, chunksize=4096)
data = next(iter(data_in_chunks))
X = data.drop(['labels'], axis=1)
Y = data.labels
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, stratify=Y, random_state=0) # train_test_split's random_state has no effect on the problem
for i in iter(data_in_chunks):
    train(i) # simplified; I used optim in the actual code

So, to explain the problem in other words: 4096 is the highest chunksize my 16 GB of RAM can handle, and because the labels are stored sequentially, my Y_train and Y_val each contain only 0, only 1, or only 2 (out of all the possible outputs).
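A quick diagnostic sketch (assuming the same data_file and 'labels' column from the code above) shows the per-chunk label distribution, which confirms every chunk is single-class:

# Minimal diagnostic sketch: print the label distribution of each chunk
# to confirm that every chunk contains a single class.
import pandas as pd

for chunk in pd.read_csv(data_file, chunksize=4096):
    print(chunk['labels'].value_counts())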

Please help. Thanks in advance.


1 Answer


You could solve the label-order issue by randomly shuffling the .csv on disk with a utility such as https://github.com/alexandres/terashuf, depending on your OS.

EDIT

A solution using only pandas and the standard library can be implemented with the skiprows argument of read_csv.

import pandas as pd
import random
from math import ceil

def read_shuffled_chunks(filepath: str, chunk_size: int,
                         file_length: int, has_header=True):

    header = 0 if has_header else None
    first_data_idx = 1 if has_header else 0
    # build the list of data row indices (excluding the header, if any)
    index_list = list(range(first_data_idx, file_length))

    # shuffle the list in place
    random.shuffle(index_list)

    # iterate through the chunks and read them
    n_chunks = ceil(len(index_list) / chunk_size)
    for i in range(n_chunks):

        rows_to_keep = index_list[(i * chunk_size):((i + 1) * chunk_size)]
        if has_header:
            rows_to_keep += [0]  # keep the header row
        # skip every data row that is not part of this chunk
        rows_to_skip = list(set(index_list) - set(rows_to_keep))
        yield pd.read_csv(filepath, skiprows=rows_to_skip, header=header)
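A usage sketch, assuming data_file and train() from the question, with the file length counted up front:

# Count the total number of lines once (header included), then iterate
# over the shuffled chunks and feed each one to the training function.
with open(data_file) as f:
    file_length = sum(1 for _ in f)

for chunk in read_shuffled_chunks(data_file, chunk_size=4096,
                                  file_length=file_length, has_header=True):
    train(chunk)  # train() is the asker's training function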

Please note that, although the rows included in each chunk are randomly sampled from the csv, pandas still reads them in their original file order. If you are training your model with batches drawn from each data chunk, you might want to randomize each chunk's DataFrame as well, to avoid running into the same issue.
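For example (a minimal sketch; DataFrame.sample(frac=1) returns all rows in a random order):

# Shuffle the rows of a chunk DataFrame before building training batches
shuffled_chunk = chunk.sample(frac=1).reset_index(drop=True)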

  • Is there no way for me to do it using pandas or numpy? – DS3a Apr 29 '20 at 12:04
  • I guess you could create batches by using the `skiprows` argument as suggested in [this answer](https://stackoverflow.com/a/22259008/8294752) and call `read_csv` multiple times with the different sets. – arabinelli Apr 29 '20 at 13:04
  • Thanks a lot... Do you have any links where I can learn data management? It has recently come to my understanding that collecting and managing the data is the most important part of deep/machine learning, not the models themselves xD – DS3a May 01 '20 at 07:23