
I have a huge dataset for training a deep learning model. It's in .csv format and around 2 GB, and right now I'm just loading the entire thing into memory with pandas:

import pandas as pd

df = pd.read_csv('test.csv')

and then feeding everything into the Keras model and training it like below:

model.fit(df, targets)

I want to know what other options I have when dealing with even larger datasets, say around 10 GB. I don't have the RAM to load everything into memory and pass it to the model.

One way I can think of is to somehow get a random sample/subset of the data from the .csv file and feed it in via a data generator, but the problem is that I couldn't find any way to read a random subset of a csv file without loading everything into memory.
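The closest I've found is pandas' `chunksize` option, but that streams sequential chunks rather than random samples:

import pandas as pd

# reads the file in 10,000-row pieces without holding it all in
# memory, but chunks arrive in file order, not randomly sampled
for chunk in pd.read_csv('test.csv', chunksize=10_000):
    print(chunk.shape)  # each chunk is a regular DataFrame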

How can I train the model without loading everything into memory? Solutions that still use some memory are fine; just let me know.

user_12
  • @aws_apprentice Is that the only other way? – user_12 Jan 25 '20 at 14:06
  • the docs mention you can supply a _generator_ as your `x` argument, so that is an option, although under the hood I assume `keras` loads it all in anyway? This is what the docs say: `A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample weights).` – gold_cy Jan 25 '20 at 14:07
  • it'll eventually have to load/process it all, but if you use a `Sequence` it can unload earlier parts as it goes. CSV files aren't great for this (random access is difficult); I'd suggest splitting a single large CSV file up into several smaller files, or maybe using another file format – Sam Mason Jan 25 '20 at 14:13
  • possible duplicate of [this](https://stackoverflow.com/questions/23411619/reading-large-text-files-with-pandas) –  Jan 25 '20 at 14:13
  • see e.g. https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly, though I think this is more about keras than about CSV/Pandas processing – Sam Mason Jan 25 '20 at 14:14
  • @SamMason I can build a data generator, but I don't think it's possible to randomly read a few samples every time from a csv file. I couldn't find any other way to deal with such scenarios – user_12 Jan 25 '20 at 14:15
  • If this model in Keras supports online learning, that might be helpful. Also, if it can take non-pandas input, you can likely save a lot of memory by avoiding building the DataFrame. – totalhack Jan 25 '20 at 14:26
  • Does this answer your question? [Reading large text files with Pandas](https://stackoverflow.com/questions/23411619/reading-large-text-files-with-pandas) – AMC Jan 25 '20 at 22:40
  • https://stackoverflow.com/q/25962114/11301900 – AMC Jan 25 '20 at 22:41
  • @AMC Can you please go over the question again? My problem is not only about reading a large text file. I want a way to deal with large datasets and pass them in for training deep learning models. I know I can read the dataset in chunks, but how can we pass those to `model.fit()`? It's not possible. – user_12 Jan 25 '20 at 22:58

1 Answer


I've not used this functionality before, but maybe something like:

import pandas as pd
from keras.utils import Sequence

class CsvSequence(Sequence):
    def __init__(self, batchnames):
        self.batchnames = batchnames

    def __len__(self):
        # number of batches per epoch
        return len(self.batchnames)

    def __getitem__(self, i):
        # load one batch from its pair of feature/target CSV files
        name = self.batchnames[i]
        X = pd.read_csv(name + '-X.csv')
        Y = pd.read_csv(name + '-Y.csv')
        return X, Y

would work. You'd need to preprocess your data by splitting your 10 GB file up into, e.g., 10 smaller files; the Unix split utility might be enough if your CSV files have one record per line (most do). A pandas version of this step is sketched below.
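As a rough sketch of that preprocessing, assuming the labels live in a column named `target` and the big file is called `big.csv` (both placeholders you'd adjust for your data):

import pandas as pd

# stream the big file in chunks and write each chunk out as a
# numbered X/Y pair of files that CsvSequence can read back;
# 'target' is a placeholder name for your label column
for i, chunk in enumerate(pd.read_csv('big.csv', chunksize=100_000), start=1):
    chunk.drop(columns=['target']).to_csv(f'data-{i}-X.csv', index=False)
    chunk[['target']].to_csv(f'data-{i}-Y.csv', index=False)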

As an incomplete example of how to use this:

seq = CsvSequence(['data-1', 'data-2', 'data-3'])

model.fit_generator(seq)

But note that you'd quickly want to do something more efficient: the above causes your CSV files to be re-read and re-parsed on every epoch. It wouldn't surprise me if this loading took more time than everything else put together.

One suggestion would be to preprocess the files once before training, saving them as numpy binary files. The binary files can then be mmapped in while loading, which is much more efficient.
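A minimal sketch of that conversion, reusing the batch names from above; `np.load` with `mmap_mode='r'` maps the file into memory so pages are only read from disk as they're touched:

import numpy as np
import pandas as pd

# one-off conversion: turn each CSV batch into a binary .npy file
for name in ['data-1', 'data-2', 'data-3']:
    np.save(name + '-X.npy', pd.read_csv(name + '-X.csv').to_numpy())

# at training time, memory-map the array instead of parsing CSV
X = np.load('data-1-X.npy', mmap_mode='r')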

Sam Mason
  • How would this work? Can you explain a bit more? After splitting our main dataset into multiple files, will this class load each file? How can I connect the data generator to Keras fit? – user_12 Jan 25 '20 at 14:32
  • You'd just use [`model.fit_generator`](https://keras.io/models/model/#fit_generator), passing an instance of this class. Note that using CSV files is pretty inefficient, and it'll waste a lot of time reloading files from disk. You'd want to use something that can be read efficiently (or ideally `mmap`ed in). Maybe have a look at `numpy.load` to see what it can handle – Sam Mason Jan 25 '20 at 14:41