I have a neural network that takes arrays of 1024 doubles and returns a single value. I've trained it, and now I want to run it on a set of test data. The file I want to test it on is very big (8.2 GB), and whenever I try to import it into Google Colab the session runs out of RAM and crashes. I'm reading it in with read_csv as follows: testsignals = pd.read_csv("/content/drive/My Drive/MyFile.txt", delimiter="\n", header=None). Is there a more efficient way of reading in the file, or will I just have to work with smaller data sets?

Edit: following Prune's comment, I looked at the advice in the linked question and tried this:

import csv
testsignals = []
with open('/content/drive/MyDrive/MyFile') as csvfile:
  reader = csv.reader(csvfile)
  for line in reader:
    testsignals.append(line)

But it still exceeds the RAM.

If anyone can give me some help I'd be very grateful!

Beth Long
  • Please do the [expected research](/questions/38584494/python-generator-to-read-large-csv-file) before posting a question. The link is a starting point for a variety of solutions. – Prune Aug 06 '20 at 22:40
  • Hi @Prune, I looked at the link you posted and most of them seem to rely on csv.reader, so I tried that but the RAM is still exceeded. I've edited the post to show this. – Beth Long Aug 06 '20 at 23:19
  • The basic problem is still the same: you're trying to read the entire file into a single data object, all at once. You do not have enough memory to hold it all. You have to read a chunk of data, process it, and dispose of it -- *then* you reuse the space for the next chunk. Look at the generator example. – Prune Aug 07 '20 at 00:06
  • In your code you append all lines to `testsignals`, so it can't help you - you have to read a few lines, use them, and remove them from memory before you read the next few lines. In `pandas` you can use `for few_lines in read_csv(..., chunksize=rows_number): ...` to read a few lines at a time. See [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) – furas Aug 07 '20 at 00:06
  • In the `pandas` docs, see also [Iterating through files chunk by chunk](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#iterating-through-files-chunk-by-chunk) – furas Aug 07 '20 at 00:13
  • Some models have the option of learning incrementally, so you can feed them chunks of training data; scikit-learn has this [implementation](https://scikit-learn.org/stable/modules/computing.html#incremental-learning) – RichieV Aug 07 '20 at 00:22
  • OK, so from the comments I'm getting the impression that there's no way to optimise the RAM usage when reading the file into a single array to pass to my Keras model.predict. I'll look into possibilities of batch predictions. Thanks everyone for your help. – Beth Long Aug 07 '20 at 09:33
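
A minimal sketch of the chunked reading furas suggests, combined with batch predictions as planned in the last comment. It assumes the file can be parsed as comma-separated rows of 1024 values and that a trained Keras model is already available; the path, chunk size, and `model` name are placeholders, not part of the original question:

import numpy as np
import pandas as pd

# `model` is assumed to be the already-trained Keras model,
# e.g. loaded beforehand with keras.models.load_model(...).
CHUNK_ROWS = 10_000  # rows held in memory at any one time (tune to taste)

predictions = []
# chunksize makes read_csv return an iterator of DataFrames, so only one
# chunk of the 8.2 GB file is in memory at a time.
for chunk in pd.read_csv("/content/drive/My Drive/MyFile.txt",
                         header=None, chunksize=CHUNK_ROWS):
    batch = chunk.to_numpy(dtype=np.float64)   # shape: (rows_in_chunk, 1024)
    predictions.append(model.predict(batch))   # keep only the small outputs

predictions = np.concatenate(predictions)      # one prediction per input row

Peak memory is then roughly one chunk plus the model, rather than the whole file.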
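
For the plain-csv route in the question Prune links, the key change from the code above is to yield rows from a generator instead of appending everything to `testsignals`. A rough sketch, assuming one sample per line and values that parse as floats; the helper name `iter_rows` is made up for illustration:

import csv

def iter_rows(path):
    """Yield one parsed row at a time, so only a single line is in memory."""
    with open(path, newline='') as csvfile:
        for row in csv.reader(csvfile):
            yield [float(x) for x in row]

# Process each row (or a small batch of rows) and let it be garbage-collected,
# instead of storing the whole file in a list.
for row in iter_rows('/content/drive/MyDrive/MyFile'):
    ...  # e.g. collect a small batch here and pass it to model.predict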
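
As for RichieV's point about incremental learning, that helps at training time rather than prediction time, but for completeness here is a hypothetical scikit-learn sketch using `partial_fit`; the `SGDRegressor` choice, the file name, and the assumption that the target sits in the last column are all illustrative:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
# partial_fit updates the estimator one chunk at a time, so the full
# training file never has to fit in memory.
for chunk in pd.read_csv("train.csv", header=None, chunksize=10_000):
    data = chunk.to_numpy(dtype=np.float64)
    X, y = data[:, :-1], data[:, -1]  # assumption: last column is the target
    model.partial_fit(X, y)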

0 Answers