
I am trying to use the Amazon EC2 Free Tier to run a Python script. The input file is 4 GB. While reading it into a DataFrame with pandas read_csv, I get the error below.

I have tried both the chunksize and low_memory options, but every attempt fails with a similar error:

import pandas as pd

chunks = []

# Read the 4 GB file in 1000-row chunks, then concatenate the chunks
# back into a single DataFrame
for chunk in pd.read_csv('train.csv', chunksize=1000, low_memory=False):
    chunks.append(chunk)

train = pd.concat(chunks, axis=0)

Error description:

Traceback (most recent call last):
  File "imports.py", line 54, in <module>
    for chunk in pd.read_csv('../data/train.csv', chunksize=1000, low_memory=False):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 879, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory

Is there any mechanism that can be applied in the script or on EC2 to solve this issue?

cod_rg567
  • Try this? https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data/29383624 – xyzjayne Jul 18 '18 at 19:20
  • Well, the chunk option isn't going to help you if you just concatenate everything back together at the end, since it couldn't fit into memory in the first place. It's really meant for you to operate on the chunks one at a time, so you only ever need to read a smaller piece into memory (see the sketch after these comments). – ALollz Jul 18 '18 at 19:20
  • @ALollz True; note that the error was thrown at an earlier line. – xyzjayne Jul 18 '18 at 19:22
  • @ALollz Right, I thought so too, but gave it a try since it was mentioned on many forums. – cod_rg567 Jul 18 '18 at 19:33
  • @xyzjayne Tried `error_bad_lines=False` from stackoverflow.com/questions/18039057/… - same issue. The file also works fine locally, but because of its size my system was getting slow, so I was trying to use EC2. – cod_rg567 Jul 18 '18 at 19:34
  • @xyzjayne I could be wrong, but I believe the error occurs on that line because with chunksize specified it's just a lazy iterator, so the error won't occur until it tries to read in the chunk, but runs out of memory – ALollz Jul 18 '18 at 19:39
  • If it's truly a memory error, I'd recommend reading in fewer columns instead... – xyzjayne Jul 18 '18 at 20:02
  • Try using the `nrows` argument in read_csv. If the problem is due to the huge size of the CSV file, you can use smaller values like 100 or 1000. – Krishna Jul 18 '18 at 20:04
  • I am able to read the file in parts, but I actually need to work on the entire file with all rows and columns (this is the primary reason I switched to AWS). I was wondering if there is a way to use more memory during processing on EC2? – cod_rg567 Jul 18 '18 at 20:23
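
Putting the comment suggestions together, the chunk-wise pattern looks roughly like the sketch below. The column names, dtype mapping, and running-sum aggregation are illustrative placeholders rather than details from the post; the point is that each chunk is reduced to a small result instead of being accumulated into one large DataFrame.

import pandas as pd

# Hypothetical column names and dtypes: restrict the read to the columns
# actually needed and use smaller dtypes so each chunk takes less memory.
usecols = ['col_a', 'col_b']
dtypes = {'col_a': 'float32', 'col_b': 'float32'}

row_count = 0
partial_sums = []

for chunk in pd.read_csv('train.csv', chunksize=100000,
                         usecols=usecols, dtype=dtypes):
    # Work on one chunk at a time and keep only the small per-chunk result;
    # the chunk itself is discarded before the next iteration.
    row_count += len(chunk)
    partial_sums.append(chunk['col_a'].sum())

print(row_count, sum(partial_sums))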

0 Answers