I am trying to run a Python script on an Amazon EC2 Free Tier instance. The input file is 4 GB, and when I try to read it into a DataFrame with pandas read_csv, I get the error shown below.
I have tried both the chunksize and low_memory options, but every variation still fails with a similar error:
import pandas as pd

# read the file in chunks, then combine them into a single DataFrame
chunks = []
for chunk in pd.read_csv('../data/train.csv', chunksize=1000, low_memory=False):
    chunks.append(chunk)
train = pd.concat(chunks, axis=0)
Error description:
Traceback (most recent call last):
File "imports.py", line 54, in <module>
for chunk in pd.read_csv('../data/train.csv', chunksize=1000, low_memory=False):
File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1007, in __next__
return self.get_chunk()
File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
return self.read(nrows=size)
File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 879, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
Is there anything I can change in the script, or on the EC2 instance itself, to work around this?
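For example, would processing each chunk as it is read, instead of concatenating everything back into one DataFrame, keep memory usage bounded? A rough sketch of what I have in mind (the 'target' column and the running mean are only placeholders, since the real per-chunk processing isn't decided yet):

import pandas as pd

# keep only small running totals per chunk instead of holding the whole 4 GB file in memory
total = 0.0
count = 0
for chunk in pd.read_csv('../data/train.csv', chunksize=1000, low_memory=False):
    # 'target' is a placeholder column name; the real per-chunk work would go here
    total += chunk['target'].sum()
    count += chunk['target'].count()

print('mean of target:', total / count)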