I am trying to read a 3GB file (2.5 million rows, mostly categorical (string) data) into a Pandas dataframe with the read_csv function, and I get an error: out of memory
- I am on a PC with Pandas version 0.18 and 16GB of RAM, so 3GB of data should easily fit in 16GB. (Update: This is not a duplicate question)
- I know that I can provide dtype to improve reading the CSV, but there are too many columns in my data set, and I want to load it first and then decide on the data types.
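A minimal sketch of the "load first, then decide" idea without loading the whole file: read only the first rows with nrows, look at the inferred types, and build a dtype mapping from that. The sample data, the build_dtype_map helper and the choice of "category" for non-numeric columns are all hypothetical, and passing "category" as a dtype to read_csv assumes a Pandas version newer than 0.18.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the real tab-separated file.
sample_tsv = (
    "id\tcity\tscore\n"
    "1\tParis\t0.5\n"
    "2\tRome\t0.7\n"
    "3\tParis\t0.9\n"
)


def build_dtype_map(source, nrows):
    """Read only the first `nrows` rows and propose a dtype for each column."""
    head = pd.read_csv(source, sep="\t", nrows=nrows)
    # Keep numeric dtypes as inferred; treat everything else as categorical.
    return {
        col: (dt if is_numeric_dtype(dt) else "category")
        for col, dt in head.dtypes.items()
    }


# Inspect a small sample, then load the full data with explicit dtypes.
dtype_map = build_dtype_map(io.StringIO(sample_tsv), nrows=2)
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", dtype=dtype_map)
print(df.dtypes)
```

A sample of the first rows may not be representative of columns that change type further down the file, so the mapping would probably still need manual review.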
The Traceback is:
Traceback (most recent call last):
  File "/home/a/Dropbox/Programming/Python/C and d/main.com.py", line 9, in <module>
    preprocessing()
  File "/home/a/Dropbox/Programming/Python/C and d/main.com.py", line 5, in preprocessing
    df = pd.read_csv(filepath_or_buffer = file_path, sep ='\t', low_memory = False)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 285, in _read
    return parser.read()
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 747, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1197, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:8011)
  File "pandas/parser.pyx", line 857, in pandas.parser.TextReader._read_rows (pandas/parser.c:9140)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: out of memory
My code:
import pandas as pd
def preprocessing():
    file_path = r'/home/a/Downloads/main_query.txt'
    df = pd.read_csv(filepath_or_buffer = file_path, sep ='\t', low_memory = False)
The above code produced the error message posted above.
I then tried removing low_memory = False, and everything worked; it only gave a warning:
sys:1: DtypeWarning: Columns (17,20,23,24,33,44,58,118,134,
135,137,142,145,146,147) have mixed types.
Specify dtype option on import or set low_memory=False.
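The first suggestion in the warning (specify dtype) could be applied to just the reported columns rather than to every column. The sketch below is an assumption-heavy illustration: the tiny data set, the read_with_str_columns helper and the mixed_positions list are hypothetical, and it assumes the numbers in the warning are zero-based column positions. It reads only the header to translate positions into column names, then forces those columns to str.

```python
import io

import pandas as pd

# Hypothetical miniature file: column "b" mixes numbers and text.
tsv = "a\tb\tc\n1\t10\tx\n2\tfoo\ty\n3\t30\tz\n"

# In the real case this would be the list from the DtypeWarning,
# assumed here to be zero-based column positions.
mixed_positions = [1]


def read_with_str_columns(make_source, positions):
    """Force the columns at `positions` to str and let the rest be inferred."""
    # nrows=0 parses only the header line, which is cheap even for a large file.
    header = pd.read_csv(make_source(), sep="\t", nrows=0)
    dtype_map = {header.columns[i]: str for i in positions}
    return pd.read_csv(make_source(), sep="\t", dtype=dtype_map)


df = read_with_str_columns(lambda: io.StringIO(tsv), mixed_positions)
print(df.dtypes)
```

With the real file, make_source would simply return the file path instead of a StringIO object.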