
I am trying to read a 3GB file (2.5 million rows, mostly categorical (string) data) into a Pandas DataFrame with the read_csv function and get an error: out of memory

  • I am on a PC with Pandas 0.18 and 16GB of RAM, so 3GB of data should easily fit in 16GB. (Update: This is not a duplicate question.)
  • I know that I can provide dtype to improve reading of the CSV, but there are too many columns in my data set and I want to load it first, then decide on data types (a sketch of inspecting a small sample first is shown below).
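
A minimal sketch of inspecting a small sample before the full load, assuming the file path and tab separator used in the code below; the nrows value and the printed summaries are illustrative:

import pandas as pd

# Hypothetical sketch: read only the first 100000 rows to see which dtypes
# pandas infers and how much memory the string columns really take.
file_path = '/home/a/Downloads/main_query.txt'
sample = pd.read_csv(file_path, sep='\t', nrows=100000)

print(sample.dtypes)
print(sample.memory_usage(deep=True).sum() / 1e6, 'MB for the sample')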

The Traceback is:

Traceback (most recent call last):
  File "/home/a/Dropbox/Programming/Python/C and d/main.com.py", line 9, in <module>
    preprocessing()
  File "/home/a/Dropbox/Programming/Python/C and d/main.com.py", line 5, in preprocessing
    df = pd.read_csv(filepath_or_buffer = file_path, sep ='\t', low_memory = False)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 285, in _read
    return parser.read()
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 747, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1197, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:8011)
  File "pandas/parser.pyx", line 857, in pandas.parser.TextReader._read_rows (pandas/parser.c:9140)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: out of memory

My code:

import pandas as pd
def preprocessing():
    file_path = r'/home/a/Downloads/main_query.txt'
    df = pd.read_csv(filepath_or_buffer = file_path, sep ='\t', low_memory = False)

The above code produced the error message posted above.

I then tried removing low_memory=False, and everything worked; it only gave a warning:

sys:1: DtypeWarning: Columns (17,20,23,24,33,44,58,118,134,
135,137,142,145,146,147) have mixed types.
Specify dtype option on import or set low_memory=False.
  • you may try [this method](http://stackoverflow.com/a/37845530/5741205) – MaxU - stand with Ukraine Sep 16 '16 at 21:58
  • @MaxU Thank you for the suggestion. It is also possible to specify `dtype` to reduce memory consumption. Would you please remove the duplicate tag? The question you are referring to is poorly stated. Nowhere in the Pandas documentation can you find a limit on file size, so whether your file is 6GB or 600TB, as long as there is enough RAM it should be handled. It might be slow, but that is not the point. Previously there were bugs in pandas memory handling and they were resolved. This one appears to be a bug as well, so it needs proper attention. – user1700890 Sep 17 '16 at 16:29
  • could you please post the full error traceback? Sure you can use `dtype`, but we can't see your data, so we can't suggest values for the `dtype` parameter... – MaxU - stand with Ukraine Sep 17 '16 at 16:33

2 Answers


UPDATE: in Pandas 0.19.0 it should be possible to specify a categorical dtype when using the read_csv() method:

pd.read_csv(filename, dtype={'col1': 'category'})

so you may try to use pandas 0.19.0 RC1
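
If the string columns are not known up front, the category mapping could be built from a small sample first; a rough sketch (the sample size and the tab separator are assumptions here):

import pandas as pd

# Sketch: mark every object (string) column found in a small sample as
# 'category' when reading the full file. Requires pandas >= 0.19.
sample = pd.read_csv(filename, sep='\t', nrows=10000)
cat_cols = sample.select_dtypes(include=['object']).columns
dtype_map = {col: 'category' for col in cat_cols}

df = pd.read_csv(filename, sep='\t', dtype=dtype_map)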

OLD answer:

you can read your CSV in chunks and concatenate the chunks into the resulting DataFrame at each step:

import numpy as np
import pandas as pd

chunksize = 10**5
df = pd.DataFrame()

# the dtype entries below are placeholders - add one per column, ...
for chunk in pd.read_csv(filename,
                         dtype={'col1': np.int8, 'col2': np.int32},
                         chunksize=chunksize):
    df = pd.concat([df, chunk], ignore_index=True)

NOTE: parameter dtype is unsupported with engine='python'
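
A variant of the same idea that avoids re-copying the growing DataFrame on every iteration is to collect the chunks in a list and concatenate once at the end (a sketch using the same placeholder filename and dtype mapping as above):

import numpy as np
import pandas as pd

chunksize = 10**5
chunks = []

# each chunk is parsed with the reduced dtypes; the full DataFrame is
# assembled with a single concat instead of one concat per chunk
for chunk in pd.read_csv(filename,
                         dtype={'col1': np.int8, 'col2': np.int32},
                         chunksize=chunksize):
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)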

MaxU - stand with Ukraine

The question is a duplicate. When you call read_csv but don't specify dtypes, and floats, ints, dates and categoricals are read in as (unique) strings, you can easily use up gigabytes. So take a little time to specify dtypes.

  1. Categoricals read in and stored as strings (as opposed to categorical) take tons of memory.
  • (pandas will under-report memory usage for dataframes with strings unless you use df.info(memory_usage='deep') or df.memory_usage(deep=True))
  2. As of pandas 0.19, you no longer need to specify each categorical variable's levels. Just do pd.read_csv(..., dtype={'foo': 'category', 'bar': 'category', ...})
  3. That should solve everything. In the extremely unlikely event you still run out of memory, then also debug like this (a sketch follows this list):
  • only read in a subset of columns, say usecols=['foo', 'bar', 'baz']
  • only read in a subset of rows (say nrows=1e5, or see also skiprows=...)
  • and iteratively figure out each categorical's levels and how much memory it uses. You don't need to read in all rows or columns to figure out one categorical column's levels.
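
A rough sketch of that debugging step, with hypothetical column names in probe_cols, an assumed tab separator, and file_path taken from the question:

import pandas as pd

# Sketch: probe a few suspect columns on a row subset and report how many
# levels each one has and how much memory it uses as str vs. as category.
probe_cols = ['foo', 'bar', 'baz']   # hypothetical column names
sample = pd.read_csv(file_path, sep='\t', usecols=probe_cols, nrows=100000)

for col in probe_cols:
    n_levels = sample[col].nunique()
    as_str = sample[col].memory_usage(deep=True) / 1e6
    as_cat = sample[col].astype('category').memory_usage(deep=True) / 1e6
    print('%s: %d levels, %.1f MB as str -> %.1f MB as category'
          % (col, n_levels, as_str, as_cat))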
smci
  • This is not an answer to the question. OP is trying to read a 3GB file. Even if all the columns are read as strings, a 16GB machine should not go out of memory – anishtain4 Jan 18 '22 at 04:12
  • @anishtain4: this is absolutely the answer! **Reading floats, ints, dates and categoricals as (unique) strings** can use up Gigabytes. *"I know I can specify dtype to read_csv, but... I want to load it first, then decide on data type."* is the best surefire way to use up tons of unnecessary memory. – smci Jan 18 '22 at 04:30
  • floats, ints, dates, and categoricals are saved as STRINGS in a csv file; there is no compression. If the file size is 3GB, a machine with 16GB goes out of memory only if Pandas makes multiple copies of the data while reading it (which is what's happening). So no, you have not answered the question at all. – anishtain4 Jan 18 '22 at 04:35