
I'm using Python with a 6.5 GB dataset on a server that has hundreds of GB of RAM (confirmed with psutil), but I'm getting memory errors when trying to load the file into pandas. Here is the output of psutil:

import psutil
psutil.virtual_memory()

svmem(total=405042839552, available=254328373248, percent=37.2, used=148782104576, free=148047446016, active=79192813568, inactive=96666456064, buffers=20480, cached=108213268480, shared=767070208, slab=4305301504)
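These fields are in bytes; converting `available` confirms the headline number:

print(psutil.virtual_memory().available / 1e9)  # ≈ 254.3 (decimal GB)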

So psutil shows 254.3 GB of RAM available, but when I try to load the 6.5 GB file, I get the following traceback:

# filename is 6.5 GB
df = pd.read_table(filename, sep='\t')

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-8-0b957ec637b5> in <module>
      1 # filename is 6.5 GB
----> 2 df = pd.read_table(filename, sep='\t')

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/io/parsers.py in read_table(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    765         # default to avoid a ValueError
    766         sep = ","
--> 767     return read_csv(**locals())
    768 
    769 

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    686     )
    687 
--> 688     return _read(filepath_or_buffer, kwds)
    689 
    690 

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    458 
    459     try:
--> 460         data = parser.read(nrows)
    461     finally:
    462         parser.close()

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1196     def read(self, nrows=None):
   1197         nrows = _validate_integer("nrows", nrows)
-> 1198         ret = self._engine.read(nrows)
   1199 
   1200         # May alter columns / col_dict

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   2155     def read(self, nrows=None):
   2156         try:
-> 2157             data = self._reader.read(nrows)
   2158         except StopIteration:
   2159             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers._concatenate_chunks()

<__array_function__ internals> in concatenate(*args, **kwargs)

MemoryError: Unable to allocate 15.3 MiB for an array with shape (2003397,) and data type float64
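Worth noting: the failed allocation is only 15.3 MiB while psutil reports roughly 254 GB available, which suggests a per-process cap (ulimit, cgroup) or strict overcommit accounting rather than exhausted RAM; this is what the overcommit answer linked in the comments below addresses. A minimal check, assuming a Linux host (`resource` is in the standard library there):

import resource

# Per-process address-space limit; RLIM_INFINITY means uncapped.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("RLIMIT_AS:", soft, hard)

# Kernel overcommit policy; 2 means strict accounting, which can refuse
# allocations even while plenty of physical RAM is free.
with open("/proc/sys/vm/overcommit_memory") as f:
    print("vm.overcommit_memory:", f.read().strip())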
  • For one thing, your array is not 6.5GB in memory... – Mad Physicist Dec 21 '20 at 17:13
  • I'm guessing the memory blow-up in pandas also depends on the datatypes behind the scenes; if you have a datetime stored as a string and pandas attempts to load it as a valid datetime object, that will cause a mismatch in memory. Would recommend throwing the data into a SQL db or using `Pyspark` [see the chunked-read sketch after these comments] – Umar.H Dec 21 '20 at 17:13
  • does this help? https://stackoverflow.com/a/57511555/892493 – drew010 Dec 21 '20 at 17:19
  • @MadPhysicist how much could it potentially expand in memory given that it's almost all floats? – cherrytomato967 Dec 21 '20 at 18:16
  • @drew010 I tried that fix previously, but I'm not root on the server so I don't have permissions – cherrytomato967 Dec 21 '20 at 18:19
  • @Manakin the data are almost all floats. I'll look into putting the data into a SQL db or using Pyspark – cherrytomato967 Dec 21 '20 at 18:19
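Following up on the comments: since the data are almost all floats, one hedged workaround is to stream the file in chunks with an explicit dtype, so pandas neither buffers the whole parse at once nor widens columns during type inference. This is only a sketch; the all-float `dtype` and the chunk size are assumptions, not something the question confirms:

import pandas as pd

filename = "data.tsv"  # stand-in for the 6.5 GB file in the question

# Assumption: every column really is numeric; float32 halves the footprint
# of float64. Mixed columns would need a per-column dtype dict instead.
reader = pd.read_csv(filename, sep="\t", dtype="float32", chunksize=1_000_000)
df = pd.concat(reader, ignore_index=True)

print(df.memory_usage(deep=True).sum() / 1e9, "GB held in memory")

If even the chunked concatenate trips the same limit, writing each chunk into SQLite with `DataFrame.to_sql` (the SQL route Umar.H suggests) avoids ever holding the full table in one process.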

0 Answers