
I'm using Python with a 6.5 GB dataset on a server that has hundreds of GB of RAM (confirmed with psutil), but I'm getting memory errors when trying to load the file into pandas. Here is the output of psutil:

import psutil
psutil.virtual_memory()

svmem(total=405042839552, available=254328373248, percent=37.2, used=148782104576, free=148047446016, active=79192813568, inactive=96666456064, buffers=20480, cached=108213268480, shared=767070208, slab=4305301504)
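These fields are in bytes; converting `available` confirms the headline number:

print(psutil.virtual_memory().available / 1e9)  # ≈ 254.3 (decimal GB)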

So psutil shows 254.3 GB of RAM available, but when I try to load the 6.5 GB file, I get the following traceback:

# filename is 6.5 GB
df = pd.read_table(filename, sep='\t')

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-8-0b957ec637b5> in <module>
      1 # filename is 6.5 GB
----> 2 df = pd.read_table(filename, sep='\t')

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/io/parsers.py in read_table(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    765         # default to avoid a ValueError
    766         sep = ","
--> 767     return read_csv(**locals())
    768 
    769 

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    686     )
    687 
--> 688     return _read(filepath_or_buffer, kwds)
    689 
    690 

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    458 
    459     try:
--> 460         data = parser.read(nrows)
    461     finally:
    462         parser.close()

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1196     def read(self, nrows=None):
   1197         nrows = _validate_integer("nrows", nrows)
-> 1198         ret = self._engine.read(nrows)
   1199 
   1200         # May alter columns / col_dict

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   2155     def read(self, nrows=None):
   2156         try:
-> 2157             data = self._reader.read(nrows)
   2158         except StopIteration:
   2159             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers._concatenate_chunks()

<__array_function__ internals> in concatenate(*args, **kwargs)

MemoryError: Unable to allocate 15.3 MiB for an array with shape (2003397,) and data type float64
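Worth noting: the failed allocation is only 15.3 MiB while psutil reports roughly 254 GB available, which suggests a per-process cap (ulimit, cgroup) or strict overcommit accounting rather than exhausted RAM; this is what the overcommit answer linked in the comments below addresses. A minimal check, assuming a Linux host (`resource` is in the standard library there):

import resource

# Per-process address-space limit; RLIM_INFINITY means uncapped.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("RLIMIT_AS:", soft, hard)

# Kernel overcommit policy; 2 means strict accounting, which can refuse
# allocations even while plenty of physical RAM is free.
with open("/proc/sys/vm/overcommit_memory") as f:
    print("vm.overcommit_memory:", f.read().strip())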
  • For one thing, your array is not 6.5GB in memory... – Mad Physicist Dec 21 '20 at 17:13
  • I'm guessing the memory blow-up in pandas also depends on the datatypes behind the scenes; if you have a datetime stored as a string and pandas attempts to load it as a valid datetime object, that will cause a mismatch in memory. Would recommend throwing the data into a SQL db or using `Pyspark` [see the chunked-read sketch after these comments] – Umar.H Dec 21 '20 at 17:13
  • does this help? https://stackoverflow.com/a/57511555/892493 – drew010 Dec 21 '20 at 17:19
  • @MadPhysicist how much could it potentially expand in memory given that it's almost all floats? – cherrytomato967 Dec 21 '20 at 18:16
  • @drew010 I tried that fix previously, but I'm not root on the server so I don't have permissions – cherrytomato967 Dec 21 '20 at 18:19
  • @Manakin the data are almost all floats. I'll look into putting the data into a SQL db or using Pyspark – cherrytomato967 Dec 21 '20 at 18:19
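Following up on the comments: since the data are almost all floats, one hedged workaround is to stream the file in chunks with an explicit dtype, so pandas neither buffers the whole parse at once nor widens columns during type inference. This is only a sketch; the all-float `dtype` and the chunk size are assumptions, not something the question confirms:

import pandas as pd

filename = "data.tsv"  # stand-in for the 6.5 GB file in the question

# Assumption: every column really is numeric; float32 halves the footprint
# of float64. Mixed columns would need a per-column dtype dict instead.
reader = pd.read_csv(filename, sep="\t", dtype="float32", chunksize=1_000_000)
df = pd.concat(reader, ignore_index=True)

print(df.memory_usage(deep=True).sum() / 1e9, "GB held in memory")

If even the chunked concatenate trips the same limit, writing each chunk into SQLite with `DataFrame.to_sql` (the SQL route Umar.H suggests) avoids ever holding the full table in one process.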

0 Answers