6

So I have been trying to read a 3.2 GB file into memory using pandas' read_csv function, but I kept running into what looked like a memory leak: my memory usage would spike to 90%+.

So, as alternatives, I tried the following:

  1. I tried defining dtype to avoid keeping the data in memory as strings, but saw similar behaviour.

  2. Tried out numpy's genfromtxt to read the csv, thinking I would get different results, but was definitely wrong about that.

  3. Tried reading the file line by line; ran into the same problem, just really slowly (a rough sketch of what I mean is just below this list).

  4. I recently moved to Python 3, so I thought there could be some bug there, but saw similar results on Python 2 + pandas.
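
Roughly what attempt 3 looked like (a sketch from memory, not the exact script; the growing rows list is what keeps the memory climbing):

import csv

rows = []
with open('data/train.csv') as f:
    reader = csv.reader(f)
    next(reader)                # skip the header row
    for line in reader:
        rows.append(line)       # every parsed row stays in memory as a list of strings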

The file in question is the train.csv file from the Kaggle Grupo Bimbo competition.

System info:

RAM: 16 GB, Processor: i7 (8 cores)

Let me know if you would like to know anything else.

Thanks :)

EDIT 1: it's a memory spike, not a leak (sorry, my bad).

EDIT 2: Sample of the csv file

Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3

EDIT 3: number of rows in the file: 74,180,465

Other than a simple pd.read_csv('filename', low_memory=False), I have tried:

from numpy import genfromtxt
# Note: without an explicit dtype, genfromtxt parses every field as float64.
my_data = genfromtxt('data/train.csv', delimiter=',')

UPDATE: The code below just worked, but I still want to get to the bottom of this problem; there must be something wrong.

import pandas as pd
import gc

data = pd.DataFrame()
data_iterator = pd.read_csv('data/train.csv', chunksize=100000)
for sub_data in data_iterator:
    # DataFrame.append returns a new frame and the result is not assigned,
    # so `data` stays empty and each chunk can be freed after its iteration.
    data.append(sub_data)
    gc.collect()
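
DataFrame.append returns a new frame rather than growing data in place, so if the goal is to actually keep everything, the usual pattern (a sketch, not the code I ran at the time) is to collect the chunks in a list and concatenate once at the end:

import pandas as pd

chunks = []
for sub_data in pd.read_csv('data/train.csv', chunksize=100000):
    chunks.append(sub_data)                    # keep each chunk
data = pd.concat(chunks, ignore_index=True)    # single concatenation at the end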

[Memory usage chart: a gradual ramp while the file is being read]

[Memory usage chart: the spike in memory usage]

EDIT: Piece of code that worked. Thanks for all the help guys, I had messed up my dtypes by passing Python types instead of numpy ones. Once I fixed that, the code below worked like a charm.

import numpy as np
import pandas as pd

# Column widths picked from the sample rows above; int8 only holds -128..127,
# so the ID columns (e.g. Agencia_ID = 1110, Cliente_ID = 15766) need wider
# types. Widen further if the full file contains larger values.
dtypes = {'Semana': np.int8,
          'Agencia_ID': np.int32,
          'Canal_ID': np.int8,
          'Ruta_SAK': np.int32,
          'Cliente_ID': np.int32,
          'Producto_ID': np.int32,
          'Venta_uni_hoy': np.int8,
          'Venta_hoy': np.float16,
          'Dev_uni_proxima': np.int8,
          'Dev_proxima': np.float16,
          'Demanda_uni_equil': np.int8}
data = pd.read_csv('data/train.csv', dtype=dtypes)

This brought the memory consumption down to just under 4 GB.
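
To double-check the footprint of the frame loaded above (illustrative, not part of the original run):

print(data.memory_usage(deep=True))                       # bytes per column
print(data.memory_usage(deep=True).sum() / 2**30, 'GiB')  # total in-memory size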

heaven00
  • 7
    That doesn't sound like a memory leak. That sounds like you're trying to read a huge file into memory all at once and incurring a huge amount of memory consumption to do so. (It's completely normal for the in-memory representation to be larger than the serialized form.) – user2357112 Jul 20 '16 at 17:46
  • 1
    memory leak = memory that is allocated but never freed. This is not the case here. A spike doesn't mean a memory leak at all. – limbo Jul 20 '16 at 17:48
  • @limbo thanks for the correction, but 10Gb+ usage for a 3.2Gb file? The numbers really feel wrong. – heaven00 Jul 20 '16 at 17:52
  • Can you write a sample line of the data? – chapelo Jul 20 '16 at 17:54
  • 1
    @heaven00 no worries, we are all here to learn. I am a bit puzzled by that as well. Can you show us the code you use? – limbo Jul 20 '16 at 17:55
  • @chapelo added the sample of the csv file – heaven00 Jul 20 '16 at 17:58
  • When you said "spike", I was expecting an actual spike. This behavior (that's 1 min on the horizontal axis, I think) looks like your machine is just slowly running out of memory as you read in a large amount of data. – Richard Jul 20 '16 at 18:01
  • How many lines long is the file? Have you estimated how much memory you expect it to take? – Richard Jul 20 '16 at 18:02
  • @Richard I have other pictures, I just took the one where it was gradually increasing while I was reading line by line – heaven00 Jul 20 '16 at 18:03
  • @limbo updated the question – heaven00 Jul 20 '16 at 18:03
  • @Richard added another picture with the spike – heaven00 Jul 20 '16 at 18:05
  • @heaven00, thanks. Could you post how many lines are in the file as well? – Richard Jul 20 '16 at 18:05
  • 3x size from serialized file to internal representation doesn't sound that unreasonable to me... The real issue is that you're holding the entire file in memory.. I would recommend using the csv library and writing your output to file as you go instead of saving the whole thing in memory – Aaron Jul 20 '16 at 18:13
  • @heaven00: In the second picture, at what point do you begin reading the file? Is it a slow ramp followed by that spike, or just the spike? – Richard Jul 20 '16 at 18:16
  • @Aaron is 3x normal for pandas? I want to load it in memory for data manipulations and analysis – heaven00 Jul 20 '16 at 18:18
  • @heaven00: Did you see [this answer](http://stackoverflow.com/a/27232309/752843)? – Richard Jul 20 '16 at 18:18
  • @Richard the read function was running until the point of the fall – heaven00 Jul 20 '16 at 18:25
  • @Richard yes I tried with defining dtypes but still had the same issue – heaven00 Jul 20 '16 at 18:26
  • 3x feels right: one for the initial pd.DataFrame(), one for the DataFrame that read_csv creates, and one for append, which I think creates a copy. So, 3x feels right. – Merlin Jul 20 '16 at 18:43
  • @Merlin I wasn't using append, which does create a copy initially – heaven00 Jul 20 '16 at 18:49
  • My gut feeling is that you are trying to analyze the data with the wrong tools. csv raw data isn't the best way to handle that amount of data, you should definitely put that in a database system and analyze or at least pre-filter there or you should read and analyze by smaller chunks. – caiohamamura Jul 20 '16 at 19:01

2 Answers

2

A file stored in memory as text is not as compact as a compressed binary format, but it is relatively compact data-wise. If it's a simple ASCII file, then aside from any file header information each character is only 1 byte. Python strings have a similar relation: there's some overhead for internal Python bookkeeping, but each extra character adds only 1 byte (from testing with __sizeof__). Once you start converting to numeric types and collections (lists, arrays, data frames, etc.), the overhead grows. A list, for example, must store a type and a value for each position, whereas a string only stores a value.

>>> s = '3,1110,7,3301,15766,1212,3,25.14,0,0.0,3\r\n'
>>> l = [3,1110,7,3301,15766,1212,3,25.14,0,0.0,3]
>>> s.__sizeof__()
75
>>> l.__sizeof__()
128

A little bit of testing (assuming __sizeof__ is accurate):

import numpy as np
import pandas as pd

s = '1,2,3,4,5,6,7,8,9,10'
print ('string: '+str(s.__sizeof__())+'\n')
l = [1,2,3,4,5,6,7,8,9,10]
print ('list: '+str(l.__sizeof__())+'\n')
a = np.array([1,2,3,4,5,6,7,8,9,10])
print ('array: '+str(a.__sizeof__())+'\n')
b = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.dtype('u1'))
print ('byte array: '+str(b.__sizeof__())+'\n')
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
print ('dataframe: '+str(df.__sizeof__())+'\n')

returns:

string: 53

list: 120

array: 136

byte array: 106

dataframe: 152
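
Scaling this up gives a rough feel for the question's numbers (a back-of-envelope check, using the row count from EDIT 3 in the question):

rows = 74180465
print(rows * 11 * 8 / 2**30)             # 11 columns of int64/float64 (pandas default): ~6 GiB
print(rows * (5*1 + 4*4 + 2*2) / 2**30)  # int8/int32/float16 mix from the question's final edit: ~1.7 GiB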
Aaron
  • p = pd.DataFrame(l).__sizeof__() -- 160 – Merlin Jul 20 '16 at 18:45
  • wouldn't pandas `dtype` help in this matter? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html – heaven00 Jul 20 '16 at 18:46
  • possibly... I'm not a big user of pandas, although I'm aware that it plays nice with numpy under the hood – Aaron Jul 20 '16 at 18:54
  • @Aaron thanks, you guys were spot on. I was passing the wrong dtypes: I used Python types, which were causing issues. Changed them to numpy dtypes and it worked like a charm, with total usage around 4Gb, almost the same as the file size – heaven00 Jul 20 '16 at 18:59
  • 1
    @Richard thanks to you too, your link had a similar answer; I just messed up :) – heaven00 Jul 20 '16 at 19:00
  • I did a bit more testing, and `pd.DataFrame(b)` is the smallest yet!! :) – Aaron Jul 20 '16 at 19:02
1

Based on your second chart, it looks as though there's a brief period in time where your machine allocates an additional 4.368 GB of memory, which is approximately the size of your 3.2 GB dataset (assuming 1GB overhead, which might be a stretch).

I tried to track down a place where this could happen and haven't been super successful. Perhaps you can find it, though, if you're motivated. Here's the path I took:

This line (in pandas' io/parsers.py) reads:

def read(self, nrows=None):
    if nrows is not None:
        if self.options.get('skip_footer'):
            raise ValueError('skip_footer not supported for iteration')

    ret = self._engine.read(nrows)

Here, _engine references PythonParser.

That, in turn, calls _get_lines().

That makes calls to a data source.

Which looks like it reads the data in as strings from something relatively standard (see here), like TextIOWrapper.

So things are getting read in as standard text and then converted; this explains the slow ramp.

What about the spike? I think that's explained by these lines:

ret = self._engine.read(nrows)

if self.options.get('as_recarray'):
    return ret

# May alter columns / col_dict
index, columns, col_dict = self._create_index(ret)

df = DataFrame(col_dict, columns=columns, index=index)

ret becomes all the components of a data frame.

self._create_index() breaks ret apart into these components:

def _create_index(self, ret):
    index, columns, col_dict = ret
    return index, columns, col_dict

So far, everything can be done by reference, and the call to DataFrame() continues that trend (see here).

So, if my theory is correct, DataFrame() is either copying the data somewhere, or _engine.read() is doing so somewhere along the path I've identified.
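
One quick way to probe that last step (a small check, independent of the exact pandas version): build a DataFrame from an existing numpy array and ask numpy whether the two still share a buffer.

import numpy as np
import pandas as pd

col = np.arange(1000000, dtype=np.int64)
df = pd.DataFrame({'x': col})

# False means the constructor copied the data into its own block;
# True would mean the original buffer is reused by reference.
print(np.shares_memory(col, df['x'].values))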

Richard
  • 1
    Wow, thanks for the detailed answer. I think this was mostly happening because my dtypes were defined wrong and it was trying to find the exact dtypes for the data, but this is still an assumption so far. – heaven00 Jul 20 '16 at 19:04