
I am having trouble understanding why pandas DataFrames are not cleared from memory properly. I discovered this after my machine reached 16 GB of memory usage when it should have stayed around 400 MB. I create a DataFrame and then create a copy of it inside the same function. This function is evaluated many times, and each time it is evaluated the memory usage increases (337 MiB in the example below):

import pandas as pd
import numpy as np
from memory_profiler import profile

@profile
def loop_df():
    # Call copy_df repeatedly; memory usage keeps growing even though
    # nothing from copy_df is kept alive
    for _ in range(100):
        copy_df()

# Create a DataFrame and then copy a slice of it
def copy_df():
    X = pd.DataFrame(np.random.rand(100000, 10))
    X2 = X.loc[0:1000, :]
    return

loop_df()

# Returns the following memory usage:

#Line #    Mem usage    Increment   Line Contents
#================================================
#    13    100.3 MiB      0.0 MiB   @profile
#    14                             def loop_df():
#    15    437.8 MiB    337.5 MiB       for _ in xrange(100):
#    16    437.8 MiB      0.0 MiB           copy_df()

There are various threads that touch on this, but none offers a decent solution: Memory leak using pandas dataframe, https://github.com/pandas-dev/pandas/issues/6046, https://github.com/pandas-dev/pandas/issues/2659, Pandas: where's the memory leak here?

Any advice on how to avoid this is welcome. So far, using the garbage collector works with the simple example but fails in my more complex code. Using a multiprocessing pool (sketched below) also works with my complex code, but it would be good to have a solution that does not require the multiprocessing model.
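For reference, a minimal sketch of the multiprocessing-pool approach mentioned above. The helper name run_isolated and the exact pool settings are my own illustration, and it assumes copy_df is defined at module level as in the example; the idea is that each call runs in a short-lived worker process, so whatever it allocates is returned to the OS when that process exits:

import multiprocessing as mp

def run_isolated(func, *args):
    # maxtasksperchild=1 gives every task a fresh worker process, so the
    # memory it allocates is released when that process exits
    with mp.Pool(processes=1, maxtasksperchild=1) as pool:
        return pool.apply(func, args)

if __name__ == '__main__':
    for _ in range(100):
        run_isolated(copy_df)  # copy_df as defined in the example above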

Can anyone explain why this is happening when Python objects such as NumPy arrays and lists do not result in this behavior? Is this a bug or the intended behavior of DataFrame objects?
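For comparison, a sketch of the same loop using bare NumPy arrays (the function names here are illustrative), which is the baseline that, as noted above, does not show the same growth under memory_profiler:

import numpy as np
from memory_profiler import profile

def copy_array():
    # Same allocation pattern as copy_df, but with plain ndarrays
    X = np.random.rand(100000, 10)
    X2 = X[0:1000, :].copy()
    return

@profile
def loop_array():
    for _ in range(100):
        copy_array()

loop_array()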


1 Answer


Using del followed by gc.collect() seems to do the trick:

import pandas as pd
import numpy as np
import gc
from memory_profiler import profile

@profile
def loop_df():
    for _ in range(100):
        copy_df()

# Create a DataFrame, copy a slice of it, then explicitly delete both and collect
@profile
def copy_df():
    X = pd.DataFrame(np.random.rand(100000, 10))
    X2 = X.loc[0:1000, :]
    del X, X2
    gc.collect()

loop_df()

If you are still running out of memory after that, here is one possible solution using numpy's memmap (memory-mapped) array:

import pandas as pd
import numpy as np
from memory_profiler import profile
import gc

@profile
def loop_df():
    for _ in range(100):
        copy_df()

# Back the DataFrame with a disk-based memory-mapped array instead of an
# in-memory ndarray, then delete everything and collect
@profile
def copy_df():
    mmap = np.memmap('mymemmap', dtype='float64', mode='w+', shape=(100000, 10))
    mmap[:] = np.random.rand(100000, 10)
    df = pd.DataFrame(mmap)
    df2 = df.loc[0:1000, :]
    del df, df2, mmap
    gc.collect()

if __name__ == '__main__':
    loop_df()

Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.
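As a brief illustration of that point, a sketch that reopens the 'mymemmap' file created above in read-only mode (the dtype and shape must match what was written) and materializes only a small slice:

import numpy as np
import pandas as pd

# Reopen the existing file read-only; only the pages that are actually
# touched get loaded into memory
mmap = np.memmap('mymemmap', dtype='float64', mode='r', shape=(100000, 10))
chunk = pd.DataFrame(mmap[0:1000, :])  # just the first 1000 rows
print(chunk.shape)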

Sorry, I cannot explain why your example code does not free the pandas data on its own. I suspect it has something to do with how NumPy and pandas manage their underlying native arrays.
