Summary
adataframe is a DataFrame with 800k rows. Naturally, it consumes a bit of memory. When I do this:
adataframe = adataframe.tail(144)
memory is not released.
You could argue that the memory is actually released, and that it only appears to be in use because Python marks it as free and will reuse it. However, if I then create a new 800k-row DataFrame and again keep only a small slice of it, memory usage grows. If I do it again, it grows again, ad infinitum.
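To make "appears to be used" concrete, here is a minimal sketch (with a synthetic frame instead of my CSV) showing that the surviving slice itself reports only a few kilobytes:

from resource import getrusage, RUSAGE_SELF

import numpy as np
import pandas as pd

adataframe = pd.DataFrame({'value': np.random.rand(800000)})
adataframe = adataframe.tail(144)
# The retained object is tiny (a few KB)...
print(adataframe.memory_usage(deep=True).sum())
# ...yet the peak resident set size (ru_maxrss, kilobytes on Linux) stays high:
print(getrusage(RUSAGE_SELF).ru_maxrss)

So the question is about what the process holds on to, not about the size of the slice itself.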
I'm using Debian Jessie's Python 3.4.2 with pandas 0.18.1 and NumPy 1.11.1.
Demonstration with minimal program
With the following program I create a dictionary
data = {
    0: a_DataFrame_loaded_from_a_CSV,_only_the_last_144_rows,
    1: same_thing,
    # ...
    9: same_thing,
}
and I monitor memory usage while I'm creating the dictionary. Here it is:
#!/usr/bin/env python3
from resource import getrusage, RUSAGE_SELF

import pandas as pd


def print_memory_usage():
    # ru_maxrss is the peak resident set size, in kilobytes on Linux
    print(getrusage(RUSAGE_SELF).ru_maxrss)


def read_dataframe_from_csv(f):
    result = pd.read_csv(f, parse_dates=[0],
                         names=('date', 'value', 'flags'),
                         usecols=('date', 'value', 'flags'),
                         index_col=0, header=None,
                         converters={'flags': lambda x: x})
    result = result.tail(144)  # keep only the last 144 rows
    return result


print_memory_usage()
data = {}
for i in range(10):
    with open('data.csv') as f:
        data[i] = read_dataframe_from_csv(f)
    print_memory_usage()
Results
If data.csv only contains a few rows (e.g. 144, in which case the slicing is redundant), memory usage grows very slowly. But if data.csv contains 800k rows, the results are similar to these:
52968
153388
178972
199760
225312
244620
263656
288300
309436
330568
349660
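In other words, each iteration after the first retains roughly 20 MB (ru_maxrss is reported in kilobytes on Linux), even though each stored slice is only 144 rows.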
(Adding gc.collect() before print_memory_usage() does not make any significant difference.)
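One variant that seems worth asking about: if tail() returns something that still references the full frame's buffers, forcing an explicit copy of the slice might matter. A sketch of that variant (I don't know whether it is expected to help, which is part of my question):

import pandas as pd


def read_dataframe_from_csv(f):
    result = pd.read_csv(f, parse_dates=[0],
                         names=('date', 'value', 'flags'),
                         usecols=('date', 'value', 'flags'),
                         index_col=0, header=None,
                         converters={'flags': lambda x: x})
    # copy() so the 144-row result owns its own buffers rather than
    # possibly keeping a view of the 800k-row blocks alive
    return result.tail(144).copy()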
What can I do about it?