Summary
adataframe is a DataFrame with 800k rows. Naturally, it consumes a bit of memory. When I do this:
adataframe = adataframe.tail(144)
memory is not released.
You could argue that the memory is actually released, and that it only appears to be in use because Python marks it as free and will reuse it. However, if I then create a new 800k-row DataFrame and again keep only a small slice of it, memory usage grows. If I do it again, it grows again, ad infinitum.
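To make "appears to be used" concrete, here is a minimal sketch (with a synthetic frame instead of my CSV) showing that the surviving slice itself reports only a few kilobytes:

from resource import getrusage, RUSAGE_SELF

import numpy as np
import pandas as pd

adataframe = pd.DataFrame({'value': np.random.rand(800000)})
adataframe = adataframe.tail(144)
# The retained object is tiny (a few KB)...
print(adataframe.memory_usage(deep=True).sum())
# ...yet the peak resident set size (ru_maxrss, kilobytes on Linux) stays high:
print(getrusage(RUSAGE_SELF).ru_maxrss)

So the question is about what the process holds on to, not about the size of the slice itself.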
I'm using Debian Jessie's Python 3.4.2 with pandas 0.18.1 and NumPy 1.11.1.
Demonstration with minimal program
With the following program I create a dictionary
data = {
    0: a_DataFrame_loaded_from_a_CSV,_only_the_last_144_rows,
    1: same_thing,
    # ...
    9: same_thing,
}
and I monitor memory usage while I'm creating the dictionary. Here it is:
#!/usr/bin/env python3
from resource import getrusage, RUSAGE_SELF

import pandas as pd


def print_memory_usage():
    # ru_maxrss is the peak resident set size, in kilobytes on Linux
    print(getrusage(RUSAGE_SELF).ru_maxrss)


def read_dataframe_from_csv(f):
    result = pd.read_csv(f, parse_dates=[0],
                         names=('date', 'value', 'flags'),
                         usecols=('date', 'value', 'flags'),
                         index_col=0, header=None,
                         converters={'flags': lambda x: x})
    result = result.tail(144)  # keep only the last 144 rows
    return result


print_memory_usage()
data = {}
for i in range(10):
    with open('data.csv') as f:
        data[i] = read_dataframe_from_csv(f)
    print_memory_usage()
Results
If data.csv only contains a few rows (e.g. 144, in which case the slicing is redundant), memory usage grows very slowly. But if data.csv contains 800k rows, the results are similar to these:
52968
153388
178972
199760
225312
244620
263656
288300
309436
330568
349660
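In other words, each iteration after the first retains roughly 20 MB (ru_maxrss is reported in kilobytes on Linux), even though each stored slice is only 144 rows.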
(Adding gc.collect() before print_memory_usage() does not make any significant difference.)
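One variant that seems worth asking about: if tail() returns something that still references the full frame's buffers, forcing an explicit copy of the slice might matter. A sketch of that variant (I don't know whether it is expected to help, which is part of my question):

import pandas as pd


def read_dataframe_from_csv(f):
    result = pd.read_csv(f, parse_dates=[0],
                         names=('date', 'value', 'flags'),
                         usecols=('date', 'value', 'flags'),
                         index_col=0, header=None,
                         converters={'flags': lambda x: x})
    # copy() so the 144-row result owns its own buffers rather than
    # possibly keeping a view of the 800k-row blocks alive
    return result.tail(144).copy()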
What can I do about it?