Problem
I am working on a Kaggle kernel, and simply dropping some rows of a Pandas DataFrame doubles RAM usage. I have seen related questions, such as
Memory leak in pandas when dropping dataframe column?
How do I release memory used by a pandas dataframe?
but none of the solutions proposed there worked for me. Is there something more basic that I am missing?
Code
Before the transformation in question, the output of psutil.test() is:
USER PID %MEM VSZ RSS NICE STATUS START TIME CMDLINE
root 1 0.0 11.4M 3.0M sleep 13:25 00:00 /bin/bash -c
root 10 0.7 778.8M 125.0M sleep 13:25 00:05 /opt/conda/bi
root 45 8.5 2.9G 1.5G runni 13:26 00:39 /opt/conda/bi
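(These listings come from calling psutil.test() directly in a notebook cell; a minimal sketch, assuming psutil is available in the kernel image:)
import psutil
psutil.test()   # prints a ps-aux-style snapshot of the processes in the container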
After running the very simple cell
cond = (some condition)
bad_rows = np.where(~cond)[0]
X_train = X_train.drop(index=bad_rows)
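(For what it's worth, the jump can also be read from inside the process rather than from psutil.test(); a minimal sketch, assuming cond and X_train are defined as in the cell above and using the standard psutil.Process API:)
import os
import numpy as np
import psutil

proc = psutil.Process(os.getpid())                    # the kernel's own process
print("RSS before: %.2f GB" % (proc.memory_info().rss / 1024 ** 3))

bad_rows = np.where(~cond)[0]                         # cond as defined above
X_train = X_train.drop(index=bad_rows)

print("RSS after:  %.2f GB" % (proc.memory_info().rss / 1024 ** 3))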
I rerun the test and obtain:
USER PID %MEM VSZ RSS NICE STATUS START TIME CMDLINE
root 1 0.0 11.4M 2.9M sleep 13:36 00:01 /bin/bash -c
root 10 0.7 778.8M 125.0M runni 13:36 00:06 /opt/conda/bi
root 45 18.2 4.6G 3.2G runni 13:36 00:48 /opt/conda/bi
Since there are relatively few rows in bad_rows, the memory usage of the DataFrame itself does not change (1.1 GB), but the system RAM usage doubles!
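(The DataFrame's own footprint, the 1.1 GB above, can be checked with pandas' built-in accounting; a minimal sketch, using memory_usage with deep=True so that object columns are counted as well:)
df_bytes = X_train.memory_usage(deep=True).sum()   # bytes per column, summed
print("X_train: %.2f GB" % (df_bytes / 1024 ** 3))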
Attempts
I tried dropping in place and running garbage collection, to no avail. I also tried assigning the result to a different variable, like so:
X_train2 = X_train.drop(index=bad_rows)
followed by del X_train (as well as X_train = [] when del alone did not work), but nothing seems to free up space. Incidentally, running this last snippet produces another effect that is weird to me at least: the reported sizes are
X_train: 1.7 GB
X_train2: 1.1 GB
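For completeness, the in-place drop and garbage-collection attempt mentioned above looks roughly like this (a sketch; gc is just the standard library module, and it did not free the memory for me):
import gc

X_train.drop(index=bad_rows, inplace=True)   # drop in place instead of rebinding the name
gc.collect()                                 # force a collection afterwards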
Any ideas what might be causing this?