
Problem

I am working on a Kaggle kernel, and simply dropping some rows of a Pandas DataFrame doubles RAM usage. I have seen related questions, such as

Memory leak in pandas when dropping dataframe column?

How do I release memory used by a pandas dataframe?

but none of the solutions proposed there worked for me. Is there something more basic that I am missing?

Code

Before the transformation in question, the output of psutil.test() is

USER         PID  %MEM     VSZ     RSS  NICE STATUS  START   TIME  CMDLINE
root           1   0.0   11.4M    3.0M        sleep  13:25  00:00  /bin/bash -c 
root          10   0.7  778.8M  125.0M        sleep  13:25  00:05  /opt/conda/bi
root          45   8.5    2.9G    1.5G        runni  13:26  00:39  /opt/conda/bi
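For reference, psutil.test() just prints a ps-style listing of all processes; the RSS figure for the kernel process alone can be read like this (a minimal sketch, not part of the original cell):

import os
import psutil

# Resident set size (RSS) of the current kernel process, in GB
proc = psutil.Process(os.getpid())
print(f"RSS: {proc.memory_info().rss / 1024**3:.2f} GB")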

After running the very simple cell

cond = (some condition)
bad_rows = np.where(~cond)[0]
X_train = X_train.drop(index=bad_rows)

I rerun the test and obtain

USER         PID  %MEM     VSZ     RSS  NICE STATUS  START   TIME  CMDLINE
root           1   0.0   11.4M    2.9M        sleep  13:36  00:01  /bin/bash -c 
root          10   0.7  778.8M  125.0M        runni  13:36  00:06  /opt/conda/bi
root          45  18.2    4.6G    3.2G        runni  13:36  00:48  /opt/conda/bi

Since there are relatively few rows in bad_rows, the memory usage of the DataFrame itself does not change (it stays at 1.1 GB), yet the system RAM usage doubles!
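By "the memory usage of the DataFrame itself" I mean the frame's own accounting, read along these lines (a minimal sketch; deep=True also counts object-dtype payloads such as strings):

# Size of the DataFrame's own data, independent of process RSS
df_gb = X_train.memory_usage(deep=True).sum() / 1024**3
print(f"X_train: {df_gb:.2f} GB")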

Attempts

I tried dropping in place and running garbage collection, to no avail (a consolidated sketch of these attempts appears below). I also tried assigning to a different variable like so:

X_train2 = X_train.drop(index=bad_rows)

followed by del X_train (and by X_train = [] when del alone did not work), but nothing seems to free up the space. Incidentally, running the last code snippet produces another effect that is weird to me, i.e.

X_train: 1.7 GB
X_train2: 1.1 GB
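For completeness, the attempts described above roughly amount to the following sketch (errors="ignore" is added only so the snippet runs end-to-end; it was not part of the original attempts):

import gc

# Attempt 1: drop the rows in place instead of rebinding the name,
# then force a garbage-collection pass
X_train.drop(index=bad_rows, inplace=True)
gc.collect()

# Attempt 2: keep the result under a new name, then delete the old frame
X_train2 = X_train.drop(index=bad_rows, errors="ignore")
del X_train            # also tried X_train = [] when del alone did not help
gc.collect()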

Any ideas what might be causing this?
