
Running only the following code in a single Jupyter Notebook cell, I load a 1 GB file into memory inside a function and call that function right after its definition:

import pickle

def fun():
    with open('./data/input/train_test_clevr_pkls/train_clevr_1000.pkl', 'rb') as handle:
        return pickle.load(handle)

fun()

The Jupyter Notebook then prints the output of the function. After I run the cell, memory consumption sits at 1 GB, as expected. However, if I run the same cell multiple times, the memory footprint increases by 1 GB each time until my entire RAM is consumed. At that point Windows starts swapping to the page file to back the additional memory, which destroys my application's performance. I already tried calling gc.collect() to free the memory, but to no avail.

I saw similar questions being asked but I found no answer to my problem. The memory is NOT being reused internally, it only grows!

zuijiang
  • What if you do the `gc.collect()` in the same cell as the `fun()` call? – rioV8 May 02 '21 at 12:56
  • If I call `gc.collect()` in the same cell, after the `fun()` call, the memory does not leak and the garbage collector returns 0, meaning it did not collect anything. Likewise, if I wrap the call in something like `print(fun())` or add any instruction after `fun()`, there is no memory leak anymore. However, I still don't understand why it leaks in the first place, and I cannot recover the memory already lost unless I restart the kernel. – Andrei Ionescu May 02 '21 at 13:07

1 Answer


You are seeing this because Jupyter keeps a reference to every cell's result in a dict named Out, so those results are never garbage collected.

MRE (all this in one cell)

from itertools import product

def foo():
    return [*product(range(5), repeat=5)]

[foo() for _ in range(5)]

When you run this in a cell, Jupyter saves the result in the dict Out.

Partial Contents of Out

{1: [[(0, 0, 0, 0, 0),
   (0, 0, 0, 0, 1),
   (0, 0, 0, 0, 2),
   (0, 0, 0, 0, 3),
   (0, 0, 0, 0, 4),
   (0, 0, 0, 1, 0),
   (0, 0, 0, 1, 1),
   (0, 0, 0, 1, 2),
   (0, 0, 0, 1, 3),
   (0, 0, 0, 1, 4),
   (0, 0, 0, 2, 0),
   (0, 0, 0, 2, 1),...}

When you run the code block again, Jupyter stores the new result under a new key in Out (3 in this example; the key is the cell's execution count) and keeps the old entry as well. Every run of the cell therefore adds another key-value pair to the Out dict.
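The effect can be reproduced outside Jupyter with a small simulation. This is only a sketch of the idea, not Jupyter's actual implementation: `out_cache`, `run_cell`, and `Payload` are hypothetical names invented for illustration, and the immediate release after clearing the dict assumes CPython's reference counting.

```python
import gc
import weakref


class Payload:
    """Stand-in for a large result (plain lists cannot be weakly referenced)."""

    def __init__(self, n):
        self.data = list(range(n))


out_cache = {}       # hypothetical stand-in for Jupyter's Out dict
execution_count = 0  # hypothetical stand-in for the cell execution counter


def run_cell(make_result):
    """Mimic running a cell whose last line is a bare expression."""
    global execution_count
    execution_count += 1
    result = make_result()
    if result is not None:
        # The cache now holds a strong reference to the result.
        out_cache[execution_count] = result
        return weakref.ref(result)
    return None


# "Run the cell" three times and keep only weak references to the results.
refs = [run_cell(lambda: Payload(1000)) for _ in range(3)]

gc.collect()
alive_before = sum(r() is not None for r in refs)  # cache keeps them alive

out_cache.clear()  # drop the cached references
gc.collect()
alive_after = sum(r() is not None for r in refs)   # nothing holds them now

print(alive_before, alive_after)
```

As long as the dict holds the results, `gc.collect()` has nothing to collect, which matches the behaviour described in the question.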

Now, the reason it does not happen when you use print(...):

from itertools import product

def foo():
    return [*product(range(5), repeat=5)]

print([foo() for _ in range(5)])

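Jupyter does not save the result to the Out dict in that case, because print(...) returns None and a cell whose last expression evaluates to None has nothing to cache. No matter how many times you run this cell, it adds nothing to Out; if no other cell has produced a result, Out stays {}.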
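If results are already cached, the memory can usually be recovered without restarting the kernel by clearing the output history. The helper below is a hedged sketch: `clear_output_cache` is a hypothetical name, and it assumes the standard IPython `%reset -f out` magic and the `get_ipython()` function that IPython injects into the session. Outside IPython it does nothing and returns False.

```python
import gc


def clear_output_cache():
    """Clear IPython's cached cell outputs; return False outside IPython."""
    try:
        shell = get_ipython()  # provided by IPython/Jupyter, not plain Python
    except NameError:
        return False
    if shell is None:
        return False
    # Equivalent to typing `%reset -f out` in a cell: clears the output history.
    shell.run_line_magic("reset", "-f out")
    gc.collect()
    return True


cleared = clear_output_cache()
print(cleared)
```

To avoid the caching in the first place, assign the result to a variable (`data = fun()`) or, as far as I know, end the last line with a semicolon (`fun();`), which suppresses the output.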

python_user