I have several large files in a folder. Each individual file fits in my RAM, but all of them together do not. I process each file with the following loop:

import numpy as np

data_list = []
for dataset_index, path in enumerate(file_paths_train):
    with np.load(path) as dataset:
        x_batch = dataset['data']
        y_batch = dataset['labels']

        for i in range(x_batch.shape[0]):
            if y_batch[i] in INDICES:
                # This selects a minimal subset of the data
                data_list.append((y_batch[i], x_batch[i]))

(the paths to all the files are stored in the variable `file_paths_train`)

This answer stated that using `with ... as ...` would automatically delete the variable associated with the file once execution leaves the scope. Except it doesn't: memory usage increases until the computer stops working and I need to restart.

Ideas?

Bersan
  • The idea is that you have not shown us the complete code and you're filling a list or some other object with data from the file. `with` will only take care of that exact object (`dataset`), not all data that was created/copied from that object. – Thomas Weller Sep 16 '21 at 12:01
  • `with ...` doesn't _delete_ the file, just the file handle is automatically closed. – Maurice Meyer Sep 16 '21 at 12:02
  • @ThomasWeller Yes, I am storing part of the data, but it's minimal compared to the size of the files. I debugged and the code where data is being captured makes no difference in memory use. I noticed that the lines that actually use memory are `x_batch = dataset['data']` and `y_batch = dataset['labels']`. Still, this makes no sense to me; these variables should be freed automatically, no? – Bersan Sep 16 '21 at 12:03
  • @MauriceMeyer You're right, I corrected it on the question – Bersan Sep 16 '21 at 12:05
  • If you are keeping any indirect references to the `dataset` object inside the loop, the garbage collector will have to keep all the `dataset` objects around. So the code in the inner loop is relevant. Please provide a minimal reproducible example. – Håken Lid Sep 16 '21 at 12:06
  • @DarkKnight Shouldn't `x_batch` and `y_batch` be freed automatically on each iteration? Why would they be kept after the scope ends? – Bersan Sep 16 '21 at 12:07
  • @Bersan We cannot answer that because we don't know what happens in the inner loop. – Håken Lid Sep 16 '21 at 12:08
  • I *think* that when they go out of scope they become eligible for garbage collection but the memory occupied isn't necessarily released immediately. But I'm not certain. It definitely won't do any harm to try –  Sep 16 '21 at 12:09
  • Try saving copies of the dataset selections: `data_list.append((y_batch[i].copy(), x_batch[i].copy()))` – hpaulj Sep 16 '21 at 12:42
  • Just because Python reclaims memory for its *own* use doesn't mean it necessarily can or will reuse it in favor of requesting more memory from the OS. Particularly, most of the memory used isn't by the objects referenced by `x_batch` and `y_batch` but by the objects that *those* objects reference, some of which you keep references to in `data_list`. As a result, memory becomes fragmented and the memory subsystem may request large contiguous blocks from the OS rather than trying to track multiple smaller blocks it already has access to. – chepner Sep 16 '21 at 12:45
  • `x[i]` of a multidimensional array is a `view` of that array. The whole `x` remains in memory even if just this view is saved in a list (by reference). – hpaulj Sep 16 '21 at 14:54
  • @hpaulj yeah, didn't realize that – Bersan Sep 19 '21 at 09:52

1 Answer

Indexing a multidimensional array with a scalar creates a view. If that view is saved in a list, the whole original array remains in memory, regardless of what happens to the variables that referenced it.

In [95]: alist = []
    ...: for i in range(3):
    ...:     x = np.ones((10,10),int)*i
    ...:     alist.append(x[0])
    ...: 
In [96]: alist
Out[96]: 
[array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])]
In [97]: [item.base for item in alist]
Out[97]: 
[array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 array([[2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]])]

You have to append a copy if you want to truly 'throw away' the original array.

In [98]: alist = []
    ...: for i in range(3):
    ...:     x = np.ones((10,10),int)*i
    ...:     alist.append(x[0].copy())
    ...: 
    ...: 
In [99]: alist
Out[99]: 
[array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])]
In [100]: [item.base for item in alist]
Out[100]: [None, None, None]
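
Applied to the loop from the question, this means copying each selected row before appending it, as suggested in the comments. A minimal sketch, reusing the question's names (`file_paths_train`, `INDICES` and `data_list` are assumed to be defined as there):

import numpy as np

data_list = []
for dataset_index, path in enumerate(file_paths_train):
    with np.load(path) as dataset:
        x_batch = dataset['data']
        y_batch = dataset['labels']

        for i in range(x_batch.shape[0]):
            if y_batch[i] in INDICES:
                # .copy() gives each selected row its own buffer, so the
                # large arrays loaded from this file are no longer kept
                # alive as view bases and can be freed on the next iteration
                data_list.append((y_batch[i].copy(), x_batch[i].copy()))

Each stored copy has `base` set to `None`, so only the selected subset survives each iteration rather than every full file.
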
hpaulj