
I want to process many mp3 files in a loop in a Jupyter Notebook on Kaggle. However, reading an mp3 file as binary seems to keep the file in memory even after the function has returned and the file is properly closed, so memory usage grows with each file processed. The issue seems to lie in the read() call, since replacing it with a pass does not cause any memory usage growth.

While looping through the mp3 files, memory usage grows by the size of each file processed, which suggests the files are being kept in memory.

How do I read a file without it being kept in memory after the function returns?

def read_mp3_as_bin(fname):
    with open(fname, "rb") as f:
        data = f.read() # when using 'pass' memory usage doesn't grow
    print(f.closed)
    return

for fname in file_names: # file_names are 25K paths to the mp3 files
    read_mp3_as_bin(fname)
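
For completeness: one workaround would be to read each file in fixed-size chunks instead of a single read(), assuming the downstream processing can consume the bytes incrementally (which may not hold for every mp3 library). A rough sketch, with process_chunk as a hypothetical placeholder:

def read_mp3_in_chunks(fname, chunk_size=1024 * 1024):
    # read 1 MB at a time so the whole file never has to sit in memory at once
    with open(fname, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            process_chunk(chunk)  # hypothetical per-chunk processing

This doesn't explain the growth, though, which is why I'm asking about the read() behaviour itself.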

"SOLUTION"

I ran this code locally and saw no memory usage growth at all. It therefore looks like Kaggle handles files differently, as that is the only variable in this test. I will try to find out why this code behaves differently on Kaggle and will post an update when I know more.

Mark wijkhuizen
  • How are you verifying that the file stays in memory? – Paul M. Jun 22 '20 at 09:17
  • Can you tell us how you know the file is still in memory? There is no way for `data` to be kept in memory after the function ends. – Asocia Jun 22 '20 at 09:17
  • I added some extra information: the memory usage growth is equal to the size of the files being processed, which indicates the files are being kept in memory. – Mark wijkhuizen Jun 22 '20 at 09:31

1 Answer


I am pretty sure you are measuring the memory usage incorrectly.

I created 3 dummy files of 50 MB each and ran your code on them, printing the memory usage inside and outside the function on each loop iteration. The result was consistent with the memory being freed once the files are closed.

To measure the memory usage, I used the solution suggested here, and to create the dummy files I simply ran `truncate -s 50M test_1.txt`, as suggested by this blog post.
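
If you don't have the truncate utility available (for example on Windows), a rough Python equivalent is to write the zero bytes yourself; a minimal sketch, producing the same three 50 MB test files used below:

size = 50 * 1024 * 1024  # 50 MB of zero bytes per dummy file
for name in ["test_1.txt", "test_2.txt", "test_3.txt"]:
    with open(name, "wb") as f:
        f.write(b"\0" * size)

(Unlike truncate, this actually writes the bytes instead of creating a sparse file, but for this test that makes no difference.)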

Have a look:

import os
import psutil


def read_mp3_as_bin(fname):
    with open(fname, "rb") as f:
        data = f.read()  # when using 'pass' memory usage doesn't grow
    if data:
        print("read data")

    process = psutil.Process(os.getpid())
    print(f"inside the function, it is using {process.memory_info().rss / 1024 / 1024} MB")  # in Megabytes
    return


file_names = ['test_1.txt', 'test_2.txt', 'test_3.txt']

for fname in file_names:  # the 3 dummy files; in the question these are 25K mp3 paths
    read_mp3_as_bin(fname)
    process = psutil.Process(os.getpid())
    print(f"outside the function, it is using {process.memory_info().rss / 1024 / 1024} MB")  # in Megabytes

output:

read data
inside the function, it is using 61.77734375 MB
outside the function, it is using 11.91015625 MB
read data
inside the function, it is using 61.6640625 MB
outside the function, it is using 11.9140625 MB
read data
inside the function, it is using 61.66796875 MB
outside the function, it is using 11.91796875 MB
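
Keep in mind that RSS only tells you how much memory the operating system has assigned to the process, not which Python objects are still alive; CPython's allocator (and the C library beneath it) may hold on to freed blocks for reuse instead of returning them to the OS right away, which can make RSS look "sticky" even though data has been released. If you want to measure the Python-level allocations directly, tracemalloc from the standard library is an option; a small sketch, reusing read_mp3_as_bin and the dummy files from above:

import tracemalloc

tracemalloc.start()

for fname in ['test_1.txt', 'test_2.txt', 'test_3.txt']:
    read_mp3_as_bin(fname)
    current, peak = tracemalloc.get_traced_memory()
    # `current` should stay small after each call if `data` is really freed;
    # `peak` should be roughly the size of one file
    print(f"current: {current / 1024 / 1024:.2f} MB, peak: {peak / 1024 / 1024:.2f} MB")

tracemalloc.stop()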
jpnadas
  • Interestingly, I tried running my own code with smaller file sizes (5 MB, 10 MB, 20 MB and 30 MB) and sometimes got the odd behaviour of memory not being freed upon `return`. Maybe there is some caching going on that is capped at a small memory usage. That said, it doesn't appear to leak indefinitely: when testing with 5 files, only 2 were not freed. For files of 40 MB and up, the memory is always freed upon `return`. – jpnadas Jun 22 '20 at 10:03
  • Digging deeper, regardless of the file sizes (for the < 40MB files), it only "caches" the second and the third files. – jpnadas Jun 22 '20 at 10:07