
I've been trying to learn file I/O in Python, but I've run into what looks like a memory leak that I can't explain.

file = "D:\\babelStorage\\Testing"
x = 1000000
while (x > 0):
    with open("".join([file, "\\", "junk", str(x), ".txt"]), "wt") as trash:
        trash.write("garbage")
    x = x - 1

The same issue seems to occur even when I explicitly use trash.close(). What exactly am I doing wrong that's causing huge chunks of memory to accumulate?

None of the memory shows up as a process in Task Manager. If I run it long enough I can lose 10 GB to... somewhere. Closing the Python shell doesn't recover the memory either; I have to reboot.

Rayalot72
    If the memory doesn't show up in task manager, how are you detecting a memory leak? – afarley Jul 16 '20 at 18:54
  • Memory usage doesn't change at all until I run this sort of file I/O. When I do, it very steadily climbs and doesn't go back down. Started at under 3GB with this machine running overnight not doing anything. Just testing file I/O within the past hour now has it at a constant 20GB. Highest memory usage displayed is Google Chrome at ~1,200 MB. – Rayalot72 Jul 16 '20 at 18:58
  • Oh, I see what you're reading. When I said "none of the memory shows up on task manager" I meant it doesn't show up as a process, it still shows a very high memory usage. Edited to be more clear. – Rayalot72 Jul 16 '20 at 19:02
  • 2
    Probably just the operating system's block cache storing the data you wrote to disk so it can be retrieved from memory if/when something needs it. It's *normal* for the block cache to not be empty -- in fact, it's *better* for it to not be empty: An empty block cache is a block cache that isn't doing anything to make your system faster. As soon as something else has a use for that memory, the OS will free it up. – Charles Duffy Jul 16 '20 at 19:03
  • It's possible that the Python interpreter is deciding not to release that memory, even though it should be no longer necessary after each iteration of the loop. You could try calling gc.collect() to confirm as discussed here: https://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python – afarley Jul 16 '20 at 19:04
  • 1
    @afarley, the OP says the Python interpreter is _exiting_. A process that has exited cannot decide not to release memory (unless it's something exotic like a SHM block, but Python doesn't do anything like that automatically). – Charles Duffy Jul 16 '20 at 19:05
  • @CharlesDuffy good point, my previous suggestion is probably wrong. – afarley Jul 16 '20 at 19:06
  • 1
    @Rayalot72, ...if this _isn't_ a cache in the OS, it's an operating system or driver bug. In no scenario is it Python doing anything wrong; as soon as a process exits, it's the operating system's job to release its memory and other resources, so if the OS fails to do so, it's an OS bug... _or_, as described above (and far, _far_ more likely to be the case), an intentional and desired behavior. – Charles Duffy Jul 16 '20 at 19:08
  • @Rayalot72, ...if you gave us details about how you were measuring "used" memory, we could speak to whether that measure counts data that's storing a transient cache as "used". – Charles Duffy Jul 16 '20 at 19:09
  • @Charles Duffy Not sure how to answer that? I had just seen very high numbers on task manager I couldn't account for and thought it was a memory leak. If that includes caches, then it's probably not something to worry about as you said (although I'll try to run out of memory to make sure). – Rayalot72 Jul 16 '20 at 19:16
  • @afarley Doesn't seem to be holding on to memory, gc.collect() in the shell returned ~400 but memory usage didn't change. Closing and opening the shell didn't change anything. – Rayalot72 Jul 16 '20 at 19:18
  • @Rayalot72 just out of curiosity, does the behaviour still happen if you run it on the same drive that python and the script are stored? I see you using a `D://` drive and since this sounds far more likely to be an OS issue I wonder if it is just "windows sucks" or "your disk driver has an issue." – Tadhg McDonald-Jensen Jul 16 '20 at 19:19
  • I'd probably head over to our sister site [Super User](https://superuser.com/) and review [Why is my memory at 65% usage when I'm not running any programs?](https://superuser.com/questions/1338367) -- the answers go into figuring out if there's _really_ a program using that memory (and exactly what that program is), or if it's just cache. Also, the form in which the question is asked (including details about _where_ in the task manager it's displaying that the memory is used) is a pattern that's worth following; "available" is not an exact inverse of "used", so details matter. – Charles Duffy Jul 16 '20 at 19:38
  • @TadhgMcDonald-Jensen I don't believe so? I've just tried similar file I/O on my C and D drives, and it seems to increase no matter what. – Rayalot72 Jul 17 '20 at 18:53
  • @CharlesDuffy Thanks for the tip, the RAMMAP application was very useful. Seems the memory is being taken up by "mapped files." Is that normal, or something I should fix? – Rayalot72 Jul 17 '20 at 18:56
  • This depends on the details of how Windows works. Over in UNIX land, it's typical for memmapped files to just be a special case in the block cache code (where writes to the memory are allowed but need to be written back to the underlying file). You might ask a question over at Super User to get folks who are subject matter experts on Windows. BTW, if there _is_ an application holding a file descriptor open on a file that was only intended to be opened by an application that already closed, I'd suspect something like an antivirus to be a likely suspect. – Charles Duffy Jul 18 '20 at 18:13

1 Answer


I know this question was asked two years ago, but I wanted to leave an answer for anyone who comes across it later. There are two things worth fixing in this snippet:

1. How your loop is constructed

Strictly speaking, the while loop itself isn't leaking memory: `x = x - 1` does create a new integer object on each iteration, but the old one immediately becomes unreferenced and Python frees it, so the interpreter's memory use stays flat. Still, a for loop over range() is the more idiomatic way to count iterations, and it removes the manual bookkeeping entirely:

for x in range(0, 1000000):
    ...  # do code

The for loop above iterates over range(0, 1000000), which in Python 3 is a lazy sequence object: it stores only its start, stop, and step values and produces each integer on demand, rather than holding all 1,000,000 of them in memory at once. (It behaves much like a generator in this respect, though technically it's its own type, and unlike a generator it can be iterated more than once.) The loop runs your code once for each integer from 0 up to, but not including, 1000000. I highly suggest reading up on range() and lazy iteration in general; they're fundamental to writing memory-friendly Python.
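A quick way to see that range() doesn't materialize its values is to compare its size against an actual list. This is just an illustrative sketch; the exact byte counts will vary by Python version:

```python
import sys

# A range object stores only start, stop, and step, so its size stays
# tiny no matter how many numbers it describes.
lazy = range(1_000_000)

# Even a list of far fewer integers takes much more memory, because the
# list actually holds a reference to every element.
materialized = list(range(1_000))

print(sys.getsizeof(lazy))          # a few dozen bytes
print(sys.getsizeof(materialized))  # thousands of bytes
print(len(lazy))                    # still describes all 1,000,000 values
```

Note that sys.getsizeof() only measures the container object itself, which is exactly why the comparison makes the point: the range object never grows with its length.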

2. What you're doing with each iteration of your loop

In the code you provided, you're telling Python to run this line 1000000 times:

with open("".join([file, "\\", "junk", str(x), ".txt"]), "wt") as trash:
        trash.write("garbage")

This opens a file, writes a single short string to it, and closes it again. Note that because the filename includes str(x), each iteration actually creates a *different* file, so by the end you've created 1,000,000 small files on disk. If creating a million files really is the goal, that's fine. But if all you want is to write "garbage" a million times, you only need to open one file, once: open it before the loop, write inside the loop, and let the with block close it when you're done. That would look something like this:

with open("".join([file, "\\", "junk.txt"]), "wt") as trash:
    for i in range(0, 1000000):
        trash.write("garbage")

This way, you open the file once and close it once, instead of paying the open/close overhead on every iteration. As for the memory itself: as the comments above point out, the growing "mapped files" figure is the operating system's file cache holding on to the data you just wrote, not a leak in your Python code, and the OS will release it as soon as something else needs the memory.
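If the original goal of one file per iteration is what you want, here is a sketch of the same script cleaned up with os.path.join and a for loop. The temporary directory and the count of 100 are stand-ins for the original D:\babelStorage\Testing and 1,000,000:

```python
import os
import tempfile

# Stand-in for the original output directory (D:\babelStorage\Testing).
out_dir = tempfile.mkdtemp()

# One small file per iteration, as in the original script, but with
# os.path.join building the path instead of manual string concatenation.
for x in range(100):
    path = os.path.join(out_dir, f"junk{x}.txt")
    with open(path, "wt") as trash:
        trash.write("garbage")

print(len(os.listdir(out_dir)))  # 100 files created
```

The with block still closes each file as soon as it's written, so Python holds at most one file handle at a time; any memory growth you see afterwards is the OS caching the written blocks, not the interpreter.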