
While reading an article on memory management in Python, I came across a few doubts:

import copy
from memory_profiler import profile  # makes @profile available even without "python -m memory_profiler"

@profile
def function():
    x = list(range(1000000))  # allocate a big list
    y = copy.deepcopy(x)
    del x
    return y

if __name__ == "__main__":
    function()

$ python -m memory_profiler memory_profiler_demo.py
Filename: memory_profiler_demo.py

Line #    Mem usage    Increment   Line Contents
================================================
     4   30.074 MiB   30.074 MiB   @profile
     5                             def function():
     6   61.441 MiB   31.367 MiB       x = list(range(1000000))  # allocate a big list
     7  111.664 MiB   50.223 MiB       y = copy.deepcopy(x)#doubt 1
     8  103.707 MiB   -7.957 MiB       del x #doubt 2
     9  103.707 MiB    0.000 MiB       return

So I have two doubts: first, on line 7, why does copying the list take more memory than building it did? And second, on line 8, why does del x free only about 7 MiB?

Roushan
    The reason it only frees 2MB is that once you allocate a bunch of memory, Python and your OS and/or malloc library both guess that you're likely to allocate a bunch of memory again. On modern platforms, it's a lot faster to reuse that memory in-process than to release it and reallocate it from scratch, while it costs very little to keep extra unused pages of memory in your process's space, so it's usually the right tradeoff. – abarnert Mar 25 '18 at 19:33
  • Also, what Python version are you using? – abarnert Mar 25 '18 at 19:45
  • @abarnert python 2.7 – Roushan Mar 25 '18 at 19:46
  • OK, I think I can explain most of it now. – abarnert Mar 25 '18 at 19:50
  • pleasure to have your explanation @abarnert – Roushan Mar 25 '18 at 19:51
  • Another question about the same article: [Python: Cannot replicate a test on memory usage](//stackoverflow.com/q/51030849) – Martijn Pieters Jun 27 '18 at 16:09

1 Answer


First, let's start with why line 8 only frees 7MiB.

Once you allocate a bunch of memory, Python and your OS and/or malloc library both guess that you're likely to allocate a bunch of memory again. On modern platforms, it's a lot faster to reuse that memory in-process than to release it and reallocate it from scratch, while it costs very little to keep extra unused pages of memory in your process's space, so it's usually the right tradeoff. (But of course usually != always, and the blog you linked seems to be in large part about how to work out that you're building an application where it's not the right tradeoff and what to do about it.)

A default build of CPython on Linux virtually never releases any memory. On other POSIX (including Mac) it almost never releases any memory. On Windows, it does release memory more often—but there are still constraints. Basically, if a single allocation from Windows has any piece in use (or even in the middle of a freelist chain), that allocation can't be returned to Windows. So, if you're fragmenting memory (which you usually are), that memory can't be freed. The blog post you linked to explains this to some extent, and there are much better resources than an SO answer to explain further.

If you really do need to allocate a lot of memory for a short time, release it, and never use it again, without holding onto all those pages, there's a common Unix idiom for that—you fork, then do the short-term allocation in the child and exit after passing back the small results in some way. (In Python, that usually means using multiprocessing.Process instead of os.fork directly.)
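
For example, here's a minimal sketch of that idiom using multiprocessing (the function and queue names are just illustrative, not anything from the article):

import multiprocessing as mp

def crunch(queue):
    big = list(range(10**6))   # the large, short-lived allocation lives only in the child
    queue.put(sum(big))        # pass back only the small result

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=crunch, args=(q,))
    p.start()
    result = q.get()           # read the result before joining
    p.join()                   # when the child exits, the OS reclaims all of its pages
    print(result)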


Now, why does your deepcopy take more memory than the initial construction?

I tested your code on my Mac laptop with python.org builds of 2.7, 3.5, and 3.6. What I found was that the list construction takes around 38MiB (similar to what you're seeing), while the copy takes 42MiB on 2.7, 31MiB on 3.5, and 7MB on 3.6.

Slightly oversimplified, here's the 2.7 behavior: The functions in copy just call the type's constructor on an iterable of the elements (for copy) or recursive copies of them (for deepcopy). For list, this means creating a list with a small starting capacity and then expanding it as it appends. That means you're not just creating a 1M-length array, you're also creating and throwing away arrays of 500K, 250K, etc. all the way down. The sum of all those lengths is equivalent to a 2M-length array. Of course you don't really need the sum of all of them—only the most recent array and the new one are ever live at the same time—but there's no guarantee the old arrays will be freed in a useful way that lets them get reused. (That might explain why I'm seeing about 1.5x the original construction while you're seeing about 2x, but I'd need a lot more investigation to bet anything on that part…)
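
If you want to watch those intermediate reallocations yourself, a quick sketch is to append to a list and print whenever sys.getsizeof changes (note that CPython's real over-allocation factor is smaller than the doubling used in the simplification above, but the repeated-reallocation pattern is the same):

import sys

lst = []
last_size = 0
for i in range(100):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:      # a size change means the backing array was reallocated
        print("len=%d size=%d bytes" % (len(lst), size))
        last_size = size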

In 3.5, I believe the biggest difference is that a number of improvements over the 5 years since 2.7 mean that most of those expansions now get done by realloc, even if there is free memory in the pool that could be used instead. That changes a tradeoff that favored 32-bit over 64-bit on modern platforms into one that works the other way round—in 64-bit linux/Mac/Windows: there are often going to be free pages that can be tossed onto the end of an existing large alloc without remapping its address, so most of those reallocs mean no waste.

In 3.6, the huge change is probably #26167. Oversimplifying again, the list type knows how to copy itself by allocating all in one go, and the copy methods now take advantage of that for list and a few other builtin types. Sometimes there's no reallocation at all, and even when there is, it's usually with the special-purpose LIST_APPEND code (which can be used when you can assume nobody outside the current function has access to the list yet) instead of the general-purpose list.append code (which can't).
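
As a rough illustration of why "allocating all in one go" matters, here's a sketch (Python 3 only, since it uses the standard-library tracemalloc module) comparing the peak traced memory of a one-shot copy against an append-by-append copy. The exact numbers will vary by platform and version, and neither function is what the copy module actually calls:

import tracemalloc

def one_shot_copy(src):
    return src[:]              # the list knows its final length: one allocation

def incremental_copy(src):
    out = []
    for item in src:
        out.append(item)       # grows via repeated over-allocating reallocs
    return out

src = list(range(10**6))
for fn in (one_shot_copy, incremental_copy):
    tracemalloc.start()
    fn(src)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print("%s: peak ~ %.1f MiB" % (fn.__name__, peak / 2.0**20))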

abarnert
  • I also tested with Python 3.6 on Linux (7MB on 3.6). Why is it so efficient there? – Roushan Mar 26 '18 at 06:15
  • @Roushan45: That's explained in the answer: in 3.6, the `copy` functions special-case the optimizations that `list`, `dict`, etc. provide. I linked to the issue # in b.p.o, and you should be able to find deeper discussion and the github change to CPython source from there. – abarnert Mar 26 '18 at 06:18
  • The 'huge change' is not #26167 (which concentrated on *shallow* copies, not `copy.deepcopy`, where the patch only optimised the `list` copy loop by using local names). The memory manager is not asking for as much additional heap memory, so the process doesn't jump in size for overallocated memory. The `psutil` approach of `memory_profiler` is hugely inaccurate; in 3.x use the `--backend tracemalloc` backend to get much more accurate data. The article the OP links to has some major issues, see [Python: Cannot replicate a test on memory usage](//stackoverflow.com/q/51030849). – Martijn Pieters Jun 27 '18 at 16:13
  • @MartijnPieters I thought the big memory manager change was in 3.5? Which change are you referring to that makes 3.6 not ask for as much additional heap memory as 3.5 for the same code? – abarnert Jun 27 '18 at 16:18
  • @MartijnPieters Meanwhile, agreed that `psutil` is near useless for measuring memory use like this, but I could spend a whole answer on why that is for each of the three major platforms without ever getting to a specific explanation of what the OP is actually asking about. Suggesting tracemalloc, on the other hand, probably is worth doing; I’ll edit when I get a chance. – abarnert Jun 27 '18 at 16:21
  • @abarnert: It's the result of incremental changes in the 3.x line, more likely, including such things as the memory-improvements for TimSort, key-sharing dictionaries, new string memory model, the improvements to marshall to keep shared references intact, PEP 445, bytes objects with only `0` values are smaller, the random is less memory hungry, the new compact dictionary implementation (the one that gives dicts insertion order now), etc. etc. etc. – Martijn Pieters Jun 27 '18 at 16:29
  • @MartijnPieters But many of those changes are in 3.3-3.5, so they would be reflected in the 3.5 behavior. Some of them might be more important than using realloc and 64-bit thinking (not trying to greedily use up pages), which was my guess in the answer at the biggest single cause for the difference between 2.7 and 3.5—but they’re not going to explain the difference between 3.5 and 3.6. Also, the OP’s test case doesn’t even touch many of the affected features—e.g., there are no large dicts or sorts within `memory_profiler` and none at all within his script. – abarnert Jun 27 '18 at 16:35
  • When you use tracemalloc (you'll need to create a patched 2.7 build) then no, there are *almost no changes in memory use* (there is in the pickle test, not 100% sure yet why 2.7 uses more memory, probably the protocol level). The differences is entirely coming from over-allocation elsewhere. However, I haven't run the tests with 3.5 yet (I just found my 3.5 buildout python to be broken, have to rebuild). – Martijn Pieters Jun 27 '18 at 16:40
  • And I don't see any differences between 3.5 and 3.6, not on OS X at any rate. The `memory_profiler` tests produce the same outputs, more or less, both with the default psutil and the tracemalloc backends. Only 2.7 is the outlier here (where the process grows by 5 MiB when the `deepcopy()` line is hit). – Martijn Pieters Jun 27 '18 at 16:56
  • @MartijnPieters The fact that you didn't see any difference between 3.5 and 3.6 on one machine when I did, and also didn't see any jump when both the OP and I did, is like saying "I ran your code that crashes 50% of the time, and it didn't crash, so there's no bug." Unless you're suggesting that both he and I imagined the output (and that we're all imagining the copy-paste in the question), there is a real result here that you just didn't reproduce. – abarnert Jun 27 '18 at 17:13
  • @MartijnPieters Of course most likely it's a meaningless result (which is what the first half of my answer tries to explain, but I'm open to suggestions to improve it), but that's a different story. – abarnert Jun 27 '18 at 17:13
  • @abarnert: I was mostly letting you know that the article the OP points to is drawing the wrong conclusions, and I had missed your specific 3.5 and 3.6 results in your post. As you [point out elsewhere](https://stackoverflow.com/a/15508271/100297), we'd have to take the platform malloc into account if we want to fully understand what is going on. I mostly use tracemalloc as it gives much more useful and accurate information, anyway. I'll make sure to include platform details in my test info (I used MacOS 10.13). – Martijn Pieters Jun 27 '18 at 17:18
  • @MartijnPieters 99% of all blog posts about memory usage (not just by Python) are either complete garbage, or only make sense for 32-bit platforms (and usually not even in general, just 32-bit Linux <2.2, or Windows on a machine with under 1GB RAM, etc.). Maybe my answer should focus more on explaining why these results are meaningless for actually optimizing memory usage and less on explaining what’s going on? But if that’s _all_ that needs to be said, this question is really just a dup of all the similar ones (including the two you found). So: improve this answer, leave it, or close the Q? – abarnert Jun 27 '18 at 17:25
  • I don't really have an opinion on that atm, to be honest. – Martijn Pieters Jun 27 '18 at 22:20