6

I have a network application which allocates predictable 65k chunks as part of the IO subsystem. The memory usage is tracked atomically within the system, so I know how much memory I'm actually using. This number can also be checked against malloc_stats()

Result of malloc_stats()

Arena 0:
system bytes     =    1617920
in use bytes     =    1007840
Arena 1:
system bytes     = 2391826432
in use bytes     =  247265696
Arena 2:
system bytes     = 2696175616
in use bytes     =  279997648
Arena 3:
system bytes     =    6180864
in use bytes     =    6113920
Arena 4:
system bytes     =   16199680
in use bytes     =     699552
Arena 5:
system bytes     =   22151168
in use bytes     =     899440
Arena 6:
system bytes     =    8765440
in use bytes     =     910736
Arena 7:
system bytes     =   16445440
in use bytes     =   11785872
Total (incl. mmap):
system bytes     =  935473152
in use bytes     =  619758592
max mmap regions =         32
max mmap bytes   =   72957952

Items to note:

  • The total in use bytes matches my internal counter exactly. However, the application has a RES (from top/htop) of 5.2GB. The allocations are almost always 65k; I don't understand the huge amount of fragmentation/waste I am seeing, even more so when mmap comes into play.
  • total system bytes does not equal the sum of the system bytes in each Arena.
  • I'm on Ubuntu 16.04 using glibc 2.23-0ubuntu3
  • Arena 1 and 2 account for the large RES value the kernel is reporting.
  • Arena 1 and 2 are holding on to 10x the amount of memory that is used.
  • The vast majority of allocations are ALWAYS 65k (an explicit multiple of the page size)

How do I keep malloc from allocating an absurd amount of memory?

I think this version of malloc has a huge bug. Eventually (after an hour) a little more than half of the memory will be released. This isn't a fatal bug but it is definitely a problem.

UPDATE - I added mallinfo and re-ran the test; the app was not processing anything at the time this was captured. No network connections are attached. It is idle.

Arena 2:
system bytes     = 2548473856
in use bytes     =    3088112
Arena 3:
system bytes     = 3288600576
in use bytes     =    6706544
Arena 4:
system bytes     =   16183296
in use bytes     =     914672
Arena 5:
system bytes     =   24027136
in use bytes     =     911760
Arena 6:
system bytes     =   15110144
in use bytes     =     643168
Arena 7:
system bytes     =   16621568
in use bytes     =   11968016
Total (incl. mmap):
system bytes     = 1688858624
in use bytes     =   98154448
max mmap regions =         32
max mmap bytes   =   73338880
arena (total amount of memory allocated other than mmap)                 = 1617780736
ordblks (number of ordinary non-fastbin free blocks)                     =       1854
smblks (number of fastbin free blocks)                                   =         21
hblks (number of blocks currently allocated using mmap)                  =         31
hblkhd (number of bytes in blocks currently allocated using mmap)        =   71077888
usmblks (highwater mark for allocated space)                             =          0
fsmblks (total number of bytes in fastbin free blocks)                   =       1280
uordblks (total number of bytes used by in-use allocations)              =   27076560
fordblks (total number of bytes in free blocks)                          = 1590704176
keepcost (total amount of releaseable free space at the top of the heap) =     439216
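
(The application itself is Java/JNI; the following is just an illustrative, standalone C sketch of the glibc calls used to produce numbers like the ones above. Field names follow the mallinfo(3) man page; note that the classic mallinfo fields are plain int and can wrap on multi-GB heaps.)

```c
/* Illustrative sketch: dump counters like the ones shown above. */
#include <malloc.h>   /* mallinfo, malloc_stats -- glibc-specific */
#include <stdio.h>
#include <stdlib.h>

static void dump_allocator_state(void)
{
    struct mallinfo mi = mallinfo();   /* fields are int and can wrap on multi-GB heaps */

    malloc_stats();                    /* prints the per-arena "system/in use bytes" to stderr */

    printf("arena    (non-mmap bytes from the OS) = %d\n", mi.arena);
    printf("hblkhd   (bytes in mmap'd blocks)     = %d\n", mi.hblkhd);
    printf("uordblks (bytes in in-use chunks)     = %d\n", mi.uordblks);
    printf("fordblks (bytes in free chunks)       = %d\n", mi.fordblks);
    printf("keepcost (trimmable top-of-heap)      = %d\n", mi.keepcost);
}

int main(void)
{
    void *p = malloc(65536);           /* stand-in for one of the 65k IO buffers */
    dump_allocator_state();
    free(p);
    return 0;
}
```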

My hypothesis is as follows: the total system bytes reported by malloc is much less than the sum of the system bytes reported for each arena (1.6GB vs 6.1GB). This could mean that (A) malloc is actually releasing blocks but the arenas aren't, or (B) malloc is not compacting memory allocations at all and is creating a huge amount of fragmentation.

UPDATE Ubuntu released a kernel update which basically fixed everything as described in this post. That said, there is a lot of good information in here on how malloc works with the kernel.

Johnny V
    Just because you release the memory doesn't mean the kernel will unmap the pages from your process. The virtual memory will be marked as free though, and can be reused when and if needed. – Some programmer dude Sep 28 '16 at 16:47
  • Try `pmap -x ` and see whether there are unexpected memory mappings. It also shows you which mappings contribute to RSS. – Maxim Egorushkin Sep 28 '16 at 16:51
  • @JoachimPileborg malloc_stats() is showing the memory as free (the "system bytes" minus the "in use bytes" should be the free memory). You have to look at every Arena and see the difference. Arena 1 and 2 are holding on to 2GB more memory than what is used. – Johnny V Sep 28 '16 at 16:53
  • Well it's actually impossible for us to do anything but guess, since we have no idea what's happening in your code. The only one who has all the information needed to debug this issue is you. First of all try to minimize the code to the barest minimum to cause such a problem, use memory debuggers such as [Valgrind](http://valgrind.org/) but also step through the code with an ordinary debugger. That's all the advice I can give you. – Some programmer dude Sep 28 '16 at 16:58
  • @JoachimPileborg I manually counted all the RES memory from `pmap -x ` and the total kB (at the bottom) says 5466224 and my calculator says 2120980. Something is wrong. – Johnny V Sep 28 '16 at 17:09
  • Sounds strange. Post complete `pmap -x ` output. – Maxim Egorushkin Sep 28 '16 at 17:32
  • Can you perhaps modify your code to use [the `mallinfo` function](http://man7.org/linux/man-pages/man3/mallinfo.3.html) instead? Of special interest would be the `arena`, `uordblks` and `fordblks` members of the `mallinfo` structure. For example, does `uordblks + fordblks` equal `arena`? – Some programmer dude Sep 28 '16 at 17:34
  • "I think this version of malloc has a huge bug." That's not what *usually* happens. – n. m. could be an AI Sep 28 '16 at 17:36
  • @MaximEgorushkin the output is too large for StackOverflow http://pastebin.com/BDyRzi5P – Johnny V Sep 28 '16 at 17:38
  • @JohnnyV Your manual calculations are incorrect, the numbers correctly add up to total. Try `grep -v total ~/Downloads/BDyRzi5P.txt | awk '$4 ~ /[0-9]+/ {n += $4} END {print n}' ` – Maxim Egorushkin Sep 28 '16 at 17:48
  • If all else fails, try `mtrace()` and see if you can spot some unaccounted-for allocations. – n. m. could be an AI Sep 28 '16 at 18:01
  • @JoachimPileborg post was updated; I'm even more confused now. – Johnny V Sep 28 '16 at 18:32
  • The memory allocator is in `glibc` and is standard across all distros. It has a lot of mileage on it, so it is unlikely to be buggy. Your allocations could be fragmenting memory. You could be leaking memory. You could use `mtrace`, write your own `malloc` hook functions, or call wrappers (e.g. change all calls to `malloc` to `mytrace_malloc`--this can be done with CPP macros, so no source edits, or using `LD_PRELOAD`). Consider `mcheck`. java okay [sort of], but JNI? Consider faking I/O so program runs 100x faster to get to error state in a minute instead of an hour. Post your code!? – Craig Estey Sep 28 '16 at 18:32
  • @CraigEstey - that is an understandable position. I just updated the post with the info from `mallinfo` and the numbers don't add up. The per-arena system bytes is much much larger than what malloc thinks is the total amount of allocated memory. I've seen this behavior come and go as Ubuntu is updated. – Johnny V Sep 28 '16 at 18:38
  • As I suspected, it seems that the C library memory allocation system keeps (caches?) memory pages. Exactly what's going on here I don't know since I don't know the internals of the allocator. But it seems the pages are kept internally by the allocator as "bytes in free blocks". It also seems that if and when the memory needs to be reclaimed the allocator releases those pages (going by your description). I'd say it's nothing to worry about, and that it works as expected. – Some programmer dude Sep 28 '16 at 19:13
  • @JoachimPileborg What do you think about the huge difference between the reported total values and the values inside of each arena? It is my understanding that the sum of the per-arena system bytes is supposed to equal the total system bytes. I think that malloc is failing to free memory from the arenas. – Johnny V Sep 28 '16 at 19:16
  • I'm familiar with your issues. I've written allocators from scratch. I've also done what I've suggested to track leaks and frags. If you get variance update-to-update, it might be your program has UB, but you get "lucky" on given revs. Consider writing your own hooks. Have the hooks do `mallinfo` and/or `malloc_stats` [you can divert `stderr` to grab its output] on each call. Does an arena creep up slowly or does one alloc cause a huge jump? The hook could log the `caller` to identify the culprit. – Craig Estey Sep 28 '16 at 19:17
  • Try `malloc_trim(0)` and see whether that releases memory to the OS. – Maxim Egorushkin Sep 28 '16 at 19:25
  • `malloc` will only release memory from the arena if the arena is 100% unused [AFAIK] and only when it feels like it [needs to]. It uses anonymous `mmap` to get a new area from the OS. The reverse is an `munmap`. If an arena is created with (e.g.) 1MB, and only 20 bytes are used, it can't unmap a partial area. – Craig Estey Sep 28 '16 at 19:33
  • Also, IIRC, the kernel does anon mmap by mapping all pages to the R/O zero page. Only when a process writes to a page, generating a protection fault, does the kernel break the many-to-one mapping and assign a unique page [and restarts the write]. See my answer: http://stackoverflow.com/questions/37172740/how-does-mmap-improve-file-reading-speed/37173063#37173063 and also the answers linked there for more details. – Craig Estey Sep 28 '16 at 19:35
  • @MaximEgorushkin @CraigEstey `malloc_trim(0)` actually frees all the memory. This is bizarre. – Johnny V Sep 28 '16 at 19:57
  • I downloaded your test program, but it's missing the `net.emoten.*`. But, just because your program frees all its allocations doesn't help. In java, _any_ variable is heap allocated by the JVM (e.g. `Field field` in `loadUnsafe`) via `malloc`. This is also subject to the whims of the JVM's GC. So, declaring _any_ variable could hold an arena because the GC hasn't collected it. Or, if the GC did, it may keep its own reuse pool (i.e. doesn't call free) and only periodically do `free`. If your test pgm is absolute minimum, then rewrite it in 100% C and see what you get. – Craig Estey Sep 28 '16 at 19:59
  • @CraigEstey Just remove the dep for `net.emoten.*` and the rest should be portable. Otherwise I have to send you the jar. The Oracle JVM does not pool any memory called via Unsafe. It is a raw pass through for malloc and free. – Johnny V Sep 28 '16 at 20:01
  • If `malloc_trim` works, then `malloc` is fine, doing its job, nothing to worry about. Remember when I said "only when it feels like it". Normally true, but that's what `malloc_trim` is for--to tell `malloc` to "feel like it" now! – Craig Estey Sep 28 '16 at 20:02
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/124465/discussion-between-johnny-v-and-craig-estey). – Johnny V Sep 28 '16 at 20:05
  • Ugh, in the test code, where do you imagine jvm is getting memory for the array that you grow and shrink inbetwixt your indirect malloc calls? And what if random returns zero? – kfsone Sep 28 '16 at 21:12
  • **Does your application create many threads?** The thread stacks will contribute to the VM footprint size. By default they are two megabytes each, or something like that. A thousand threads and that's two gigabytes. – Kaz Sep 28 '16 at 21:12
  • @kfsone The jvm is pulling from the eden space of the heap. The entire committed heap was 245MB. You can tell the exact size of the used and unused heap as well as committed memory so the jvm isn't a contributing factor. I'm also using Unsafe for some kind of portability instead of my bypass. Check the chat for the answer. – Johnny V Sep 28 '16 at 21:20
  • @Kaz Only 4 threads were utilizing memory. Check the chat – Johnny V Sep 28 '16 at 21:20
  • Use [valgrind](http://valgrind.org/) to check against memory leaks – Basile Starynkevitch Sep 28 '16 at 23:31

2 Answers

12

The full details can be a bit complex, so I'll try to simplify things as much as I can. Also, this is a rough outline and may be slightly inaccurate in places.


Requesting memory from the kernel

malloc uses either sbrk or anonymous mmap to request a contiguous memory area from the kernel. Each area will be a multiple of the machine's page size, typically 4096 bytes. Such a memory area is called an arena in malloc terminology. More on that below.
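
For illustration (not part of the original answer), this is roughly the kind of request malloc makes when it grows an arena via mmap; the size here is arbitrary:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 64 * 4096;   /* some multiple of the page size, as malloc would request */

    /* Anonymous, private mapping: this is how malloc obtains memory when it
     * does not extend the heap with sbrk. */
    void *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    printf("got %zu bytes at %p (mapped, but not yet resident)\n", len, area);
    munmap(area, len);
    return 0;
}
```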

Any pages so mapped become part of the process's virtual address space. However, even though they have been mapped in, they may not be backed up by a physical RAM page [yet]. They are mapped [many-to-one] to the single "zero" page in R/O mode.

When the process tries to write to such a page, it incurs a protection fault, the kernel breaks the mapping to the zero page, allocates a real physical page, remaps to it, and the process is restarted at the fault point. This time the write succeeds. This is similar to demand paging to/from the paging disk.

In other words, page mapping in a process's virtual address space is different than page residency in a physical RAM page/slot. More on this later.


RSS (resident set size)

RSS is not really a measure of how much memory a process allocates or frees, but how many pages in its virtual address space have a physical page in RAM at the present time.

If the system has a paging disk of 128GB, but only had (e.g.) 4GB of RAM, a process RSS could never exceed 4GB. The process's RSS goes up/down based upon paging in or paging out pages in its virtual address space.

So, because of the zero page mapping at start, a process RSS may be much lower than the amount of virtual memory it has requested from the system. Also, if another process B "steals" a page slot from a given process A, the RSS for A goes down and goes up for B.
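
Here is a small experiment (my sketch, Linux-specific) that makes the mapping-vs-residency distinction visible: map a region, check RSS, touch every page, then check RSS again.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t len = 256 * 1024 * 1024;   /* 256MB of anonymous memory */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* RSS barely moves: the pages are mapped but not yet resident. */
    printf("after mmap:  ");
    fflush(stdout);
    system("grep VmRSS /proc/self/status");

    memset(p, 1, len);   /* write every page; the kernel must now back them with RAM */

    printf("after touch: ");
    fflush(stdout);
    system("grep VmRSS /proc/self/status");

    munmap(p, len);
    return 0;
}
```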

The process "working set" is the minimum number of pages the kernel must keep resident for the process to prevent the process from excessively page faulting to get a physical memory page, based on some measure of "excessively". Each OS has its own ideas about this and it's usually a tunable parameter on a system-wide or per-process basis.

If a process allocates a 3GB array, but only accesses the first 10MB of it, it will have a lower working set than if it randomly/scattershot accessed all parts of the array.

That is, if the RSS is higher [or can be higher] than the working set, the process will run well. If the RSS is below the working set, the process will page fault excessively. This can be either because it has poor "locality of reference" or because other events in the system conspire to "steal" the process's page slots.


malloc and arenas

To cut down on fragmentation, malloc uses multiple arenas. Each arena has a "preferred" allocation size (aka "chunk" size). That is, smaller requests like malloc(32) come from (e.g.) arena A, but larger requests like malloc(1024 * 1024) come from a different arena (e.g.) arena B.

This prevents a small allocation from "burning" the first 32 bytes of the last available chunk in arena B, making it too short to satisfy the next malloc(1M).

Of course, we can't have a separate arena for each requested size, so the "preferred" chunk sizes are typically some power of 2.

When creating a new arena for a given chunk size, malloc doesn't just request an area of the chunk size, but some multiple of it. It does this so it can quickly satisfy subsequent requests of the same size without having to do an mmap for each one. Since the minimum size is 4096, arena A will have 4096/32 chunks or 128 chunks available.
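
To make this observable, here is a rough sketch (not from the original answer; exact numbers depend on the glibc version and its thresholds): allocate many identical chunks and compare malloc_stats() before and after. The "system bytes" figure grows in larger, page-multiple steps rather than one page per request.

```c
#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    enum { N = 1000, SZ = 65536 };   /* mimic the 65k IO buffers from the question */
    static void *p[N];

    malloc_stats();                  /* "system bytes" before */

    for (int i = 0; i < N; i++)
        p[i] = malloc(SZ);

    malloc_stats();                  /* "system bytes" after: grows in big page-multiple steps */

    for (int i = 0; i < N; i++)
        free(p[i]);

    return 0;
}
```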


free and munmap

When an application does a free(ptr) [ptr represents a chunk], the chunk is marked as available. free could choose to combine contiguous chunks that are free/available at that time or not.

If the chunk is small enough, it does nothing more (i.e. the chunk is available for reallocation), but free does not try to release the chunk back to the kernel. For larger allocations, free will [try to] do munmap immediately.

munmap can unmap a single page [or even a small number of bytes], even if it comes in the middle of an area that was multiple pages long. If so, the application now has a "hole" in the mapping.
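
For example, a sketch of that mechanism (not something malloc literally does for every chunk): unmap a single page out of the middle of a larger mapping, leaving a hole.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 16 * page;

    char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED)
        return 1;

    /* Unmap only page 5 of the 16-page area: pages 0-4 and 6-15 stay mapped,
     * and the address space now has a hole in the middle. */
    if (munmap(area + 5 * page, page) == 0)
        printf("page at %p unmapped; touching it now would fault\n",
               (void *)(area + 5 * page));

    munmap(area, len);   /* unmapping the whole range is fine even with the hole */
    return 0;
}
```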


malloc_trim and madvise

If free is called, it probably calls munmap. If an entire page has been unmapped, the RSS of the process (e.g. A) goes down.

But, consider chunks that are still allocated, or chunks that were marked as free/available but were not unmapped.

They are still part of the process A's RSS. If another process (e.g. B) starts doing lots of allocations, the system may have to page out some of process A's slots to the paging disk [reducing A's RSS] to make room for B [whose RSS goes up].

But, if there is no process B to steal A's page slots, process A's RSS can remain high. Say process A allocated 100MB, used it a while back, but is only actively using 1MB now; the RSS will still remain at 100MB.

That's because without the "interference" from process B, the kernel had no reason to steal any page slots from A, so they "remain on the books" in the RSS.

To tell the kernel that the contents of a memory area are no longer needed, there is the madvise syscall with MADV_DONTNEED. This tells the kernel that it may reclaim the physical pages backing that range right away, thereby reducing the process's RSS.

The pages remain mapped in the process's virtual address space, but for anonymous memory their contents are simply discarded rather than written to the paging disk. Remember, page mapping is different than page residency.

If the process accesses such a page again, it incurs a page fault and the kernel assigns a fresh zero-filled physical page (or, for file-backed mappings, re-reads the data from the file) and remaps it. The RSS goes back up. Classical demand paging.

madvise is what malloc_trim uses to reduce the RSS of the process.
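
A standalone sketch of that effect (my addition; for anonymous memory MADV_DONTNEED simply discards the contents): the mapping stays valid, RSS drops, and the next access gets fresh zero pages.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t len = 128 * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 1, len);                   /* make every page resident */
    printf("touched 128MB:       ");
    fflush(stdout);
    system("grep VmRSS /proc/self/status");

    madvise(p, len, MADV_DONTNEED);      /* still mapped, but no longer resident */
    printf("after MADV_DONTNEED: ");
    fflush(stdout);
    system("grep VmRSS /proc/self/status");

    /* The mapping is still valid; the next access just gets a zero-filled page. */
    printf("first byte is now %d\n", p[0]);

    munmap(p, len);
    return 0;
}
```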

Craig Estey
  • _If free is called, it probably calls `munmap`_ - more like it does not because it is not required to. – Maxim Egorushkin Sep 29 '16 at 10:11
  • Why would the kernel move a page to disk if it was released? – Johnny V Sep 29 '16 at 12:01
  • @MaximEgorushkin It's in the `glibc` source [which I had been looking at when I was writing the answer]. If the chunk was mmap'ed, `free` _immediately_ calls `munmap`. It only does the more traditional chunk combine/split for the `sbrk`-allocated chunks. – Craig Estey Sep 29 '16 at 17:42
  • @CraigEstey Interesting. There would not be this question if it did unmap it for the OP... – Maxim Egorushkin Sep 29 '16 at 17:44
  • @MaximEgorushkin OP's JNI allocs (65k) are small enough to use `sbrk`. `malloc` first tries to fill a request from existing areas, consolidating as possible/needed [huge amount of code for this]. If malloc needs to ask system for memory: An alloc of >1MB is `mmap` if number of already mmap'ed areas is below a threshold. Otherwise, it uses `sbrk`. But, if `sbrk` fails for some reason it falls back to `mmap`. `malloc` – Craig Estey Sep 29 '16 at 19:26
  • @MaximEgorushkin To get a better baseline, OP may want to call `malloc_trim` after the JVM is "up and running", and has done some GC to minimize its effects. The stats could be large because JVM is holding the memory, or it's released into malloc free areas but not back to kernel. Then, monitor `malloc_stats`, etc. during JNI allocs. Rather than use `malloc_trim` everywhere, the "autotrim" threshold can be set with `mallopt` – Craig Estey Sep 29 '16 at 19:45
  • The kernel doesn't page to disk if it doesn't have to. Only "dirty" pages [pages modified, but not yet written to the "backing store"]. If a page is unmapped, it can just be discarded/reused _if_ the backing store is the paging disk. If the backing store is a _file_ [e.g. mmap to a file for writing], an unmap _must_ write this page out before marking the page as reusable. – Craig Estey Sep 29 '16 at 19:52
  • That pmap result shows that those 64kB allocations are made with mmap. – Maxim Egorushkin Sep 29 '16 at 20:10
  • @MaximEgorushkin OP said 65k not 64k [so, typo or not?] I was reading from the glibc comments re. the 1MB. I just did an actual test program. `mmap` kicks in for allocations >128KB. I'm using glibc 2.21 [fc22], and this matches the manpage for `mallopt` and the `M_MMAP_THRESHOLD` default value. So, it seems somebody (JVM?) may be setting a different threshold. – Craig Estey Sep 29 '16 at 22:51
  • @CraigEstey Are we no longer sure that `mallinfo` reserved and free values are correct? The situation in the question was one where there was significantly more memory allocated than what even `malloc` thought was. – Johnny V Sep 30 '16 at 17:15
  • If I set `M_MMAP_THRESHOLD` to a small value then the allocations will be immediately freed. Right now I'm setting it to 128K; I only added the ability to set that parameter yesterday. The JVM starts with only 187MB in RSS. – Johnny V Sep 30 '16 at 17:20
2

free does not promise to return the freed memory to the OS.

What you observe is the freed memory is kept in the process for possible reuse. More than that, free releasing memory to the OS can pose a performance problem when allocation and deallocation of large chunks happen frequently. This is why there is an option to return the memory to the OS explicitly with malloc_trim.

Try malloc_trim(0) and see if that reduces the RSS. This function is non-standard, so its behaviour is implementation specific, it might not do anything at all. You mentioned in the comments that calling it did reduce RSS.
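
For example, a minimal sketch of that experiment (my addition; how much is actually released depends on the glibc version and on how fragmented the arenas are):

```c
#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    enum { N = 1000, SZ = 65536 };
    static void *p[N];

    for (int i = 0; i < N; i++)
        p[i] = malloc(SZ);
    for (int i = 0; i < N; i++)
        free(p[i]);

    malloc_stats();      /* freed, but pages may still be held inside the arenas */
    malloc_trim(0);      /* explicitly release whatever glibc can hand back to the OS */
    malloc_stats();      /* RSS drops; "system bytes" shrinks where the heap top is trimmable */

    /* Alternative, per the comments above: lower the mmap threshold so that
     * 65k requests are served by mmap and returned to the kernel on free():
     *     mallopt(M_MMAP_THRESHOLD, 64 * 1024);
     */
    return 0;
}
```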

You may like to make sure that there are no memory leaks and memory corruption before you start digging deeper.


With regard to the keepcost member, see man mallinfo:

BUGS

Information is returned for only the main memory allocation area. Allocations in other arenas are excluded. See malloc_stats(3) and malloc_info(3) for alternatives that include information about other arenas.

Maxim Egorushkin
  • The problem right now is that `malloc` reports that a small amount of memory can be released via `malloc_trim` however several gigabytes is actually released. This leads me to believe that the counter is broken somehow and the automatic call to `malloc_trim` is not working. – Johnny V Sep 28 '16 at 20:10
  • @JohnnyV _malloc reports that a small amount of memory can be released_ - how does it report that? – Maxim Egorushkin Sep 29 '16 at 09:32
  • `mallinfo..keepcost` tells you how much free memory is at the top of the heap. However, `malloc_trim` does a lot more than release just the memory listed in `keepcost`. Craig will post a better explanation. – Johnny V Sep 29 '16 at 11:52
  • @JohnnyV Craig's explanation is longer but it also boils down to _`free` does not promise to return the freed memory to the OS_. This is all there is to it. – Maxim Egorushkin Sep 29 '16 at 12:05
  • Something of note: `M_MMAP_THRESHOLD` is a very interesting parameter for `malloc` because all allocations above that value will be fully returned to the kernel upon `free` – Johnny V Sep 30 '16 at 17:07
  • It seems like `free` doesn't even necessarily return empty pages to the kernel either. The only time it's guaranteed is if the allocation is above the `M_MMAP_THRESHOLD` – Johnny V Sep 30 '16 at 17:24
  • @JohnnyV, malloc_trim has had an undocumented feature since 2007 (glibc 2.9): it does `madvise(..MADV_DONTNEED)` for every free page inside the heap (in the middle). This is not documented by the glibc authors (and not accounted for in malloc info/stats), nor by the author of the malloc_trim man page; check http://stackoverflow.com/questions/28612438/can-malloc-trim-release-memory-from-the-middle-of-the-heap/42273711. And yes, there **are no automatic calls to malloc_trim** in glibc malloc/malloc, there is only the comment "When malloc_trim is called automatically from free". – osgx Feb 16 '17 at 19:14