
I'm trying to understand the relationship between

container_memory_working_set_bytes vs process_resident_memory_bytes vs total_rss (container_memory_rss) + file_mapped, so as to be better equipped for alerting on the possibility of an OOM kill.

[Graph comparing the memory metrics listed below]

The numbers go against my understanding (which is puzzling me right now), given that the container/pod is running a single process executing a compiled program written in Go.

Why is container_memory_working_set_bytes so much bigger (nearly 10 times more) than process_resident_memory_bytes?

Also, the relationship between container_memory_working_set_bytes and container_memory_rss + file_mapped is strange here; it is not what I expected after reading here:

> The total amount of anonymous and swap cache memory (it includes transparent hugepages), and it equals the value of total_rss from the memory.stat file. This should not be confused with the true resident set size or the amount of physical memory used by the cgroup. rss + file_mapped will give you the resident set size of the cgroup. It does not include memory that is swapped out. It does include memory from shared libraries as long as the pages from those libraries are actually in memory. It does include all stack and heap memory.

So if the cgroup's total resident set size is rss + file_mapped, how can this value be less than container_memory_working_set_bytes for a container running in the given cgroup?

This makes me feel that I'm not reading these stats correctly.

The following are the PromQL queries used to build the above graph:

  • `process_resident_memory_bytes{container="sftp-downloader"}`
  • `container_memory_working_set_bytes{container="sftp-downloader"}`
  • `go_memstats_heap_alloc_bytes{container="sftp-downloader"}`
  • `container_memory_mapped_file{container="sftp-downloader"} + container_memory_rss{container="sftp-downloader"}`
Noobie
  • How about go_memstats_heap_inuse_bytes? container_memory_working_set_bytes is the more trustworthy metric anyway. – suiwenfeng Jul 19 '21 at 04:33
  • Agreed, but people are recommending to watch both container_memory_rss and working_set for OOM kills. And the underlying thing I intend to understand is the relationship of rss vs working_set. – Noobie Jul 19 '21 at 17:08
  • FYI, https://medium.com/@eng.mohamed.m.saeed/memory-working-set-vs-memory-rss-in-kubernetes-which-one-you-should-monitor-8ef77bf0acee – suiwenfeng Jul 20 '21 at 12:09
  • I've seen that; it's not useful in terms of the question I'm asking. – Noobie Jul 20 '21 at 16:04

2 Answers


So the relationship seems to be like this:

container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file

container_memory_usage_bytes, as its name implies, is the total memory used by the container. Since it also includes file cache (i.e. inactive_file, which the OS can release under memory pressure), subtracting the inactive_file gives container_memory_working_set_bytes.

The relationship between container_memory_rss and container_memory_working_set_bytes can be summed up with the following expression:

container_memory_usage_bytes = container_memory_cache + container_memory_rss 

cache reflects data stored on disk that is currently cached in memory; it contains active_file + inactive_file (mentioned above).

This explains why container_memory_working_set_bytes was higher.
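To make the arithmetic concrete, below is a minimal Go sketch (not from the original answer) that recomputes the working set the way cAdvisor does: cgroup usage minus total_inactive_file, clamped at zero. It assumes cgroup v1 mounted at the conventional /sys/fs/cgroup/memory path; under cgroup v2 the files are memory.current and the inactive_file counter in memory.stat instead.

```go
// Sketch: recompute the working set from cgroup v1 files, mirroring the
// formula above. Paths are assumptions; adjust for your environment.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readUint parses a file holding a single integer, e.g. memory.usage_in_bytes.
func readUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

// statValue scans memory.stat for a named counter such as total_inactive_file.
func statValue(path, key string) (uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if fields := strings.Fields(sc.Text()); len(fields) == 2 && fields[0] == key {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("%s not found in %s", key, path)
}

func main() {
	const root = "/sys/fs/cgroup/memory" // cgroup v1 mount point; an assumption
	usage, err := readUint(root + "/memory.usage_in_bytes")
	if err != nil {
		panic(err)
	}
	inactive, err := statValue(root+"/memory.stat", "total_inactive_file")
	if err != nil {
		panic(err)
	}
	// working set = usage minus reclaimable file cache, clamped at zero.
	var workingSet uint64
	if usage > inactive {
		workingSet = usage - inactive
	}
	fmt.Printf("usage=%d inactive_file=%d working_set=%d\n", usage, inactive, workingSet)
}
```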

Ref #1

Ref #2

Noobie
  • Not sure if I am reading the equations incorrectly, but it seems they could be reformatted as `container_working_set_bytes = container_memory_rss + total_active_file` and `container_memory_usage_bytes = container_memory_rss + total_inactive_file + total_active_file`. Which means `container_working_set_bytes` should always be smaller than `container_memory_usage_bytes`. – bluefog Dec 21 '21 at 11:43
  • Yes, correct. Since container_memory_usage_bytes also includes caches, in a nutshell container_working_set_bytes is what RSS is in real terms. – Noobie Dec 28 '21 at 10:06
  • @bluefog it seems like you missed container_memory_swap and kernel memory in your equations. As I understand this, container_working_set_bytes = container_memory_rss + total_active_file + container_memory_swap + kernel memory and container_memory_usage_bytes = container_memory_rss + total_inactive_file + total_active_file + container_memory_swap + kernel memory – WildWind03 Feb 17 '22 at 04:36

Not really an answer, but still two assorted points.

Does this help make sense of the chart?

Here at my $dayjob, we have faced various issues with how tools external to the Go runtime count and display the memory usage of a process executing a program written in Go.
Coupled with the fact that Go's GC on Linux does not actually release freed memory pages to the kernel, but merely madvise(2)s it that such pages are MADV_FREE, a GC cycle which has freed quite a hefty amount of memory does not result in any noticeable change in the "process RSS" readings taken by external tooling (usually cgroup stats).

Hence we export our own metrics, obtained by periodically calling runtime.ReadMemStats (and runtime/debug.ReadGCStats) in any major service written in Go, with the help of a simple package written specifically for that. These readings reflect the Go runtime's own view of the memory under its control.
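That package isn't shown in the answer; purely as an illustration, here is a minimal sketch of the same idea using the Prometheus Go client. The metric names and port are hypothetical, and note that client_golang's default registry already ships a Go collector exporting go_memstats_* (which is where the question's go_memstats_heap_alloc_bytes comes from); a hand-rolled gauge is only needed to surface additional MemStats fields you care about.

```go
// Illustrative sketch: expose selected runtime.MemStats fields as gauges.
package main

import (
	"log"
	"net/http"
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// GaugeFunc re-reads the runtime's own view of memory on every scrape.
	heapInuse := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "myapp_heap_inuse_bytes", // hypothetical metric name
		Help: "Bytes in in-use heap spans, per runtime.ReadMemStats.",
	}, func() float64 {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms) // briefly stops the world; fine at scrape rate
		return float64(ms.HeapInuse)
	})
	nextGC := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "myapp_next_gc_bytes", // hypothetical metric name
		Help: "Target heap size of the next GC cycle (MemStats.NextGC).",
	}, func() float64 {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		return float64(ms.NextGC)
	})
	prometheus.MustRegister(heapInuse, nextGC)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil)) // hypothetical port
}
```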

By the way, the NextGC field of the memory stats is super useful to watch if you have memory limits set for your containers: once that reading reaches or surpasses your memory limit, the process in the container is surely doomed to eventually be shot down by the oom_killer.
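As a rough illustration of that NextGC check (my sketch, not the answerer's code), assuming a cgroup v1 limit file:

```go
// Sketch: warn when the GC's next target heap size already exceeds the
// container's memory limit. The cgroup v1 path is an assumption; under
// cgroup v2 read memory.max instead.
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

func main() {
	b, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read cgroup limit:", err)
		return
	}
	limit, err := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot parse cgroup limit:", err)
		return
	}
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	if ms.NextGC >= limit {
		// The heap is allowed to grow past the limit before the next GC
		// finishes, so an OOM kill is likely before the collector can help.
		fmt.Printf("danger: NextGC=%d >= limit=%d\n", ms.NextGC, limit)
	}
}
```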

kostix
  • Well, based on that article I thought container_memory_rss + mapped_file would be the true RSS of the `cgroup`, which I assumed to be higher than `container_memory_working_set_bytes`, but actually it was not. – Noobie Jul 08 '21 at 03:46
  • Also, if I'm correct, Go 1.16 dropped support for madvise. – Noobie Jul 16 '21 at 03:20
  • @Noobie, no, it uses another flag with `madvise(2)`: [`MADV_DONTNEED` instead of `MADV_FREE`](https://golang.org/doc/go1.16#runtime) — essentially reverting the behaviour [introduced in Go 1.12](https://golang.org/doc/go1.12#runtime). Unfortunately the [man page](https://manpages.debian.org/2/madvise) is not crystal clear on what happens to the observable RSS reading of a process which marked a page with `madvise(MADV_DONTNEED)`. – kostix Jul 16 '21 at 09:01
  • If I read it correctly, it states `On Linux, the runtime now defaults to releasing memory to the operating system promptly (using MADV_DONTNEED), rather than lazily when the operating system is under memory pressure (using MADV_FREE)`. **This means process-level memory statistics like RSS will more accurately reflect the amount of physical memory being used by Go processes**, and `The resident set size (RSS) of the calling process will be immediately reduced`, however.