
My team develops a complex multiprocess C++ based system running on Embedded Linux. Since there is no swap partition, a gradually growing memory leak can cause major trouble. (Let's assume for the sake of this discussion that all memory allocated in the system is filled with nonzero data.)

Now, as answered (tersely) here, when the operating system is out of RAM and has no swap, it discards clean pages. As far as I understand, the only "clean" pages in this situation are those containing const data and currently or recently executing code, both from the Linux environment generally and from our executables and shared libraries in particular. These may be harmlessly discarded and later reloaded from the filesystem as needed.

At first the least recently used pages are the first to go, so this is hardly noticed, but as more and more memory is allocated and the wiggle room shrinks, code that is needed more often gets discarded and then reloaded. The system starts to silently and invisibly thrash; the only sign we see is the system becoming slower and less responsive, until eventually the kernel's OOM killer steps in and does its thing.

This situation doesn't necessarily require a memory leak; it can happen simply because the natural memory requirements of our software exceed the available RAM. Such a situation is even harder to catch because the system won't crash, and the performance hit caused by the thrashing is not always immediately noticeable and can be confused with other causes of bad performance (such as an inefficient algorithm).

I'm looking for a way to catch and flag this issue unambiguously before performance starts getting hit; ideally I'd like to monitor the amount of clean page discards that occur, hopefully without requiring a specially rebuilt kernel. Then I can establish some threshold beyond which an error will be raised. Of course any better ideas will be appreciated too.

I've tried other approaches, such as monitoring process memory usage with top or having processes police their own usage with mallinfo(3), but these don't catch all situations or clearly answer the question of what the overall memory usage status is. Another thing I've looked at is the "free" column in the output of free, but it can show a low value whether or not thrashing is taking place.
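For reference, here is roughly what the mallinfo(3) self-policing looked like; this is a minimal sketch, and the budget constant is a made-up example rather than a value from our system. Note that mallinfo's fields are plain ints and wrap past 2 GB; on glibc 2.33 and later, mallinfo2(3) avoids this.

```c
/* Sketch of per-process heap self-policing via glibc's mallinfo(3).
 * HEAP_BUDGET_BYTES is a hypothetical per-process cap. */
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

#define HEAP_BUDGET_BYTES (64UL * 1024 * 1024)  /* illustrative value */

static void check_own_heap(void)
{
    struct mallinfo mi = mallinfo();
    /* uordblks: bytes allocated via malloc; hblkhd: bytes in mmap'd blocks */
    unsigned long in_use = (unsigned long)mi.uordblks + (unsigned long)mi.hblkhd;

    if (in_use > HEAP_BUDGET_BYTES)
        fprintf(stderr, "warning: heap usage %lu exceeds budget\n", in_use);
}

int main(void)
{
    void *p = malloc(1024);  /* some allocation activity */
    check_own_heap();
    free(p);
    return 0;
}
```

The weakness, as noted above, is that each process only sees its own heap; no single process can tell whether the system as a whole is running out of room.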

itaych
  • You can't really, not as a user-mode application anyway. And in fact it's even worse than what he described, Linux doesn't even tell you there's no memory when you try to allocate. It will literally lie to you and return a non-null pointer from `malloc`, despite it not having enough memory to back it up, and will only throw an exception when you *use* the memory returned to you. – Blindy Jan 29 '18 at 16:11
  • @Blindy that is true for the default configuration, but the kernel can simply be configured not to overcommit on memory. – Frank Meerkötter Jan 30 '18 at 05:26
  • @itaych why do you think the output of free is wrong? Can you elaborate? – Frank Meerkötter Jan 30 '18 at 05:34
  • @Blindy - I didn't mean to get user level API access to the innards of kswapd or whatever module is responsible for this paging. I meant to read this statistic from some file, e.g. /proc/vmstat or similar. There are probably thousands of statistical values across dozens or hundreds of readable virtual text files, surely one of them contains what I'm looking for? – itaych Jan 30 '18 at 09:40
  • @FrankMeerkötter - Here is what happens in my experiments: Suppose I'm on a 1 GB system and 980 MB are used, 'free' will indicate 20 MB free. Then if I allocate+fill another 100 MB, the system will (apparently) discard 100 MB of clean pages and 'free' will still say 20 MB free. This isn't strictly wrong - just misleading and missing crucial information (which is that 100 MB have just been paged out to keep that memory free!). I can repeat this 5 or 6 times before the oom-killer kills the memory hungry process, and until then 'free' will continue to happily and uselessly report 20 MB free. – itaych Jan 30 '18 at 09:40
  • @itaych: which of the values reported by free are you referring to? Are caches and buffers for example already included or not? – Frank Meerkötter Jan 30 '18 at 17:12
  • @FrankMeerkötter: With the production app running, this is 'free' before and after my memory eater allocs 400 MB (a sketch of such a memory eater follows these comments):

    ```
    Before:
                 total       used       free     shared    buffers     cached
    Mem:       2066228    2050084      16144     972068       2616    1395008
    -/+ buffers/cache:     652460    1413768
    Swap:            0          0          0

    After:
                 total       used       free     shared    buffers     cached
    Mem:       2066228    2048388      17840     972076        132     985972
    -/+ buffers/cache:    1062284    1003944
    Swap:            0          0          0
    ```

    – itaych Jan 31 '18 at 09:11
  • @FrankMeerkötter: The value I am referring to, as said in my question, is "free". I can't see any other value approaching something that looks critically high or low as I run out of RAM and more things get paged out. ('buffers' becomes low but doesn't refer to what I'm looking for.) – itaych Jan 31 '18 at 09:17
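For completeness, here is a minimal sketch of the kind of "memory eater" used in the experiment above: it allocates in 100 MB steps and fills each block so the pages are dirty and cannot be discarded. The chunk size and sleep interval are illustrative, and with the default overcommit settings the allocation itself rarely fails:

```c
/* Memory eater sketch: dirty 100 MB per step so the pages can't be
 * discarded as clean. Sizes and timing are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (100 * 1024 * 1024)

int main(void)
{
    for (;;) {
        char *p = malloc(CHUNK);
        if (p == NULL) {          /* rarely reached under default overcommit */
            perror("malloc");
            return 1;
        }
        memset(p, 0xA5, CHUNK);   /* touch every page: now they are dirty */
        puts("allocated and filled another 100 MB; check `free` now");
        sleep(5);                 /* leave time to observe free's output */
    }
}
```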

2 Answers


Alex's answer pointed me in the right direction by mentioning page faults, but the more specific answer is major page faults. From the perf_event_open(2) man page:

```
PERF_COUNT_SW_PAGE_FAULTS_MAJ
       This counts the number of major page faults.
       These required disk I/O to handle.
```

So while these are not the clean page discards I asked about, they are their counterpart: they indicate when something that was previously discarded gets read back in from disk. On a swapless system, the only pages that can be read back in from disk are clean pages that were discarded earlier. In my tests I've found that these faults are normally few and far between but spike suddenly when memory is low (on my system it's something like 3 or more faults per second for over 5 consecutive seconds), and this indication is consistent with the system becoming slower and less responsive.

As for actually querying this statistic, this was answered in Measure page faults from a C program, but I recommend starting from the code example at the end of the perf_event_open(2) man page (see link above), with this change:

```c
pe.type = PERF_TYPE_SOFTWARE;
pe.config = PERF_COUNT_SW_PAGE_FAULTS_MAJ;
```

Assuming you want to get a system-wide statistic and not just pertaining to the current process, change the actual open line to:

```c
fd = perf_event_open(&pe, -1, cpu, -1, 0);
```

The cpu argument here is tricky. On a single-core, single-CPU system just set it to 0. Otherwise you will have to open a separate performance counter (with a separate fd) for each core, read them all and sum their results; for a thread explaining why, see here. The easiest way to get the number of cores is get_nprocs(3).
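Putting the pieces together, here is a minimal sketch of a system-wide major-fault monitor, adapted from the man page example. The 3-faults-per-second threshold and the 5-second window are just the rule of thumb from my tests above and will need tuning per system; also note that counting all processes (pid = -1) requires root or a sufficiently permissive /proc/sys/kernel/perf_event_paranoid.

```c
/* Sketch: system-wide major page fault monitor using one perf
 * counter per CPU. Thresholds are illustrative. */
#include <linux/perf_event.h>
#include <asm/unistd.h>
#include <sys/sysinfo.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr pe;
    int ncpus = get_nprocs();
    int *fds = calloc(ncpus, sizeof *fds);

    memset(&pe, 0, sizeof pe);
    pe.type = PERF_TYPE_SOFTWARE;
    pe.size = sizeof pe;
    pe.config = PERF_COUNT_SW_PAGE_FAULTS_MAJ;

    /* pid = -1, cpu = n counts events from all processes on CPU n.
     * One counter (and one fd) per core, as discussed above. */
    for (int cpu = 0; cpu < ncpus; cpu++) {
        fds[cpu] = perf_event_open(&pe, -1, cpu, -1, 0);
        if (fds[cpu] == -1) {
            perror("perf_event_open");
            return 1;
        }
    }

    long long prev_total = 0;
    int hot_seconds = 0;

    for (;;) {
        sleep(1);

        /* Sum the per-CPU counters into one system-wide total. */
        long long total = 0;
        for (int cpu = 0; cpu < ncpus; cpu++) {
            long long count = 0;
            if (read(fds[cpu], &count, sizeof count) != sizeof count)
                perror("read");
            total += count;
        }

        long long per_second = total - prev_total;
        prev_total = total;

        /* Raise the flag only on a sustained spike, not a one-off fault. */
        hot_seconds = (per_second >= 3) ? hot_seconds + 1 : 0;
        if (hot_seconds > 5)
            fprintf(stderr, "sustained major page faults (%lld/s): "
                            "likely thrashing\n", per_second);
    }
}
```

If perf_event_open is not usable on your target, the pgmajfault line in /proc/vmstat exposes the same system-wide counter and can be polled and differenced in the same way.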

itaych

I think the metric you're looking for is page faults. As an absolute value it cannot tell you anything, since page faults are a normal part of system operation; but as a relative value, perhaps it can be useful: if you graph the number of page faults your program generates at different levels of memory usage, I bet there will be a significant jump at the point where your program exceeds the available RAM and starts this clean-page discarding and reloading behavior.

  • Actually it's _major_ page faults. I've detailed my findings in a separate answer. This helped a lot, thanks! – itaych Feb 04 '18 at 12:20