My team develops a complex multiprocess C++-based system running on Embedded Linux. Since there is no swap partition, a gradually growing memory leak can cause major trouble. (Let's assume for the sake of this discussion that all memory allocated in the system is filled with nonzero data.)
Now, as answered (tersely) here, when the operating system is out of RAM and has no swap, it discards clean pages. As far as I understand the only "clean" pages in this situation are those containing const data and currently/recently executing code from the Linux environment and particularly our executables and shared libraries, which may be harmlessly discarded and later reloaded from the filesystem as needed.
At first, the least recently used pages are the first to go, so this is hardly noticed. But as more and more memory is allocated and the amount of wiggle room shrinks, code that is needed more often gets discarded and then reloaded from the filesystem. The system starts to silently and invisibly thrash, but the only sign we see is the system becoming slower and less responsive, until eventually the kernel's oom-killer steps in and does its thing.
This situation doesn't necessarily require a memory leak; it can happen simply because the natural memory requirements of our software exceed available RAM. Such a situation is even harder to catch because the system won't crash, and the performance hit caused by the thrashing is not always immediately noticeable and can be confused with other causes of bad performance (such as an inefficient algorithm).
I'm looking for a way to catch and flag this issue unambiguously before performance starts getting hit; ideally I'd like to monitor the amount of clean page discards that occur, hopefully without requiring a specially rebuilt kernel. Then I can establish some threshold beyond which an error will be raised. Of course any better ideas will be appreciated too.
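To make the idea concrete, here is a rough sketch of the kind of check I have in mind. It assumes that the `pgmajfault` counter in `/proc/vmstat` (page faults that required I/O) is a usable proxy for discarded pages being read back in; I don't know that this is the right metric, and it depends on the kernel exposing VM event counters. The function names and the threshold are hypothetical, made up for illustration.

```cpp
#include <cstdint>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Scan text in /proc/vmstat format (one "name value" pair per line) for a
// named counter. Returns true and stores the value in `out` if it is found;
// leaves `out` untouched otherwise.
bool read_counter(std::istream& in, const std::string& name, std::uint64_t& out) {
    std::string key;
    std::uint64_t value = 0;
    while (in >> key >> value) {
        if (key == name) {
            out = value;
            return true;
        }
    }
    return false;
}

// Hypothetical helper: true when the counter grew by more than `threshold`
// between two samples. The threshold has to be tuned per system.
bool exceeds_threshold(std::uint64_t previous, std::uint64_t current,
                       std::uint64_t threshold) {
    return current >= previous && current - previous > threshold;
}

// Sample the cumulative major-fault count from the live system.
// ASSUMPTION: pgmajfault is a reasonable stand-in for "clean page was
// discarded and had to be reloaded"; this is not confirmed.
bool sample_major_faults(std::uint64_t& out) {
    std::ifstream vmstat("/proc/vmstat");
    if (!vmstat) {
        return false;
    }
    return read_counter(vmstat, "pgmajfault", out);
}
```

A watchdog process could call `sample_major_faults()` at a fixed interval, keep the previous value, and raise an error when `exceeds_threshold()` returns true. Whether a sensible threshold exists that fires before performance visibly degrades is exactly what I'm unsure about.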
I've tried other solutions such as monitoring process memory usage with top, or having processes police themselves with mallinfo(3), but this still doesn't catch all situations or clearly answer the question of what the overall memory usage status is. Another thing I've looked at is the "free" column in the output of free, but that can display a low value whether or not thrashing is taking place.