
In a numerical physics project of mine, I'd like to compare the memory usage of different methods for solving the same problem. I've found out that I can include <sys/resource.h> and use getrusage() to get the peak memory usage (the maximum resident set size) from ru_maxrss (with some caveats that I don't think I need to care about).

For benchmarking, I essentially run code blocks like these for all the different methods I've implemented:

int minN = 6;
int maxN = 16;
std::chrono::steady_clock::time_point start;
std::chrono::steady_clock::time_point finish;

std::cout << "Naive:" << std::endl;
for (int N = minN; N <= maxN; N+=2) {
    struct rusage usage{};
    start = std::chrono::steady_clock::now();
    //do work...
    finish = std::chrono::steady_clock::now();
    int ret = getrusage(RUSAGE_SELF, &usage);
    if (ret != 0) { std::cerr << "getrusage failed" << std::endl; }

    long time_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
    long max_ram_kb = usage.ru_maxrss; // on Linux, ru_maxrss is in kilobytes
    std::cout << "N = " << N << ", time = " << time_ns/1e9 << " s, ram = " << max_ram_kb << " KB" << std::endl;
}

Now, the problem is that ru_maxrss contains the maximum amount of used memory for the whole lifetime of the program, i.e. it is not reduced if a "large" object goes out of scope. Thus, the output of the whole program will look something like this:

Naive:
N = 6, time = 0.022541 s, ram = 8028 KB
N = 8, time = 0.0234674 s, ram = 65360 KB
N = 10, time = 0.373676 s, ram = 135284 KB
N = 12, time = 21.7536 s, ram = 631792 KB
Magnetization:
N = 6, time = 0.000166585 s, ram = 631792 KB
N = 8, time = 0.00158378 s, ram = 631792 KB
N = 10, time = 0.022255 s, ram = 631792 KB
N = 12, time = 0.405172 s, ram = 631792 KB
Momentum:
N = 6, time = 0.000175482 s, ram = 631792 KB
N = 8, time = 0.000766058 s, ram = 631792 KB
N = 10, time = 0.00658272 s, ram = 631792 KB
N = 12, time = 0.0728279 s, ram = 631792 KB
Parity:
N = 8, time = 0.000986243 s, ram = 631792 KB
N = 12, time = 0.0528302 s, ram = 631792 KB
Spin Inversion:
N = 8, time = 0.00111167 s, ram = 631792 KB
N = 12, time = 0.050363 s, ram = 631792 KB

Once memory usage has peaked, the reported memory usage of my benchmark is useless. I realize that, in principle, this is how getrusage() is supposed to work. Is there a way to reset this metric? Or can anyone recommend another easy way to measure memory usage from inside the program that does not involve using specific benchmarking libraries?

Regards

PS: Does anyone know whether or in which cases ru_maxrss is in B or KB? For N = 8, I store a matrix with 65536 double elements. This matrix should dominate memory usage and I'd expect it to take up about 65536 Bytes of memory. My benchmark reports that I use 65360 KB, as the documentation of getrusage() says the result is in KB. This is eerily close to the estimated number of Bytes I was expecting. So is the result really in KB and this is purely a coincidence?

Update: I got what I wanted working by parsing /proc/self/stat; I'll share my updated code below in case anyone finds this in the future. Note that rss, the 24th field of stat, is in pages, so one must multiply it by the page size (typically 4096 bytes, but it is safer to query it via sysconf(_SC_PAGESIZE)) to get an approximation of the used amount of RAM in B.

std::cout << "Naive:" << std::endl;
for (int N = minN; N <= maxN; N+=2) {
    start = std::chrono::steady_clock::now();
    // do work...
    finish = std::chrono::steady_clock::now();

    // needs <fstream>, <sstream> and <unistd.h> (for sysconf)
    std::ifstream statFile("/proc/self/stat");
    std::string statLine;
    std::getline(statFile, statLine);
    std::istringstream iss(statLine);
    std::string entry;
    for (int i = 1; i <= 24; i++) { // rss is the 24th space-separated field
        std::getline(iss, entry, ' ');
    }
    long long memUsage = std::stoll(entry); // in pages, not bytes; stoi could overflow here

    long pageSize = sysconf(_SC_PAGESIZE); // usually 4096 B, but don't hard-code it
    long time_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
    std::cout << "N = " << N << ", time = " << time_ns/1e9 << " s, ram = " << pageSize*memUsage/1e9 << " GB" << std::endl;
}
  • It must be KB. Note that 65536 `double` elements alone take `65536*8` bytes, and RSS counts all the rest of your program's code and data, including whatever comes from library functions. No C++ program on a present-day operating system could plausibly have an RSS of only 64K bytes. – Nate Eldredge Jun 30 '22 at 15:23
  • Being able to reset the maximum might be useful, but it would also invalidate the statistics that are reported back to the parent when your process exits. So I'm pretty sure that's not possible. You can get your *current* RSS by reading `/proc/self/stat` (though the `proc(5)` man page has some caveats about its accuracy); maybe it would suffice to sample that at some reasonable rate. – Nate Eldredge Jun 30 '22 at 15:30
  • One problem you are going to have is that `free`d memory is not necessarily returned to the operating system. Either part of the page is still in use, or perhaps the `C` sub-allocator decides to keep freed memory on a free chain for later use and avoid too many trips into the kernel for new pages. – Richard Critten Jun 30 '22 at 15:43
  • Note that for large objects the runtime may be using `mmap()` to allocate storage, and then free/delete/delete[] will call `munmap()` and used memory does go down. Small objects are often allocated with (s)brk, and that can only free memory at the end. Any memory freed in the middle is kept internally for reuse. So the memory stats for the process are highly inaccurate. – Goswin von Brederlow Jun 30 '22 at 16:15
  • You can use the malloc hooks (where available), dynamically replace malloc, or overload new and install your own memory tracking. Or pass an allocator to the STL containers you use. – Goswin von Brederlow Jun 30 '22 at 16:18
  • _Side note:_ My answer here [malloc is using 10x the amount of memory necessary](https://stackoverflow.com/a/39761864) has some info on RSS. – Craig Estey Jun 30 '22 at 17:18
  • @GoswinvonBrederlow: Glibc malloc can report some usage info via `mallinfo2()` - https://www.gnu.org/software/libc/manual/html_node/Statistics-of-Malloc.html . BTW, I wouldn't say current RSS is *inaccurate* - it is the number of physical pages the OS is letting your process consume. Usually most of them are private dirty pages, so they can only be evicted by swapping them out (to compressed storage or disk), not by writing back to disk files or just dropping them (clean pages). – Peter Cordes Jul 01 '22 at 00:39
  • Your RSS for a given amount of actually allocated space depends on your malloc implementation; if the default glibc malloc doesn't perform well for your application, consider a different one that's better able to give memory back to the OS after the pattern of allocations you make, if that's desirable. (Or tweak some of the tunable settings of glibc malloc.) – Peter Cordes Jul 01 '22 at 00:41
  • @NateEldredge makes sense, thanks. I'll look into checking `/proc/...` – Max Maschke Jul 01 '22 at 04:03
  • @PeterCordes I looked at the manpages; am I correct in my understanding that I can only use `mallinfo2()` or `malloc_info()` if I actually use `malloc()`? I'm not used to C-style memory management. – Max Maschke Jul 01 '22 at 04:12
  • @Konemu: libstdc++ uses the same allocator as malloc, so it might work. (On Linux with a normal gcc setup (i.e. using libstdc++), `new`/`delete` happen to be compatible with `malloc`/`free`, even though ISO C++ doesn't require that.) – Peter Cordes Jul 01 '22 at 04:14
  • @PeterCordes Alright, I'll give it a try later today. Thanks a lot! :) – Max Maschke Jul 01 '22 at 04:18
  • @PeterCordes Current RSS is what the process currently has in physical memory. So everything including read-only data, shared data, data and bss, stack, heap, all the freed memory that is kept for reuse, and excluding anything not faulted in yet or swapped out. So inaccurate at best. Anyway, it's absolutely unsuitable for measuring how much memory a function uses if you want to measure more than one function. You can't reset the value after each function. – Goswin von Brederlow Jul 01 '22 at 09:56
  • @GoswinvonBrederlow: Right, it's an accurate measure of a process's current usage of physical RAM, but not necessarily an accurate measure of its working set; memory pressure could lead to some of it being evicted or swapped out, and it might never get paged back in, except maybe as the program exits if you're unlucky and the authors make a big deal about freeing every allocation, touching cold memory again on exit. But yeah, agreed it's not likely useful for incremental measurement of single functions, except in terms of actual in-situ measurement of a whole program including its malloc patterns. – Peter Cordes Jul 01 '22 at 10:02
  • The `time` utility can give you the number of major and minor page faults the program caused. Not sure what process variable it reads, but you can access the same fields yourself. That would at least count mapping and unmapping of large blocks of memory each time. – Goswin von Brederlow Jul 01 '22 at 10:07
  • @GoswinvonBrederlow: AFAIK, everything reported by `time` is from the rusage data returned by `wait4`. So that's also in the same data that the OP is retrieving with `getrusage`. – Nate Eldredge Jul 01 '22 at 14:51

0 Answers