
I've been measuring the performance of different C/C++ allocation (and initialization) techniques for big contiguous chunks of memory. To do so, I allocated (and wrote to) blocks of 100 randomly selected sizes, drawn from a uniform distribution over the range of 20 to 4096 MB, and measured the time using std::chrono::high_resolution_clock. Each measurement is done by a separate execution of the program, i.e. there should be no memory reuse (at least within the process).

madvise ON refers to calling madvise with the MADV_HUGEPAGE flag, i.e. enabling transparent huge pages (2 MB on my systems).

Using a single 16 GB module of DDR4 with a clock speed of 2400 MT/s and a data width of 64 bits, I get a theoretical maximum speed of 17.8 GB/s.
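(Presumably that comes from 2400 MT/s × 8 bytes per transfer = 19.2 GB/s, which corresponds to the quoted ~17.8 GB/s when expressed in GiB/s.)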

On Ubuntu 18.04.5 LTS (kernel 4.15.0-118-generic), memset of the already allocated memory block gets close to the theoretical limit, but the page_aligned allocation + memset is somewhat slower, as expected. new is very slow, probably due to its internal overhead (values in GB/s):

method              madvise     median  std
memset              madvise OFF 17.3    0.32
page_aligned+memset madvise ON  11.4    0.21
mmap+memset         madvise ON  11.3    0.23
new<double>[]()     madvise ON  3.2     0.06

Using two modules, I was expecting nearly double the performance (say 35 GB/s) due to dual channel, at least for the write operation:

method              madvise     median  std
memset              madvise OFF 28.0    0.23
mmap+memset         madvise ON  14.5    0.18
page_aligned+memset madvise ON  14.4    0.17

As you can see, memset() reaches only about 80% of the theoretical speed. Allocation + write speed increases by only 3 GB/s, reaching just 40% of the theoretical speed of the memory.

To make sure that I did not mess something up in the OS (I have been using it for a few years now), I installed a fresh Ubuntu 20.04 (dual boot) and repeated the experiment. The fastest operations were these:

method              madvise     median  std
memset              madvise OFF 29.1    0.86
page_aligned+memset madvise ON  10.5    0.27
mmap+memset         madvise ON  10.5    0.31

As you can see, the results are reasonably similar for memset, but actually even worse for allocation + write operations.

Are you aware of a faster way of allocating (and initializing) big chunks of memory? For the record, I have tested combinations of malloc, new float/double arrays, calloc, operator new, mmap and page_aligned for allocation, and memset and a for loop for writing, together with the madvise flag.


The complete benchmark is located here: https://github.com/DStrelak/memory_allocation_bench. Below are the methods mentioned above.

memset:

void *p = malloc(bytes);
memset(p, std::numeric_limits<unsigned char>::max(), bytes); // write there, so we know it's allocated
reportTimeUs("memset", [&]{memset(p, 0, bytes);});

page_aligned+memset:

reportTimeUs("page_aligned" + use_madvise_str + cSeparator + bytes_str, [&]{
    p = aligned_alloc(PAGE_SIZE, bytes);
    if (use_madvise) madvise(p, bytes, MADV_HUGEPAGE);
    memset(p, std::numeric_limits<unsigned char>::max(), bytes); // write to the block so the measurement includes the first touch
});

mmap+memset:

reportTimeUs("mmap+memset" + use_madvise_str + cSeparator + bytes_str, [&]{
    p = mmap(0, bytes, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (use_madvise) madvise(p, bytes, MADV_HUGEPAGE);
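    // note: this memset is the first touch of the mapping, so it also pays for the page faults that allocate physical memory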
    memset(p, std::numeric_limits<unsigned char>::max(), bytes);
});

new<double>[]():

reportTimeUs("new<double>[]()" + use_madvise_str + cSeparator + bytes_str, [&]{
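    // note: the trailing () value-initializes the array, so this both allocates memory and writes zeros to the whole block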
    p = new double[count_double]();
    if (use_madvise) madvise(p, bytes, MADV_HUGEPAGE);
});
David Střelák
  • Dual channel memory mode doesn't grant you double the performance. (For example, you get increased latency and some overhead.) That's all. Read about the hardware specifications. PS this is a hardware-related question and has nothing to do with programming. – paladin Nov 26 '20 at 10:06
  • @paladin this question seems on-topic to me. Memory allocation details, and even more so in C/C++, sound quite specific to programming. I definitely agree with the first part of the comment. – Pac0 Nov 26 '20 at 10:12
  • I don't fully understand your experiment: the "memset" experiment is measuring the performance of memsetting differently-sized blocks..? The "mmap+memset" is including an mmap for each memset in the measurement? – dyp Nov 26 '20 at 10:21
  • PS: some hardware architectures have special abilities to flag entire memory areas as set to _zeroes_. While this is very uncommon for _8086_-compatible **C**PUs, modern _AMD_ **G**PUs have this function, called **HyperZ**. _nVidia_ probably has something similar. – paladin Nov 26 '20 at 10:21
  • If this is about zeroing out memory, shouldn't `mmap` provide this guarantee by itself for anonymous mappings? – dyp Nov 26 '20 at 10:23
  • How about _allocating_ huge pages instead of relying on transparent huge pages? – dyp Nov 26 '20 at 10:26
  • Without the source code your measurements cannot be reproduced or verified. – Maxim Egorushkin Nov 26 '20 at 16:59
  • Did you try `MAP_POPULATE` with `mmap()`? If you did it'll make the `mmap()` slower (due to allocating physical memory), and if you didn't it'll make the `memset()` slower (due to page faults used to allocate physical memory at first write for every page); where overall "with populate" should be faster than "without populate" (due to not having page faults). – Brendan Nov 26 '20 at 23:46
  • @Brendan They could have tried anything, however, there are a few ways to get it right and a gazillion ways to get it wrong. Without the complete source code of the benchmark we don't know whether these numbers are meaningful at all. – Maxim Egorushkin Nov 27 '20 at 03:04
  • I have added the relevant parts of the code, as well as link to the entire repo. – David Střelák Nov 27 '20 at 14:46
  • @dyp I wanted to know what is the fastest way to allocate and wipe a big block of memory. My use case is batch processing of images. Therefore I have tried several methods of setting memory to the requested value (to set or unset all bits, to be precise). memset turned out to be the fastest way to write to already allocated memory. Then I tried to allocate the memory and write to it at the same time (i.e. allocate memory via mmap and then memset it). To make sure that there are no different policies implemented by the OS, I ran the same experiment for different sizes. – David Střelák Nov 27 '20 at 14:58
  • @paladin I agree that 'expecting near to double' sounds a bit optimistic, but I did not expect a 20% penalty from the theoretical maximum, compared to only 3% for a single module. – David Střelák Nov 27 '20 at 15:26
  • @dyp To the best of my knowledge, the OS only guarantees* that the memory you get does not contain somebody else's data, not that it will be zero. (*it might be zero, one, or whatever). However, I also wanted to know what is the fastest way to set/wipe memory on demand. From my understanding, THP is just a wrapper over HP, so I hoped the difference would be negligible. – David Střelák Nov 27 '20 at 15:51
  • Linux guarantees the contents to be zero: https://www.man7.org/linux/man-pages/man2/mmap.2.html (POSIX doesn't seem to provide `MAP_ANONYMOUS`, I'm confused about that..) – dyp Nov 30 '20 at 09:51
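
A rough sketch of Brendan's MAP_POPULATE suggestion above (not part of the benchmark; MAP_POPULATE pre-faults the physical pages, so the subsequent memset should no longer pay for per-page faults), reusing the p and bytes variables from the snippets above:

p = mmap(0, bytes, PROT_READ|PROT_WRITE,
         MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0); // physical memory is allocated here
memset(p, std::numeric_limits<unsigned char>::max(), bytes); // first write, but no page faults expected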

1 Answer


When you "advise huge pages", this does not guarantee that you will get huge pages. It is a best effort from the kernel. Moreover, how are Transparent Huge Pages (THP) configured on your system? What is the content of /sys/kernel/mm/transparent_hugepage/enabled?

THP may introduce overhead, as an underlying "garbage collector" kernel daemon named khugepaged is in charge of coalescing physical pages to build huge pages. Some interesting papers exist on the performance evaluation and issues of THP.

To make sure whether or not all the measurements are based on huge pages, it is preferable to disable THP and explicitly allocate huge pages from the benchmark program, as explained here for example.
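
For instance, a minimal sketch of an explicit huge-page allocation using mmap with MAP_HUGETLB (this assumes a kernel built with CONFIG_HUGETLBFS and huge pages reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages; the size and error handling are illustrative only):

#include <sys/mman.h>
#include <cstddef>
#include <cstring>
#include <cstdio>

int main() {
    const std::size_t bytes = 64UL * 1024 * 1024; // must be a multiple of the huge page size (2 MB here)
    void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        std::perror("mmap(MAP_HUGETLB)"); // typically fails with ENOMEM if no huge pages are reserved
        return 1;
    }
    std::memset(p, 0xFF, bytes); // touch the memory so it is actually backed by the reserved huge pages
    munmap(p, bytes);
    return 0;
}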

Rachid K.
  • Thanks, I will read those properly over the weekend. Unfortunately, if I need to change the kernel to support HP (and turn off THP), then such a solution is not acceptable for me :(. – David Střelák Nov 27 '20 at 16:37
  • THP can be disabled before each test by writing "never" into "/sys/kernel/mm/transparent_hugepage/enabled". For Huge Pages from user space, check if the kernel is compiled with CONFIG_HUGETLBFS. If yes, then you are ready to use HP from your programs. – Rachid K. Nov 27 '20 at 19:14