I've been measuring the performance of different C/C++ allocation (and initialization) techniques for big contiguous chunks of memory. To do so, I allocated (and wrote to) 100 randomly selected sizes, drawn from a uniform distribution over the range 20 to 4096 MB, and measured the time using std::chrono::high_resolution_clock.
Each measurement is done by a separate execution of a program, i.e. there should be no memory reuse (at least within the process).
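For reference, reportTimeUs (used in the snippets below) is essentially a wrapper of this shape; this is a simplified sketch, the exact implementation is in the linked repo:
#include <chrono>
#include <cstdio>
#include <string>

// Simplified sketch of the timing wrapper: run the callable once and
// report the elapsed wall-clock time in microseconds.
template <typename F>
void reportTimeUs(const std::string &label, F &&f) {
    auto start = std::chrono::high_resolution_clock::now();
    f();
    auto end = std::chrono::high_resolution_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::printf("%s: %lld us\n", label.c_str(), static_cast<long long>(us));
}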
madvise ON refers to calling madvise with the MADV_HUGEPAGE flag, i.e. enabling transparent huge pages (2 MB on my systems).
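Concretely, the call has this shape (enable_thp is just an illustrative name, not something from the repo); note that the address passed to madvise must be page-aligned, otherwise the call fails with EINVAL:
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Ask the kernel to back [p, p + bytes) with transparent huge pages.
// p must be page-aligned, otherwise madvise fails with EINVAL.
static void enable_thp(void *p, std::size_t bytes) {
    if (madvise(p, bytes, MADV_HUGEPAGE) != 0) {
        std::perror("madvise(MADV_HUGEPAGE)");
    }
}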
Using a single 16 GB module of DDR4 with a clock speed of 2400 MT/s and a data width of 64 bits, I get a theoretical maximum speed of 17.8 GB/s (2400 MT/s × 8 bytes per transfer = 19.2 × 10⁹ B/s, i.e. ~17.9 GiB/s).
On Ubuntu 18.04.5 LTS (4.15.0-118-generic), memset of the already-allocated memory block gets close to the theoretical limit, but the page_aligned allocation + memset is somewhat slower, as expected. new is very slow, probably due to its internal overhead (values in GB/s):
| method | madvise | median (GB/s) | std (GB/s) |
| --- | --- | --- | --- |
| memset | OFF | 17.3 | 0.32 |
| page_aligned+memset | ON | 11.4 | 0.21 |
| mmap+memset | ON | 11.3 | 0.23 |
| new<double>[]() | ON | 3.2 | 0.06 |
Using two modules, I was expecting close to double the performance (say 35 GB/s) due to dual channel, at least for the write operation:
| method | madvise | median (GB/s) | std (GB/s) |
| --- | --- | --- | --- |
| memset | OFF | 28.0 | 0.23 |
| mmap+memset | ON | 14.5 | 0.18 |
| page_aligned+memset | ON | 14.4 | 0.17 |
As you can see, memset() reaches only 80% of the theoretical speed. Allocation + write speed increases only by 3 GB/s, reaching only 40% of the theoretical speed of the memory.
To make sure that I did not mess something up in the OS (I have been using it for a few years now), I installed a fresh Ubuntu 20.04 (dual boot) and repeated the experiment. The fastest operations were these:
| method | madvise | median (GB/s) | std (GB/s) |
| --- | --- | --- | --- |
| memset | OFF | 29.1 | 0.86 |
| page_aligned+memset | ON | 10.5 | 0.27 |
| mmap+memset | ON | 10.5 | 0.31 |
As you can see, the results are reasonably similar for memset, but actually even worse for the allocation + write operations.
Are you aware of a faster way of allocating (and initializing) big chunks of memory? For the record, I have tested combinations of malloc, new float/double arrays, calloc, operator new, mmap and page_aligned (aligned_alloc) for allocation, and memset and a for loop for writing, together with the madvise flag (sketches of the calloc and for-loop variants are at the end of this post).
The complete benchmark is located here: https://github.com/DStrelak/memory_allocation_bench. Below are the snippets for the methods mentioned above.
memset:
void *p = malloc(bytes);
memset(p, std::numeric_limits<unsigned char>::max(), bytes); // write there, so we know it's allocated
reportTimeUs("memset", [&]{memset(p, 0, bytes);});
page_aligned+memset:
reportTimeUs("page_aligned" + use_madvise_str + cSeparator + bytes_str, [&]{
p = aligned_alloc(PAGE_SIZE, bytes);
if (use_madvise) madvise(p, bytes, MADV_HUGEPAGE);
});
mmap+memset:
reportTimeUs("mmap+memset" + use_madvise_str + cSeparator + bytes_str, [&]{
    p = mmap(0, bytes, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (use_madvise) madvise(p, bytes, MADV_HUGEPAGE);
    memset(p, std::numeric_limits<unsigned char>::max(), bytes);
});
new<double>[]():
reportTimeUs("new<double>[]()" + use_madvise_str + cSeparator + bytes_str, [&]{
    p = new double[count_double]();
    if (use_madvise) madvise(p, bytes, MADV_HUGEPAGE);
});
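For completeness, the calloc and plain for-loop variants mentioned above look roughly like this (a simplified sketch following the same pattern as the snippets above; the labels and exact code in the repo may differ):
// calloc: the allocator already returns zeroed memory; the write forces the pages in.
reportTimeUs("calloc+memset" + use_madvise_str + cSeparator + bytes_str, [&]{
    p = calloc(bytes, 1);
    if (use_madvise) madvise(p, bytes, MADV_HUGEPAGE);
    memset(p, std::numeric_limits<unsigned char>::max(), bytes);
});

// for loop: write the block element by element instead of via memset.
reportTimeUs("malloc+for_loop" + use_madvise_str + cSeparator + bytes_str, [&]{
    auto *d = static_cast<double*>(malloc(bytes));
    if (use_madvise) madvise(d, bytes, MADV_HUGEPAGE);
    for (size_t i = 0; i < bytes / sizeof(double); ++i) {
        d[i] = 1.0;
    }
    p = d;
});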