
In an attempt to measure the bandwidth of the main memory, I have come up with the following approach.

Code (for the Intel compiler)

#include <omp.h>

#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign


int main()
{
    // test-parameters
    const auto size = std::size_t{150 * 1024 * 1024} / sizeof(double);
    const auto experiment_count = std::size_t{500};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // warm-up
    for (auto counter = std::size_t{}; counter < experiment_count / 2; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        // posix_memalign leaves 'data' unspecified on failure, so check its return code
        if (posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double)) != 0)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        
        // deallocate resources
        free(data);
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        // posix_memalign leaves 'data' unspecified on failure, so check its return code
        if (posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double)) != 0)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        
        const auto dur1 = omp_get_wtime() * 1E+6;
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
        
        // deallocate resources
        free(data);
    }
    
    // REPORT
    const auto traffic = size * sizeof(double) * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
        << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    
    return 0;
}

Notes on code

  1. This is admittedly a 'naive' and Linux-only approach, but it should still serve as a rough performance indicator.
  2. Compiled with ICC using the flags -O3 -ffast-math -march=coffeelake.
  3. The size (150 MiB) is much bigger than the last-level cache of the system (9 MiB on the i5-8400, Coffee Lake). The memory is 2x 16 GiB DDR4 DIMMs at 3200 MT/s (a theoretical peak estimate follows this list).
  4. Allocating a fresh buffer on each iteration should invalidate all cache lines from the previous one (to eliminate cache hits).
  5. The minimum duration is recorded to counteract the effects of interrupts and OS scheduling: threads being taken off cores for a short while, etc.
  6. A warm-up run is done to counteract the effects of dynamic frequency scaling (a kernel feature; it can alternatively be disabled by using the userspace governor).
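
For reference, a back-of-the-envelope estimate of the theoretical peak bandwidth for this configuration (assuming both memory channels are populated and active, each moving 8 bytes per transfer):

3200 MT/s * 8 B/T * 2 channels = 51,200 MB/s ≈ 51.2 GB/s

Any measurement substantially above ~51 GB/s therefore cannot correspond to real DRAM traffic.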

Results of code

On my machine, I am getting 90 GB/s. Intel Advisor, which runs its own benchmarks, has measured this bandwidth to actually be 25 GB/s. (See my previous question, [Intel Advisor's bandwidth information](https://stackoverflow.com/questions/71421552/intel-advisors-bandwidth-information), where a previous version of this code was getting page faults inside the timed region.)

Assembly: here's a link to the assembly generated for the above code: https://godbolt.org/z/Ma7PY49bE

I am unable to understand how I'm getting such an unreasonably high bandwidth result. Any tips to help facilitate my understanding would be greatly appreciated.

  • You are only writing (not 1 write + 1 read). After the second loop, the write-back cache is probably still full. How do the numbers change when increasing size? Perhaps you can try `wbinvd` before stopping and starting the timer (https://www.felixcloutier.com/x86/wbinvd), but it may continue before finishing emptying the caches. Also, `wbinvd` may not work if your program runs in user mode, as it is a privileged operation. You could alternatively write to write-combined memory. – Sebastian Mar 10 '22 at 20:49
  • 1
    @Sebastian: The buffer size (150MiB) is well above the 9MiB total L3 cache size. Using NT stores is something you could indeed do, but for sizes much greater than L3 cache, you'd expect NT stores to be faster since you pay only for actual writes, not RFOs. ([Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231)). Still, good point that it's worth comparing with that. But I don't recommend `wbinvd`! Very hard to use. Loop over the buffer again with `clflushopt` between timed runs, or just make it even bigger (like 1GiB) to make L3 hits even rarer. – Peter Cordes Mar 10 '22 at 21:01
  • A [previous question](https://stackoverflow.com/questions/71421552/intel-advisors-bandwidth-information) from the same user mentioned in an edit getting the exact same 90GB/s with a smaller 50MiB buffer, so this might not just be a timing artifact. – Peter Cordes Mar 10 '22 at 21:06
  • 1
    @Sebastian: Oh, this is compiled with ICC according to the Godbolt link, and already is using `vmovntpd ymm` NT stores! (Which behaves the same as storing to uncacheable write-combining memory.) – Peter Cordes Mar 10 '22 at 21:09
  • Is the 25 GB/s a CPU or a RAM limitation? What is your hardware configuration? CPU/CPU Clock/Mainboard/Memory/Banks/Memory Frequency/Waitstates? Where is the reference for 25 GB/s? – Sebastian Mar 10 '22 at 21:36
  • @Sebastian: The details about 25 GB/s are in their previous question, which they forgot to link: [Intel Advisor's bandwidth information](https://stackoverflow.com/q/71421552). 25 GB/s is supposedly a run-time measurement (and sounds sensible to me for a desktop coffee lake with unspecified RAM) – Peter Cordes Mar 10 '22 at 21:44
  • *new allocations on each iteration should invalidate all cache-lines from the previous one (to eliminate cache hits),* - you are 100% defeating that goal by storing `data[index] = -1.0;` to the same buffer in a loop right before the timed region. You don't even stride through pages in that loop to just get page faults out of the way, you're touching every byte. And since you're already taking the minimum out of 500 runs, there's no need for a separate warm-up. (Where you loop twice over the same buffer.) – Peter Cordes Mar 10 '22 at 21:48
  • OTOH, freeing and allocating a new buffer could lead to differences in transparent hugepage kernel behaviour, maybe sometimes getting more lucky with the allocation. Still, looping only twice over the same buffer makes it less likely to see a benefit. – Peter Cordes Mar 10 '22 at 21:49
  • Compiling with GCC11.1 `g++ -O3 -ffast-math -march=skylake -fopenmp -std=gnu++17 membench.cpp` (thus not using NT stores) and running on my i7-6700k (DDR4-2666), I get "Minimum duration: 10618.2 us; Maximum bandwidth: 29.6259 GB/s;". (The "Using 8 threads" notice it prints out is obviously incorrect; system load during the run showed only 1 core busy, nearly evenly split between user and kernel, with `perf record` / `perf report` showing 55% of time spent in main, 40% spent in the kernel in `clear_page_erms`.) I don't have ICC set up locally. – Peter Cordes Mar 10 '22 at 22:02
  • @Sebastian (first comment) I assumed the write-back lines in the cache exist in the _same_ cache(s) and will therefore be evicted in order to accommodate incoming read lines. – Nitin Malapally Mar 11 '22 at 08:16
  • @Peter The concept of NT stores is new to me. We could still consider that as part of the DRAM bandwidth because this data is still passed through the same channels to memory, isn't it? Ignoring the fact that it doesn't trickle down the cache hierarchy, of course. – Nitin Malapally Mar 11 '22 at 08:23
  • @PeterCordes (... _you are 100% defeating that goal by storing_ ...) since the data is large, I assumed that the 'tail' of it forces the eviction of the 'waist' of it. Therefore, in its own way, the first touch run clears the caches for the timed run. – Nitin Malapally Mar 11 '22 at 08:26
  • @Sebastian 2x 16 GiB DIMM DDR4 3200 MT/s with a width of 64 bits. Hope that's enough information – Nitin Malapally Mar 11 '22 at 08:47
  • @MotiveHunter: Yes, exactly, the tail of one run evicts the head for the next run, and so on. That's why you don't need to free/realloc/warmup at all, spending large amount of time redoing page-fault costs between actual timed runs. The timed region runs *right* after you just looped over it, so either that's fine or you're screwed. With a 150MiB buffer on 9MiB of L3 cache, it's fine. (It's common to plot observed bandwidth vs. buffer size and see a stair-step effect near cache-size boundaries, with some being less sharp than others.) – Peter Cordes Mar 11 '22 at 08:50
  • @MotiveHunter: Re: NT stores: yes, in fact that's probably the best way to measure actual DRAM write bandwidth, without having it compete against reads due to RFOs. glibc memset will use NT stores IIRC (with an unrolled AVX loop) for large sizes; this manual looping is pretty over-complicated and more a matter of how different compilers vectorize with NT stores or not. – Peter Cordes Mar 11 '22 at 08:53
  • @PeterCordes So why can't I consider it to be 1x read and 1x write? The data is being fetched to read and it is being flushed back to memory through the same pathways. – Nitin Malapally Mar 11 '22 at 08:56
  • You're not reading any data in the source, you're just storing a constant. The only significant reads would be RFOs, an artifact of the cache-coherence mechanism. Contiguous NT stores avoid RFOs. Did you read [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) ? If not, go do that now. – Peter Cordes Mar 11 '22 at 08:58
  • 1
    Your CPU (mentioned in the other question: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0 GHz) [Coffee Lake]) can handle two memory channels. The theoretical maximum rate of the memory would be (if both channels can be used depending on your mainboard and the slots the memory is installed in) is 3200 MT/s * 8 B/T * 2 = 51.200 MB/s. Intel specifies the maximum as 41.6 GB/s, probably with a slower memory speed. – Sebastian Mar 11 '22 at 09:11
  • @Sebastian Yes, that should be the case when I use all 6 cores. In this benchmark, as you can see in the code, the OpenMP directives have been commented out. So we're talking about single thread BW. – Nitin Malapally Mar 11 '22 at 09:18
  • Yes, but independent of the instructions and structure of the CPU and caches, this should be the overall maximum of actually executed memory writes. – Sebastian Mar 11 '22 at 09:27
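
To make the NT-store point from the discussion above concrete: the `vmovntpd` stores that ICC emitted correspond roughly to the following intrinsics version of the fill loop. This is a sketch rather than the compiler's actual output (the function name `fill_nt` is illustrative); it assumes the buffer is 32-byte aligned, which the page-aligned `posix_memalign` allocation guarantees.

#include <immintrin.h> // _mm256_set1_pd, _mm256_stream_pd, _mm_sfence
#include <cstddef>     // std::size_t

// Store 'value' into 'data' using non-temporal (streaming) stores. NT stores
// bypass the cache hierarchy and avoid the read-for-ownership (RFO) traffic
// of regular stores, so only the actual writes reach DRAM.
void fill_nt(double* data, std::size_t size, double value)
{
    const __m256d v = _mm256_set1_pd(value);
    auto index = std::size_t{};
    for (; index + 4 <= size; index += 4)
    {
        _mm256_stream_pd(data + index, v); // requires 32-byte alignment
    }
    for (; index < size; ++index) // scalar tail
    {
        data[index] = value;
    }
    _mm_sfence(); // make the NT stores globally visible before timing ends
}

As noted in the comments, with NT stores the timed loop performs no reads at all, so counting the traffic as "1x load, 1x write" overstates the work done.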

1 Answer


Actually, the question seems to be, "why is the obtained bandwidth so high?", on which I have gotten quite a lot of input from @PeterCordes and @Sebastian. This information needs to be digested in its own time.

I can still offer an auxiliary 'answer' to the topic of interest. By substituting the write operation (which, as I now understand, cannot be properly modeled in a benchmark without delving into the assembly) with a cheap bitwise operation, e.g. XOR, we can prevent the compiler from doing its job a little too well.

Updated code

#include <omp.h>

#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign


int main()
{
    // test-parameters
    const auto size = std::size_t{100 * 1024 * 1024};
    const auto experiment_count = std::size_t{250};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // allocate for exp. data and load the memory pages
    char* data = nullptr;
    // posix_memalign leaves 'data' unspecified on failure, so check its return code
    if (posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size) != 0)
    {
        std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
        std::abort();
    }
    for (auto index = std::size_t{}; index < size; ++index)
    {
        data[index] = 0;
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // run
        const auto dur1 = omp_get_wtime() * 1E+6;
        #pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] ^= 1;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
    }
    
    // deallocate resources
    free(data);
        
    // REPORT
    const auto traffic = size * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
        << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    
    return 0;
}

The benchmark remains a 'naive' one and shall only serve as a rough performance indicator (as opposed to a program which measures the memory bandwidth exactly).

With the updated code, I get 24 GiB/s for a single thread and 37 GiB/s when all 6 cores are involved. Compared to Intel Advisor's measured values of 25.5 GiB/s and 37.5 GiB/s, I think this is acceptable.

@PeterCordes I have retained the warm-up loop so as to do one exactly identical run of the whole procedure, to counteract unknown effects (healthy programmer's paranoia).

Edit: In this case, the warm-up loop is indeed redundant, because the minimum duration across all iterations is being clocked.
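
Relatedly, Peter Cordes's suggestion from the comments of looping over the buffer with `clflushopt` between timed runs (instead of freeing and reallocating) could look roughly like the sketch below. The helper name `flush_buffer` is illustrative; CLFLUSHOPT support and a 64-byte cache line are assumed.

#include <immintrin.h> // _mm_clflushopt, _mm_sfence
#include <cstddef>     // std::size_t

// Hypothetical helper: flush every cache line of the buffer so that the next
// timed run starts with cold caches, without reallocating.
void flush_buffer(const char* data, std::size_t size)
{
    constexpr std::size_t line_size = 64; // cache-line size on Coffee Lake
    for (auto offset = std::size_t{}; offset < size; offset += line_size)
    {
        _mm_clflushopt(data + offset);
    }
    _mm_sfence(); // ensure the flushes complete before the timer starts
}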

  • You never had a "copy" operation, you had a "write" operation like `memset` (one constant stored to every element). A "copy" would involve two arrays, like `memcpy`. But yes, an RMW operation should defeat NT stores and let you measure total read+write bandwidth for that mix. – Peter Cordes Mar 11 '22 at 11:01
  • @PeterCordes Oops, you're right – Nitin Malapally Mar 11 '22 at 11:01
  • The warm-up is pointless when you're taking the min duration across all iterations. Earlier iterations can serve *as* the warm-up. Especially when you already loop over the data once, storing zeros. You'd only need a warm-up if you wanted to time smaller buffers, where you'd want to repeat *inside* the timed region (to make it long enough for good precision) so you need to be warmed up when timing starts. – Peter Cordes Mar 11 '22 at 11:04
  • @PeterCordes Very true – Nitin Malapally Mar 11 '22 at 11:06
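
For completeness, the "copy" mix Peter Cordes distinguishes in the comments above would involve two buffers. A minimal sketch of such a timed kernel, where `src` and `dst` are illustrative names for two separately allocated, equally sized arrays:

#include <cstddef> // std::size_t

// Hypothetical copy kernel: 1x read of src + 1x write of dst per element.
void copy_kernel(const double* src, double* dst, std::size_t size)
{
    #pragma omp simd safelen(8)
    for (auto index = std::size_t{}; index < size; ++index)
    {
        dst[index] = src[index];
    }
}

Note that with regular (cacheable) stores, the writes to dst also incur RFO reads, so the actual DRAM traffic would be closer to 3x the buffer size.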