
My objective

I want to write a RAM bandwidth benchmark.

My solution

#include <vector>

using U8  = unsigned char;
using U32 = unsigned int;
using U64 = unsigned long long int;

U64 benchRead(U64 &io_minLoopTicks) // io_minLoopTicks initial value is U64_MAX
{
    const U32 elementsCount = 100'000'000;
    U64 accumulator = 0;
    std::vector<U64> tab;
    tab.resize(elementsCount); // 100'000'000 * 8 bytes = 800 MB

    for(U8 i = 0; i < 10; ++i) // Repeat to get a reliable (minimum) timing.
    {
        // profiler is declared elsewhere in my code base; start()/end()
        // wrap QueryPerformanceCounter under the hood.
        const U64 startTimestamp = profiler.start();

        for(const U64 j : tab)
            accumulator += j; // Touch the RAM so the read is not optimized away.

        const U64 loopTicks = profiler.end() - startTimestamp;

        if(loopTicks < io_minLoopTicks)
            io_minLoopTicks = loopTicks;
    }

    return accumulator;
}
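
For reference, here is a minimal sketch of how the minimum tick count can be turned into a GB/s figure (my real driver is in the pastebin linked in the comments; this assumes the profiler ticks are raw QueryPerformanceCounter ticks, so QueryPerformanceFrequency gives the tick rate, and the printf formatting is just for illustration):

#include <cstdio>
#include <limits>
#include <windows.h>

// Sketch of a driver for benchRead() above.
int main()
{
    U64 minLoopTicks = std::numeric_limits<U64>::max(); // U64_MAX
    const U64 checksum = benchRead(minLoopTicks); // returned sum keeps the reads alive

    LARGE_INTEGER frequency;
    QueryPerformanceFrequency(&frequency); // QueryPerformanceCounter ticks per second

    const double seconds   = double(minLoopTicks) / double(frequency.QuadPart);
    const double bytesRead = 100'000'000.0 * sizeof(U64); // 800 MB per pass
    std::printf("checksum = %llu, bandwidth = %.2f GB/s\n",
                checksum, bytesRead / seconds / 1e9);
    return 0;
}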

My problem

The theory would be 57.6 GB/s (8 bytes (bus width) * 2 (channels) * 3.6 GT/s).
In practice I get 27.7 GB/s (9700K @ 4.8 GHz, 4 single-rank 3600 MT/s DDR4 DIMMs).
So my number seems to reflect the performance of only one channel.
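
Spelled out, here is the arithmetic behind those figures (a sketch of my back-of-the-envelope reasoning, written as compile-time constants):

// Back-of-the-envelope arithmetic behind the figures above.
constexpr double bytesPerTransfer = 8.0;    // 64-bit channel width
constexpr double channels         = 2.0;    // dual channel
constexpr double transfersPerSec  = 3.6e9;  // DDR4-3600 -> 3.6 GT/s

constexpr double theoretical = bytesPerTransfer * channels * transfersPerSec; // 57.6e9 B/s
constexpr double measured    = 27.7e9;                                        // benchmark result
// measured / theoretical ~ 0.48, almost exactly half, i.e. one channel's worth.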

My guess is that, the way I coded it, the array sits in a region of RAM that only one channel can access. I am not familiar with multi-channel memory at all, so I don't really understand the limits of that technology.

My RAM is definitely dual channel (CPU-Z confirmed it).

Performance analysis

I checked the output of Clang 14 (-O3, without -march=native) and it uses SSE. I could be wrong (I am not an ASM guy), but it seems the compiler unrolled the loop so that each iteration reads 128 bytes. One iteration is 4.5 cycles on Coffee Lake according to uiCA, leading to 139.38 GB/s, way above the RAM's theoretical speed. So we should be RAM limited.
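
The core-side bound works out as follows (a sketch; the exact figure depends on which clock you plug in, and at my 4.8 GHz it lands slightly below the number above, but the conclusion is the same):

// Core-side throughput bound: bytes per iteration / cycles per iteration * clock.
constexpr double bytesPerIter  = 128.0;  // unrolled SSE loop, per the Godbolt output
constexpr double cyclesPerIter = 4.5;    // uiCA estimate for Coffee Lake
constexpr double coreClockHz   = 4.8e9;  // my 9700K at 4.8 GHz

constexpr double coreBound = bytesPerIter / cyclesPerIter * coreClockHz; // ~136.5e9 B/s
// Far above 57.6 GB/s either way, so DRAM, not the core, should be the bottleneck.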

My questions

  1. Is my array stored in one bank, and is that why it can only be accessed by one channel?
  2. How do we write code that leverages the benefits of multi-channel RAM (here dual channel)?

References

Godbolt

  • You should probably read the answer to [this question](https://stackoverflow.com/q/43343231/555045), that question does not directly have any relation to this one but parts of the answer can apply. Anyway one of the take-aways is that the full bandwidth may not be available to one core (although a core is not inherently limited to using only one channel and that seems like a coincidence here). – harold Nov 30 '22 at 22:49
  • Is that really your benchmark loop? You allocate new memory and zero it, paying for page faults. And then you read it once. You're timing this whole function? See [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) - put a repeat loop around the reading. (And maybe use `asm("" ::: "memory")` in the outer loop between repeats to stop the compiler from hoisting the sum out of the loop; see the sketch after these comments.) – Peter Cordes Nov 30 '22 at 22:59
  • Yes, Coffee Lake (Skylake core) has enough load buffer entries (72) to keep all 12 LFBs occupied even with 4 vector loads per cache line. (https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Memory_subsystem) And the hardware prefetcher in L2 will run ahead of that, keeping the superqueue full (16 entries IIRC). In practice, SSE2 usually benchmarks pretty close to AVX2 for throughput, like 90% or more, for simple loops where data is coming all the way from DRAM, no cache hits. – Peter Cordes Nov 30 '22 at 23:06
  • How are you measuring bandwidth? Inferring it from timing this function, or measuring it directly at the memory controllers, e.g. with performance events, or with `intel_gpu_top` (on Linux, unfortunately only works if you're using the iGPU, even though it does measure the integrated memory controllers which the CPU cores also use.) – Peter Cordes Nov 30 '22 at 23:08
  • @PeterCordes No no, sorry, I just kept the hot loop for the question, but in the actual code I measure the inner loop with QueryPerformanceCounter, and I do it 10 times in a bigger loop and keep the lowest value. And the array is allocated outside of the bigger loop, so it is never measured (hopefully?). I would not measure the function above entirely, because of the memory allocation. And funny enough, it doesn't zero it. I actually have trash values in it somehow. – Scr3amer Dec 01 '22 at 05:13
  • @PeterCordes I added the profiling code, 99% like the way I do it. The real one is this one, but there is a lot of missing context for the objects; hopefully it is still readable: https://pastebin.com/JP2MBLb9 I am going to read your links and double-check the ASM output of my actual code regarding the compiler barrier. – Scr3amer Dec 01 '22 at 05:28
  • Is `U64` some custom struct or something? `std::vector::resize()` would zero it when it default-constructs the elements if it's `uint64_t`. – Peter Cordes Dec 01 '22 at 06:01
  • No, `using U64 = unsigned long long int;`. I will double-check, maybe I missed something else (maybe a bug in my libc++). Is it a typo in your comment, or does Coffee Lake have 12 LFBs compared to 10 on Skylake? (I read 10 on the WikiChip diagram.) – Scr3amer Dec 01 '22 at 06:05
  • @Scr3amer: Wikichip missed that HSW to SKL change. Skylake has 12 LFBs. Looking for a source for that info; might have been experimentally determined by Travis Downs (@Beeonrope). Found it: https://github.com/Kobzol/hardware-effects/issues/1 – Peter Cordes Dec 01 '22 at 06:17
  • Understood, thanks for the LFB link. Super interesting. Also, what does hoisting mean? I checked the translation (I'm French); I thought I understood, but after reading the SO thread you linked about how to optimize, I now have doubts. It is taking the value out of the loop and ....? I am missing something about what it means and its impact. – Scr3amer Dec 01 '22 at 08:30
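
Following Peter Cordes's suggestion in the comments above, here is a minimal sketch of the repeat-loop-plus-compiler-barrier pattern (my assumptions: GCC/Clang extended-asm syntax; `sumRepeated` is a hypothetical name; MSVC would need a different barrier):

#include <vector>

using U64 = unsigned long long int;

U64 sumRepeated(const std::vector<U64> &tab, int repeats)
{
    U64 accumulator = 0;
    for(int r = 0; r < repeats; ++r)
    {
        for(const U64 j : tab)
            accumulator += j;

        // Empty asm with a "memory" clobber: tells the compiler that all of
        // memory may have changed, so it must re-read tab on every pass and
        // cannot hoist the summation out of the repeat loop.
        asm volatile("" ::: "memory");
    }
    return accumulator;
}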
