Achieve maximum bandwidth on arm64

Question

I am trying to achieve near-maximum memory bandwidth on my system, where the theoretical maximum is 25.5 GB/s with one DDR channel and 4 cores.

I tried running the following stress-ng benchmark:

    ./stress-ng --taskset 0xf --memrate 1 --memrate-wr-mbs 50000 --memrate-rd-mbs 30000 -t 60

But the maximum bandwidth I see is around 11000 MB/s, which is less than 50% of the theoretical maximum.

I also found this blog post about achieving maximum memory bandwidth:

https://codearcana.com/posts/2013/05/18/achieving-maximum-memory-bandwidth.html

It uses the following function:

    void write_memory_rep_stosq(void* buffer, size_t size) {
       // size in bytes, assumed to be a multiple of 8
       // fixed relative to the blog's version: read-write operands and a "memory" clobber
       size_t count = size / 8;
       // no cld needed: the ABI guarantees DF=0 on function entry
       asm volatile("rep stosq"
        : "+D" (buffer), "+c" (count)   // rep stosq modifies RDI and RCX
        : "a" (0UL)                     // RAX = 0, the value being stored
        : "memory");                    // the asm writes memory the compiler can't see
    }

When I run it, I get results that are really close to peak bandwidth, thanks to modern x86 features like ERMSB, which handles this with optimized microcode.

    $ ./memory_profiler
    write_memory_rep_stosq: 20.60 GiB/s

But this is for x86_64. Is there an equivalent instruction for ARM64?

There's no single instruction. There are certainly efficient ways to copy memory, but it's almost certain that the smart people who wrote `memcpy` know about them. Have you tried a simple `memcpy` to see how much bandwidth you get that way? I'd be a bit surprised if handwritten assembly can do better. — Nate Eldredge, Mar 17 '22 at 16:02
@NateEldredge: In kernel code, SIMD registers aren't usable without big save/restore overhead. For aligned memset (e.g. clearing a page), `rep stos` does a good job, so it's a good thing that modern CPUs put some effort into making its microcode efficient! But if you're talking about ARM64, then yeah you'd need a loop with `stp`. — Peter Cordes, Mar 17 '22 at 16:11
Does `stress-ng` test with multiple threads? If not, one CPU core may not have enough memory-level parallelism to keep enough stores in flight to max out all the memory controllers. (Like [on a big Xeon](https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug)) — Peter Cordes, Mar 17 '22 at 16:17
@PeterCordes, `--memrate` specifies the number of workers that exercise memory reads/writes. Is it a good idea to run multiple workers here? Also, I have just one DRAM channel. — Milan, Mar 17 '22 at 16:21
Oh, I misread that when skimming, I was thinking 4 channels. IDK, 1 channel of fast DRAM might or might not be more than 1 ARM core could saturate. Depends a lot on the interconnect between cores, how fancy each core is, and whether there's any HW prefetching closer to DRAM (e.g. in an L2 cache). Definitely worth trying with more threads to see how it scales. — Peter Cordes, Mar 17 '22 at 17:10
@PeterCordes, I did try stress-ng with multiple workers, but it didn't give me any better results. Is there something else I can try (some specific program or something)? — Milan, Mar 18 '22 at 08:10
