I am currently benchmarking and optimizing a program that makes heavy use of rdrand instructions. While looking for suspected performance penalties from misaligned loads/stores, I noticed an excessively high value of the ls_misal_loads.ma64 (64-byte misaligned loads) performance counter, which clearly wasn't caused by the program's memory accesses alone. In fact, the value seemed to depend directly on the number of rdrand instructions executed.
Furthermore, perf reported a very high number of data cache accesses, which also seem to be caused by rdrand.
Take the following minimal example program (rd.asm):
bits 64
global main
main:
mov rdx, 1000000 ; counter
.loop:
rdrand rax ; request a 64-bit random number (success flag CF not checked)
dec rdx
jne .loop
ret
Assemble and link with
nasm -f elf64 rd.asm
gcc rd.o
Then
perf stat -e instructions,all_data_cache_accesses,ls_misal_loads.ma64 -- ./a.out
yields for counter = 1,000,000:
Performance counter stats for './a.out':
3,666,525 instructions
24,422,483 all_data_cache_accesses
3,022,185 ls_misal_loads.ma64
...and for counter = 2,000,000:
Performance counter stats for './a.out':
6,695,889 instructions
48,458,162 all_data_cache_accesses
6,016,069 ls_misal_loads.ma64
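Taking the difference between the two runs, that works out to roughly (48,458,162 - 24,422,483) / 1,000,000 ≈ 24 extra data cache accesses and (6,016,069 - 3,022,185) / 1,000,000 ≈ 3 extra 64-byte misaligned loads per additional rdrand (the instruction count grows by about 3 per extra loop iteration, as expected for rdrand, dec, jne).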
So, doubling the number of executed rdrand instructions seems to double the number of data cache accesses and misaligned loads.
The measurements were done on an AMD EPYC 7763 CPU.
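For completeness, roughly the same loop can presumably be reproduced from C via the _rdrand64_step intrinsic; this is only a sketch of an equivalent reproducer (the file name rd.c and the build flags are my assumptions), not the program used for the measurements above:
/* rd.c - hypothetical C equivalent of rd.asm.
 * Build with e.g.: gcc -O2 -mrdrnd rd.c */
#include <immintrin.h>

int main(void) {
    unsigned long long r = 0, sum = 0;
    for (long i = 0; i < 1000000; i++) {
        _rdrand64_step(&r);   /* executes one rdrand instruction */
        sum += r;             /* use the result so the loop cannot be discarded */
    }
    return (int)(sum & 0xff);
}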
My questions:
- What is going on here? Why does rdrand (seem to) produce data cache accesses, even though it is supposed to be implemented entirely on the CPU, without touching memory?
- Can the high value of this performance counter be dismissed as a measurement artifact, or does it imply a further performance penalty beyond the one caused by the latency of rdrand itself?