I am currently benchmarking and optimizing a program that makes heavy use of rdrand instructions. While looking for suspected performance penalties from misaligned loads/stores, I noticed an excessively high value of the ls_misal_loads.ma64 (64-byte misaligned loads) performance counter, which clearly wasn't caused by the program's memory accesses alone. In fact, the value seemed to depend directly on the number of rdrand instructions executed.
Furthermore, perf reported a very high number of data cache accesses, which also seem to be caused by rdrand.
Take the following minimal example program (rd.asm):
bits 64
global main
main:
mov rdx, 1000000 ; counter
.loop:
rdrand rax ; request a 64-bit random number (success flag CF not checked)
dec rdx
jne .loop
ret
Assemble and link with
nasm -f elf64 rd.asm
gcc rd.o
Then
perf stat -e instructions,all_data_cache_accesses,ls_misal_loads.ma64 -- ./a.out
yields for counter = 1,000,000:
Performance counter stats for './a.out':
3,666,525 instructions
24,422,483 all_data_cache_accesses
3,022,185 ls_misal_loads.ma64
...and for counter = 2,000,000:
Performance counter stats for './a.out':
6,695,889 instructions
48,458,162 all_data_cache_accesses
6,016,069 ls_misal_loads.ma64
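Taking the difference between the two runs, that works out to roughly (48,458,162 - 24,422,483) / 1,000,000 ≈ 24 extra data cache accesses and (6,016,069 - 3,022,185) / 1,000,000 ≈ 3 extra 64-byte misaligned loads per additional rdrand (the instruction count grows by about 3 per extra loop iteration, as expected for rdrand, dec, jne).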
So, doubling the number of executed rdrand instructions seems to double the number of data cache accesses and misaligned loads.
The measurements were done on an AMD EPYC 7763 CPU.
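For completeness, roughly the same loop can presumably be reproduced from C via the _rdrand64_step intrinsic; this is only a sketch of an equivalent reproducer (the file name rd.c and the build flags are my assumptions), not the program used for the measurements above:
/* rd.c - hypothetical C equivalent of rd.asm.
 * Build with e.g.: gcc -O2 -mrdrnd rd.c */
#include <immintrin.h>

int main(void) {
    unsigned long long r = 0, sum = 0;
    for (long i = 0; i < 1000000; i++) {
        _rdrand64_step(&r);   /* executes one rdrand instruction */
        sum += r;             /* use the result so the loop cannot be discarded */
    }
    return (int)(sum & 0xff);
}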
My questions:
- What is going on here? Why does rdrand (seem to) produce data cache accesses, even though it is supposed to be implemented entirely on the CPU, without touching memory?
- Can the high value of this performance counter be dismissed as a measurement artifact, or does it imply a further performance penalty beyond the one caused by the latency of rdrand itself?