Use int _rdrand64_step (unsigned __int64* val)
from immintrin.h
instead of writing inline asm. You don't need it, and there are many reasons (including this one) to avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm
In this case, the problem is that you're probably compiling 32-bit code, so of course 64-bit rdrand
is not encodeable. But the way you used inline-asm ended up giving you a 32-bit rdrand
, and storing garbage from another register for the high half.
gcc -Wall -O3 -m32 -march=ivybridge
(and similar for clang) produces (on Godbolt):
In function 'rdrand64_step':
7 : <source>:7:1: warning: unsupported size for integer register
rdrand64_step:
push ebx
rdrand ecx; setc al
mov edx, DWORD PTR [esp+8] # load the pointer arg
movzx eax, al
mov DWORD PTR [edx], ecx
mov DWORD PTR [edx+4], ebx # store garbage in the high half of *rand
pop ebx
ret
I guess you called this function with a caller that happened to have ebx=0
. Or else you used a different compiler that did something different. Maybe something else happens after inlining. If you looked at disassembly of what you actually compiled, you could explain exactly what's going on.
If you'd used the intrinsic, you would have gotten error: '_rdrand64_step' was not declared in this scope
, because immintrin.h
only declares it in 64-bit mode (and with a -march
setting that implies rdrand support. Or [-mrdrnd
]3. Best option: use -march=native
if you're building on the target machine).
You'd also get significantly more efficient code for a retry loop, at least with clang:
unsigned long long use_intrinsic(void) {
unsigned long long rand;
while(!_rdrand64_step(&rand)); // TODO: retry limit in case RNG is broken.
return rand;
}
use_intrinsic: # @use_intrinsic
.LBB2_1: # =>This Inner Loop Header: Depth=1
rdrand rax
jae .LBB2_1
ret
That avoids setcc and then testing that, which is of course redundant. gcc6 has syntax for returning flag results from inline asm. You can also use asm goto
and put a jcc inside the asm, jumping to a label: return 1;
target or falling through to a return 0
. (The inline-asm docs have an example of doing this. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. See also the inline-assembly tag wiki.)
Using your inline-asm, clang (in 64-bit mode) compiles it to:
use_asm:
.LBB1_1:
rdrand rax
setb byte ptr [rsp - 1]
cmp byte ptr [rsp - 1], 0
je .LBB1_1
ret
(clang makes bad decisions for constraints with multiple options that include memory.)
gcc7.2 and ICC17 actually end up with better code from the asm than from the intrinsic. They use cmovc
to get a 0 or 1 and then test
that. It's pretty dumb. But that's a gcc/ICC missed optimization that will hopefully be.