Sadly, shrd is horribly slow (3 clock latency on a range of devices). Taking sn_a as the shrd version:
  lea    0x1(%rdi),%rax         # sn_a: %rax = n + 1
  imul   %rdi                   # %rdx:%rax = n * (n + 1), 128 bit
  shrd   $0x1,%rdx,%rax         # %rax = ls 64 bits of n * (n + 1) / 2
                                # if you want %rdx:%rax need shr $1, %rdx here
  retq
and sn_b as my suggested alternative:
  lea    0x1(%rdi),%rax         # sn_b: %rax = n + 1
  or     $0x1,%rdi              # %rdi = n if n is odd, n + 1 if n is even
  shr    %rax                   # %rax = (n + 1) >> 1
  imul   %rdi                   # %rdx:%rax is 128 bit result
  retq
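In C the two approaches look roughly like this -- just a sketch, using the GCC/Clang unsigned __int128 extension for the 128 bit product, and with names sigma_a/sigma_b of my own rather than anything from the code above:

  #include <stdint.h>

  /* sn_a style: form the full 128 bit product, then shift the whole thing
   * right by one -- which is what the shrd does for the ls 64 bits
   */
  static uint64_t sigma_a(uint64_t n)
  {
    unsigned __int128 p = (unsigned __int128)n * (n + 1) ;
    return (uint64_t)(p >> 1) ;
  }

  /* sn_b style: halve whichever of n and n + 1 is even before multiplying,
   * so only a 64 bit shift is needed -- this returns just the ls 64 bits,
   * where sn_b also leaves the ms 64 bits in %rdx
   */
  static uint64_t sigma_b(uint64_t n)
  {
    return ((n + 1) >> 1) * (n | 1) ;
  }

Both reduce to n * (n + 1) / 2: when n is odd, (n + 1) >> 1 and n | 1 are (n + 1) / 2 and n; when n is even, they are n / 2 and n + 1.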
And the (largely) empty sn_e:
mov %rdi,%rax # sn_e
retq
I got the following clock counts, per iteration of the timing loop (see below):
            Ryzen 7    i7 (Coffee-Lake)
  sn_a:      11.00         11.00
  sn_b:       8.05          8.27   -- yay :-)
  sn_e:       5.00          5.00
I believe that:
              Ryzen 7                i7 Coffee-Lake
         latency  throughput    latency  throughput
  shrd      3        1/3           3        1/3
  imul      3        1/2           3        1/1    -- 128 bit result
  imul      2        1/2           3        1/1    -- 64 bit result
where throughput is instructions/clocks. I believe the 128 bit imul delivers the least significant 64 bits one clock earlier or, equivalently, the most significant 64 bits one clock later.
I think what we see in the timings is: -3 clocks from removing the shrd, +1 clock for the shr $1 and or $1 (which run in parallel), and -1 clock because the result no longer waits for %rdx (the ms 64 bits of the imul). That nets out at -3 clocks, matching the drop from 11.00 to about 8.
Incidentally, both sn_a and sn_b return 0 for UINT64_MAX ! Mind you, the result overflows uint64_t way earlier than that !
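Both points are easy to check -- a throwaway sketch (again using unsigned __int128; the overflow boundary shown is my own arithmetic, so treat it as approximate):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
    uint64_t n = UINT64_MAX ;

    /* n + 1 wraps to 0, so the sn_b style calculation gives 0 */
    printf("sigma(UINT64_MAX) = %llu\n",
                      (unsigned long long)(((n + 1) >> 1) * (n | 1))) ;

    /* the true n * (n + 1) / 2 stops fitting in a uint64_t once
     * n * (n + 1) >= 2^65, i.e. from roughly n = 2^32.5, about 6.07e9
     */
    for (uint64_t m = 6074000998 ; m <= 6074001001 ; m++)
      {
        unsigned __int128 s = (unsigned __int128)m * (m + 1) / 2 ;
        printf("n = %llu : fits = %d\n",
                      (unsigned long long)m, s <= (unsigned __int128)UINT64_MAX) ;
      }

    return 0 ;
  }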
FWIW, my timing loop looks like this:
  uint64_t n ;
  uint64_t r ;
  uint64_t m ;
  m = zz ;                        // static volatile uint64_t zz = 0
  r = 0 ;
  n = 0 ;
  qpmc_read_start(...) ;          // magic to read rdpmc
  do
    {
      n += 1 ;
      r += sigma_n(n + (r & m)) ;
    }
  while (n < 1000000000) ;
  qpmc_read_stop(....) ;          // magic to read rdpmc
Where the + (r & m) sets up a dependency, so that the input to the function being timed depends on the result of the previous call. The r += collects a result which is later printed -- which helps persuade the compiler not to optimize away the loop.
The loop compiles to:
  <sigma_timing_run+64>:          // 64 byte aligned
    mov    %r12,%rdi
    inc    %rbx
    and    %r13,%rdi
    add    %rbx,%rdi
    callq  *%rbp
    add    %rax,%r12
    cmp    $0x3b9aca00,%rbx
    jne    <sigma_timing_run+64>
Replacing the + (r & m) by + (n & m) removes the dependency, but the loop is:
  <sigma_timing_run+64>:          // 64 byte aligned
    inc    %rbx
    mov    %r13,%rdi
    and    %rbx,%rdi
    add    %rbx,%rdi
    callq  *%rbp
    add    %rax,%r12
    cmp    $0x3b9aca00,%rbx
    jne    0x481040 <sigma_timing_run+64>
which is essentially the same as the loop with the dependency, but the timings are:
            Ryzen 7    i7 (Coffee-Lake)
  sn_a:       5.56          5.00
  sn_b:       5.00          5.00
  sn_e:       5.00          5.00
With no dependency from one call to the next, the out-of-order machinery overlaps successive iterations, and the multiply and shift latencies disappear into the ~5 clocks of call-and-loop overhead. Are these devices wonderful, or what ?