
This is my assembly code to calculate the sum rax = 1 + 2 + 3 + ... + rdi using the Gauss formula rax = (rdi + 1) * rdi / 2. Does anyone have an idea, or know of an Intel instruction, to further reduce the number of cycles needed? Remark: the nop is for alignment.

nop                     ; padding, for alignment (see remark above)
lea rax, [rdi + 1]      ; rax = rdi + 1
imul rdi                ; rdx:rax = rax * rdi (full 128-bit product)
shrd rax, rdx, 1        ; rax = (rdx:rax) >> 1, truncated to 64 bits
ret
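
For reference, the same computation in C (a sketch only; the name gauss_sum is illustrative, not part of the asm): widening to unsigned __int128 avoids overflow in (n + 1) * n, and per the comments below GCC and clang compile this form to the same lea/imul/shrd sequence.

  #include <stdint.h>

  uint64_t gauss_sum(uint64_t n)
  {
      unsigned __int128 p = (unsigned __int128)(n + 1) * n ;  /* 64x64 -> 128-bit product */
      return (uint64_t)(p >> 1) ;                             /* n * (n + 1) / 2          */
  }
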
  • If you're not looping, aligning stuff with `nop` is not generally helpful; if the function entry point is 16-byte aligned, your `ret` shouldn't cause any uop-cache penalties even with the JCC erratum (if it hurts ret?). But yes, `shrd` looks like a reasonable way to avoid overflow compared to what `clang` does (https://godbolt.org/z/94Y8VJ). Are you aiming for front-end throughput, lower critical-path latency, or reducing contention for some execution unit? "cycles" is not a useful way to measure performance for this few instructions on modern OoO exec CPUs. What microarchitecture? Skylake? – Peter Cordes May 03 '20 at 17:19
  • The thing to optimize here would be the call/ret overhead! Make this an assembler macro so it can inline, or write it in C with `__int128` to get the compiler to do a widening imul and `shrd`. Yeah, GCC and clang both emit those instructions if you write it with `__int128` in C: https://godbolt.org/z/53BE1A – Peter Cordes May 03 '20 at 17:24
  • Someone else asked a similar [question on reddit](https://www.reddit.com/r/asm/comments/ga8onx/efficient_multiplication/) a few days ago. Are you in the same class? – fuz May 03 '20 at 17:33
  • If you are allowed to tweak the input and output format, using the LSb as the mantissa (i.e. value 1/2) would allow you to perform the division and the increment in parallel. Then the multiplication would give you a number with 2 bits of mantissa (1/2 and 1/4). This, of course, moves the cost to the caller – Margaret Bloom May 03 '20 at 17:44
  • Do you care about the upper 64 bits, or are you only doing the 64x64->128-bit multiplication to ensure you can shift down the full result to obtain a correct 64-bit result? If you don't care about the upper 64 bits there is a wonderful trick using only fast bit-twiddling to ensure a 64x64->64 multiplication gives the same low 64 bit result without a (slow) 128-bit shift. – EOF May 03 '20 at 21:09
  • @EOF: `shrd r64, r64, imm8` is only 1 uop, 3 cycle latency, on Sandybridge-family. It's unfortunately 6 uops on Zen / Zen2. https://www.uops.info/html-instr/SHRD_IMMB_R64_R64_I8.html. But full-mul is slightly slower, too, so it could be worth a couple instructions to right-shift the even number to set up for `imul r64, r64`. What did you have in mind? Note that `shrd` only writes its destination=first operand so this function is *not* doing an extended-precision right shift of the full 128-bit result; that would need `shr rdx, 1` after the `shrd`. – Peter Cordes May 04 '20 at 02:23
  • @PeterCordes Seems the idea is already presented in the answer to this question. I'd have given `uint64_t gauss(uint64_t a) { return (a+1>>1)*(a|1); }` and written a bit about why it works, but I suppose it's obvious enough for an answer already. – EOF May 04 '20 at 16:10
  • @EOF: Yup, thanks, I wasn't familiar with that trick, or at least didn't remember it, but on seeing the answer figured it must have been the trick you were talking about. Neat. Clang should be doing that! – Peter Cordes May 04 '20 at 16:17
  • @PeterCordes BTW, `gcc` generates the 2-operand 64x64->64 -bit `imul` for my c-code. In addition to fewer uOps, I'd also expect the more flexible choice of output register to be advantageous once the function gets inlined. – EOF May 04 '20 at 16:36

1 Answer

Sadly, shrd is horribly slow (3 clock latency on a range of devices).

Taking sn_a as the shrd version:

   lea    0x1(%rdi),%rax       # sn_a
   imul   %rdi
   shrd   $0x1,%rdx,%rax
# if you want the full %rdx:%rax result, you also need shr $1, %rdx here
   retq   

and sn_b as my suggested alternative:

   lea    0x1(%rdi),%rax       # sn_b
   or     $0x1,%rdi
   shr    %rax
   imul   %rdi                 # %rdx:%rax is 128 bit result
   retq   
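
In C terms, sn_b computes ((n + 1) >> 1) * (n | 1) -- the same trick EOF gives in the comments on the question. Exactly one of n and n + 1 is even, and the shr/or pair halves that even factor up front, so a plain 64x64->64 multiply produces the same low 64 bits as sn_a's 128-bit product shifted right by one -- no 128-bit shift needed. A sketch (the name sn_b_c is illustrative):

   #include <stdint.h>

   uint64_t sn_b_c(uint64_t n)
   {
       return ((n + 1) >> 1) * (n | 1) ;   /* same low 64 bits as sn_a; 0 for UINT64_MAX */
   }

As noted in the comments, gcc compiles this form to the 2-operand 64x64->64 imul, which also saves the uop that writes %rdx.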

And the (largely) empty sn_e:

   mov    %rdi,%rax            # sn_e
   retq   

I got the following clock counts, per iteration of the timing loop (see below):

        Ryzen 7   i7 (Coffee-Lake)
  sn_a:  11.00     11.00
  sn_b:   8.05      8.27      -- yay :-)
  sn_e:   5.00      5.00

I believe that:

            Ryzen 7             i7 Coffee-Lake
        latency throughput   latency throughput  
  shrd     3       1/3          3       1/3
  imul     3       1/2          3       1/1  -- 128 bit result
  imul     2       1/2          3       1/1  --  64 bit result

where throughput is instructions per clock. I believe the 128-bit imul delivers the low 64 bits one clock earlier than the high 64 bits -- equivalently, the high half arrives one clock later.

I think what we see in the timings is: -3 clocks from removing the shrd, +1 clock for the shr $1 and or $1 (which run in parallel), and -1 clock from not needing the (later) %rdx half of the result.

Incidentally, both sn_a and sn_b return 0 for UINT64_MAX! Mind you, the result overflows uint64_t way earlier than that (from about rdi = 6.07e9 onwards)!
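
That 0 falls out of the n + 1 step: in 64-bit arithmetic UINT64_MAX + 1 wraps to 0, so one factor of the product is 0 in both versions. A minimal sketch of the two paths (demo values only, not part of the timed code):

  uint64_t n  = UINT64_MAX ;
  uint64_t a  = n + 1 ;                                         /* wraps to 0             */
  uint64_t sa = (uint64_t)(((unsigned __int128)a * n) >> 1) ;   /* sn_a path: 0           */
  uint64_t sb = (a >> 1) * (n | 1) ;                            /* sn_b path: 0 * n == 0  */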


FWIW, my timing loop looks like this:

  uint64_t  n ;
  uint64_t  r ;
  uint64_t  m ;

  m = zz ;                      // static volatile uint64_t zz = 0
  r = 0 ;
  n = 0 ;

  qpmc_read_start(...) ;        // magic to read rdpmc 
  do
    {
      n += 1 ;
      r += sigma_n(n + (r & m)) ;
    }
  while (n < 1000000000) ;

  qpmc_read_stop(....) ;        // magic to read rdpmc 

Where the + (r & m) sets up a dependency, so that the input to the function being timed depends on the result of the previous call. (Since m is loaded from a volatile zero, r & m is always 0 and doesn't change the values passed in, but the compiler can't know that, so the dependency chain stays.) The r += collects a result which is later printed -- which helps persuade the compiler not to optimize away the loop.

The loop compiles to:

<sigma_timing_run+64>:          // 64 byte aligned
   mov    %r12,%rdi
   inc    %rbx
   and    %r13,%rdi
   add    %rbx,%rdi
   callq  *%rbp
   add    %rax,%r12
   cmp    $0x3b9aca00,%rbx
   jne    <sigma_timing_run+64>

Replacing the + (r & m) by + (n & m) removes the dependency, but the loop is:

<sigma_timing_run+64>:          // 64 byte aligned
   inc    %rbx
   mov    %r13,%rdi
   and    %rbx,%rdi
   add    %rbx,%rdi
   callq  *%rbp
   add    %rax,%r12
   cmp    $0x3b9aca00,%rbx
   jne    0x481040 <sigma_timing_run+64>

which contains the same instructions as the loop with the dependency, but the timings are:

        Ryzen 7   i7 (Coffee-Lake)
  sn_a:   5.56      5.00
  sn_b:   5.00      5.00
  sn_e:   5.00      5.00

Are these devices wonderful, or what?

Chris Hall
  • *for rdx:rax ... need `shr $1, %rdx` here -- for an extra 1 clock latency* - no, it's independent of `shrd`. It can start before `shrd` finishes. Potentially starting on the same clock cycle, on CPUs with shift ALUs on 2 different ports. It doesn't make the critical path latency worse, and really it's a separate output. – Peter Cordes May 04 '20 at 14:17
  • And BTW, the real risk with SHRD is that it's 6 uops on Zen-family. Intel's single-uop version is mostly fine for throughput (although being a 3-cycle latency integer uop, it can only run on port 1 competing with `imul`, unlike other shifts). Your timing loop creates a fairly artificially long dependency chain that hides most throughput differences. Speaking of which, `imul %rdi, %rax` is single-uop because it only has to write one register. You should definitely use that unless you really want the full 128-bit result. On Zen1 that has twice the throughput of `imul r64`. – Peter Cordes May 04 '20 at 14:35
  • https://www.uops.info/table.html?search=imul%20(r64&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_CON=on&cb_NHM=on&cb_SNB=on&cb_HSW=on&cb_SKX=on&cb_ICL=on&cb_ZEN%2B=on&cb_ZEN2=on&cb_measurements=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_sse=on – Peter Cordes May 04 '20 at 14:36
  • @PeterCordes: I timed the 2 operand `imul` on my Ryzen 7... came out the same. Without the dependency on `r` in the loop, all versions give the same timing as the empty function. For timings like this I think one needs a "stand in" for the consumer of the result of the function. Feeding the result back into the loop seems to do that. Of course, the actual effect of different versions of the function is going to depend on the context(s) in which the function is used. So yes, the timing loop is artificial. If there is a better way I shall be happy to learn. – Chris Hall May 04 '20 at 15:26
  • Yeah, that's expected. One-operand imul costs an extra uop to write the high half result, but the low-half result is still produced the same way. Since you're only testing latency, not front-end throughput, you can't detect the cost of that extra uop. call/ret overhead makes it pretty hard to tell anything. I'd normally unroll this a couple times inside an asm loop, and definitely not with call/ret. e.g. see my answer on [RDTSCP in NASM always returns the same value](https://stackoverflow.com/q/54621381) – Peter Cordes May 04 '20 at 15:33