
I would like to write fast code for adding 64-bit numbers in big ints:

uint64_t ans[n];
uint64_t a[n], b[n]; // assume initialized values....
for (int i = 0; i < n; i++)
  ans[i] = a[i] + b[i];

but the above does not work with carry.

I saw another question here that suggested using a comparison to detect the carry, which is elegant:

ans[0] = a[0] + b[0];
int c = ans[0] < a[0];
for (int i = 1; i < n; i++) {
  ans[i] = a[i] + b[i] + c;
  c = ans[i] < a[i];
}

However, I would like to learn how to embed inline (Intel) assembly and do it faster. I am sure there are 64-bit opcodes, the equivalent of:

add eax, ebx
adc ...

but I don't know how to pass parameters to the assembler from the rest of the C++ code.

– Dov
  • Which compiler are you using? MSVC at least does not support 64-bit inline assembly – Govind Parmar Dec 09 '16 at 00:31
  • gcc (or clang if need be) – Dov Dec 09 '16 at 00:34
  • That carry variant (if the add line is fixed to `ans[i] = a[i] + b[i] + c;`) is not correct: `a[] = {1, 0, 0}; b[] = { ~0, ~0, 0 };` => should result into `{ 0, 0, 1 }`, but it will end as `{ 0, 0, 0 }`. ... so.. maybe elegant, but not correct. (maybe you didn't copy it from that other question accurately?) – Ped7g Dec 09 '16 at 01:10
  • Consider using [Intel Intrinsics](https://software.intel.com/sites/landingpage/IntrinsicsGuide/). – paddy Dec 09 '16 at 01:20
  • Thinking about it, `c = ans[i] < (a[i] + c);` should fix that in a similarly elegant way as it was originally designed? – Ped7g Dec 09 '16 at 01:28
  • Look for `_addcarry_u64` at paddy's link. – Christopher Oicles Dec 09 '16 at 01:34
  • Trying out some variations of C++ on godbolt, but it looks like none of the compilers has the idea of the add+adc (chain) idiom. Then again, it's unlikely to be needed often, so I can sort of understand that, but some of the code generated looks ridiculous... :D Actually `clang` unrolled the loop for two items, using `adc` on the second go, then bringing the carry as a value for the next iteration, so it's not like `adc` is completely unheard of for clang, it just can't connect the dots completely. Or maybe there's some other way to write it in C++ to give the compiler a better hint. – Ped7g Dec 09 '16 at 01:36
  • Use [GMPlib](http://gmplib.org/). A bigint library is very difficult to design and implement. So don't reinvent it, use an existing one. – Basile Starynkevitch Dec 09 '16 at 13:47

2 Answers


but the above does not work with carry.

If you mean that GCC does not generate code that uses the ADC instruction, that's because its optimizer has determined that there is a more efficient way to implement the addition.

Here is my test version of your code. I have extracted the arrays out as parameters passed to the function so that the code cannot be elided and we can limit our study to the relevant portions.

void Test(uint64_t* a, uint64_t* b, uint64_t* ans, int n)
{
    for (int i = 0; i < n; ++i)
    {
        ans[i] = a[i] + b[i];
    }
}

Now, indeed, when you compile this with a modern version of GCC and look at the disassembly, you'll see a bunch of crazy-looking code.

The Godbolt compiler explorer is helpful enough that it color-codes lines of C source and their corresponding assembly code (or at least, it does so to the best of its ability; this isn't perfect in optimized code, but it works well enough here). The code highlighted for the inner body of the loop is what implements the 64-bit addition. GCC is emitting SSE2 instructions to do the addition. Specifically, you can pick out MOVDQU (which does an unaligned move of a double quadword from memory into an XMM register), PADDQ (which does an addition on packed integer quadwords), and MOVQ (which moves a quadword from an XMM register into memory). Roughly speaking, for a non-assembly expert, MOVDQU is how it loads the 64-bit integer values, PADDQ does the addition, and then MOVQ stores the result.

Part of what makes this output especially noisy and confusing is that GCC is unrolling the loop as part of auto-vectorizing it. If you disable vectorization (-fno-tree-vectorize), you get output that is easier to read, although it's still doing the 64-bit additions with the same SSE2 instructions, one element at a time. (Well, mostly. Now it's using MOVQ everywhere, for both loads and stores, instead of loading with MOVDQU.)
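
To make that concrete, the per-element SSE2 sequence looks roughly like this (an illustrative sketch in AT&T syntax with made-up register assignments, not actual GCC output):

movq  (%esi), %xmm0    # load the 64-bit a[i] into an XMM register
movq  (%ebx), %xmm1    # load b[i]
paddq %xmm1, %xmm0     # one 64-bit addition, in the low quadword
movq  %xmm0, (%edi)    # store ans[i]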

On the other hand, if you specifically forbid the compiler from using SSE2 instructions (-mno-sse2), you see output that is significantly different. Now, because it can't use SSE2 instructions, it's emitting basic x86 instructions to do the 64-bit addition—and the only way to do it is ADD + ADC.
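
Without SSE2, each 64-bit element has to be handled in two 32-bit halves, roughly like this (again an illustrative sketch with made-up registers, not actual GCC output):

mov  (%esi), %eax      # low 32 bits of a[i]
mov  4(%esi), %edx     # high 32 bits of a[i]
add  (%ebx), %eax      # add the low halves; sets CF on carry-out
adc  4(%ebx), %edx     # add the high halves plus the carry
mov  %eax, (%edi)      # store the low half of ans[i]
mov  %edx, 4(%edi)     # store the high half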

I suspect that this is the code you were expecting to see. Clearly, GCC believes that vectorizing the operation results in faster code, so this is what it does when you compile with the -O2 or -O3 flags. At -O1, it always uses ADD + ADC. This is one of those cases where fewer instructions does not imply faster code. (Or at least, GCC doesn't think so. Benchmarks on actual code might tell a different story. The overhead may be significant in certain contrived scenarios but irrelevant in the real world.)

For what it's worth, Clang behaves very similarly to GCC here.


If you meant that this code doesn't carry the result of the previous addition over to the next addition, then you're right. The second snippet of code that you've shown implements that algorithm, and GCC does compile this using the ADC instruction.

At least, it does when targeting x86-32. When targeting x86-64, where you have native 64-bit integer registers available, no multi-instruction "carrying" is needed within a single element; one ADD per 64-bit addition is sufficient, requiring significantly less code. In fact, a single 64-bit addition is "bigint" arithmetic only on 32-bit architectures, which is why I have assumed x86-32 in all of the foregoing analysis and compiler output.

In a comment, Ped7g wonders why compilers don't seem to have the idea of the ADD+ADC chain idiom. I'm not entirely sure what he's referring to here, since he didn't share any examples of the input code that he tried, but as I've shown, compilers do use ADC instructions here. However, compilers don't chain the carry across loop iterations. This is too difficult to implement in practice, because so many instructions clobber the flags. Someone hand-writing the assembly code might be able to do it, but compilers won't.

(Note that c should probably be an unsigned integer to encourage certain optimizations. In this case, it just ensures that GCC uses an XOR instruction when preparing to do a 64-bit addition instead of a CDQ. Although slightly faster, it is not a huge improvement, and mileage may vary with real code.)

(Also, it's disappointing that GCC is unable to emit branchless code for setting c inside of the loop. With sufficiently random input values, branch prediction will fail, and you'll end up with relatively inefficient code. There are almost certainly ways of writing the C source to persuade GCC to emit branchless code, but that's an entirely different answer.)
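
For instance, this branchless formulation computes the full carry-out of a[i] + b[i] + c directly. It is a sketch, untested against any particular GCC version, and as a bonus it also handles the missed-carry case Ped7g pointed out in the comments on the question:

uint64_t sum, c = 0;
for (int i = 0; i < n; i++) {
    sum = a[i] + b[i] + c;
    // Carry out iff the sum wrapped below a[i], or landed exactly on a[i]
    // while a carry came in (meaning b[i] + c wrapped all the way to zero).
    c = (sum < a[i]) | ((sum == a[i]) & c);
    ans[i] = sum;
}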


I would like to learn how to embed inline (intel) assembly and do it faster.

Well, we've already seen that it might not be faster if you naïvely cause a bunch of ADC instructions to be emitted. Don't hand-optimize unless you are confident that your assumptions about performance are correct!

Also, not only is inline assembly difficult to write, debug, and maintain, but it may even make your code slower because it inhibits certain optimizations that could otherwise have been done by the compiler. You need to be able to prove that your hand-coded assembly is a significant enough performance win over what the compiler would generate that these considerations become less relevant. You also should confirm that there is no way to get the compiler to generate code that is close to your ideal output, either by altering flags or cleverly writing the C source.

But if you really wanted to, you could read any of a variety of online tutorials that teach you how to use GCC's inline assembler, and of course there is the GCC manual itself. All of these will explain how "extended asm" allows you to specify input operands and output operands, which answers your question of how to pass parameters to the assembler from the rest of the C++ code.
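
As a minimal sketch of that operand syntax, here is a single 64-bit ADD for x86-64 GCC or Clang (the function name is mine):

uint64_t add64(uint64_t x, uint64_t y)
{
    uint64_t sum;
    asm("add %2, %0"          // AT&T syntax: add %2 (y) into %0
        : "=r" (sum)          // output %0: any general-purpose register
        : "0" (x), "rme" (y)  // %1 starts in the same register as %0;
                              // %2 may be a register, memory, or immediate
        : "cc");              // the addition clobbers the flags
    return sum;
}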

As paddy and Christopher Oicles suggested, you should prefer intrinsics to inline assembly. You're in luck here: there are _addcarry_u32 and _addcarry_u64 intrinsics for exactly this add-with-carry pattern. Per Intel's intrinsics guide, these map to ADC, while the separate _addcarryx variants target ADCX and ADOX. Failing that, inline assembly is your recourse, or, as I already suggested, you can write the C source so that the compiler will do the Right Thing™ on its own.

ADCX and ADOX are "extended" versions of ADC that can produce more efficient code by running two carry chains in parallel. They are part of the Intel ADX instruction set, introduced with the Broadwell microarchitecture. In my opinion, Broadwell does not have sufficiently high market penetration that you can simply emit ADCX or ADOX instructions and call it a day. Lots of users still have older machines, and it's in your interest to support them to the extent possible. They're great if you're preparing builds tuned for specific architectures, but I would not recommend them for general use.
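
Here is a sketch of the carried loop using the intrinsic (untested; the function name is mine, and the pointer cast matches the unsigned long long * parameter that GCC and Clang declare):

#include <immintrin.h>
#include <cstdint>

void AddWithCarry(const uint64_t* a, const uint64_t* b, uint64_t* ans, int n)
{
    unsigned char carry = 0;
    for (int i = 0; i < n; ++i) {
        // ans[i] = a[i] + b[i] + carry; the return value is the carry-out
        carry = _addcarry_u64(carry, a[i], b[i],
                              reinterpret_cast<unsigned long long*>(&ans[i]));
    }
}

(As Peter Cordes notes in the comments below, GCC's code generation for this intrinsic in a loop was poor at the time, bouncing the carry out of CF with SETC and back in with a compare, so inspect the asm before relying on it.)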


I am sure there are 64-bit opcodes, the equivalent of add+adc

There are 64-bit versions of ADD and ADC (and ADCX and ADOX) when you're targeting a 64-bit architecture. This would then allow you to implement 128-bit "bigint" arithmetic using the same pattern.
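
Incidentally, on x86-64 you don't even need assembly to see that pattern: GCC and Clang provide a 128-bit integer type, and a 128-bit addition compiles to exactly the ADD + ADC pair (a sketch; the function name is mine):

unsigned __int128 Add128(unsigned __int128 x, unsigned __int128 y)
{
    return x + y;  // an ADD of the low 64 bits, then an ADC of the high 64 bits
}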

On x86-32, there are no 64-bit versions of these instructions in the base instruction set. You must turn to SSE2, like we saw GCC and Clang do.

– Cody Gray
  • Intel's intrinsics guide lists `_addcarry_u32` as the intrinsic for ADC, with a separate `_addcarryx_u32` for ADCX/ADOX. Last time I tried `_addcarry_u32`, though, gcc emitted pretty horrible code that kept getting the carry flag out of CF into an integer register with SETC, and then putting it back in with a compare or something (even with no loop involved). Also, gcc doesn't know about running two chains in parallel with ADCX/ADOX, and `_addcarryx_u32` is basically a synonym for the non-x version. (But presumably when it does learn, `-madx` will enable ADCX/ADOX for either intrinsic) – Peter Cordes Dec 09 '16 at 17:40
  • Previous discussion of the challenges of ADC loops on different microarchitectures: http://stackoverflow.com/questions/32084204/problems-with-adc-sbb-and-inc-dec-in-tight-loops-on-some-cpus – Peter Cordes Dec 09 '16 at 17:43

I'm not entirely sure if this is what you were looking for, and my assembly skills are definitely not the best (the lack of operand-size suffixes, for example), but this uses ADC and should solve your problem.

Note the omission of the C++ for loop; we need to loop in asm because we need CF to survive across iterations. (GCC6 has flag output constraints, but not flag inputs; there's no way to ask the compiler to pass FLAGS from one asm statement to another, and gcc would probably do it inefficiently with setc/cmp even if there was syntax for it.)
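
For reference, a flag output constraint looks like this: a sketch for GCC 6+ on x86-64 (the function name is mine), where the carry flag comes out of the asm statement as a C value with no SETC needed in the template:

bool add_carry_out(uint64_t x, uint64_t y, uint64_t* sum)
{
    bool carry;
    asm("add %3, %1"
        : "=@ccc" (carry),  // %0: CF materialized as a bool output
          "=r" (*sum)       // %1: the 64-bit sum
        : "1" (x),          // %2: starts in the same register as %1
          "re" (y));        // %3: register or 32-bit immediate
    return carry;
}

But there is still no input-side equivalent, so the loop below has to stay inside a single asm statement.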

#include <cstdint>
#include <iostream>

#define N 4

int main(int argc, char *argv[]) {

  uint64_t ans[N];
  const uint64_t a[N] = {UINT64_MAX, UINT64_MAX, 0, 0};
  const uint64_t b[N] = {2, 1, 3, 1};

  uint64_t i = N;  // not const: the edited version below modifies it
  asm volatile (
      "xor %%eax, %%eax\n\t"      // i=0  and clear CF
      "mov %3, %%rdi\n\t"         // N

      ".L_loop:\n\t"

      "mov (%%rax,%1), %%rdx\n\t" // rdx = a[i]

      "adc (%%rax,%2), %%rdx\n\t" // rdx += b[i] + carry

      "mov %%rdx, (%%rax, %0)\n\t"// ans[i] = a[i] + b[i]

      "lea 8(%%rax), %%rax\n\t"   // i += 8 bytes

      "dec %%rdi\n\t"             // --i

      "jnz .L_loop\n\t"   // if (rdi == 0) goto .L_loop;
      : /* Outputs (none) */
      : /* Inputs */ "r" (ans), "r" (a), "r" (b), "r" (i)
      : /* Clobbered */ "%rax", "%rbx", "%rdx", "%rdi", "memory"
  );

  // SHOULD OUTPUT 1 1 4 1
  for (int i = 0; i < N; ++i)
    std::cout << ans[i] << std::endl;

  return 0;
}

To avoid clobbering the carry flag (CF), I count down to 0 so that no CMP is needed. DEC does not modify the carry flag, so it is the perfect contender for this application. However, I don't know how to index from the beginning of the arrays using %rdi any faster than spending the extra instruction and register on advancing %rax.

The volatile and "memory" clobber are necessary because we only ask the compiler for pointer inputs, and don't tell it which memory we actually read and write.

On some older CPUs, notably Core2 / Nehalem, adc after an inc or dec will cause a partial-flag stall. See Problems with ADC/SBB and INC/DEC in tight loops on some CPUs. But on modern CPUs, this is efficient.

EDIT: As pointed out by @PeterCordes, my inc %rax and scaling by 8 with lea was horribly inefficient (and stupid now that I think about it). Now, it is simply lea 8(%rax), %rax.


Editor's note: we can save another instruction by using a negative index from the end of the array, counting up toward 0 with inc / jnz.

(This hard-codes the array size at 4. You could maybe make this more flexible by asking for the array length as an immediate constant, and -i as an input. Or asking for pointers to the end.)

// untested
  asm volatile (
      "mov   $-3, %[idx]\n\t"        // i=-3   (which we will scale by 8)

      "mov   (%[a]), %%rdx  \n\t"
      "add   (%[b]), %%rdx  \n\t"    // peel the first iteration so we don't have to zero CF first, and ADD is faster on some CPUs.
      "mov    %%rdx, (%0) \n\t"

      ".L_loop:\n\t"                        // do{
      "mov    8*4(%[a], %[idx], 8), %%rdx\n\t"   // rdx = a[i + len]
      "adc    8*4(%[b], %[idx], 8), %%rdx\n\t"   // rdx += b[i + len] + carry
      "mov    %%rdx,  8*4(%[ans], %[idx], 8)\n\t"  // ans[i] = rdx

      "inc    %[idx]\n\t"
      "jnz    .L_loop\n\t"                  // }while (++i);

      : /* Outputs, actually a read-write input */ [idx] "+&r" (i)
      : /* Inputs */ [ans] "r" (ans), [a] "r" (a), [b] "r" (b)
      : /* Clobbered */ "rdx", "memory"
  );

The loop label should probably use %= (which expands to a number unique to each instance of the asm statement) in case GCC duplicates this code, or use a numbered local label like 1: and jump with jnz 1b.

Using a scaled-index addressing mode is no more expensive than a regular indexed addressing mode (2 registers) like we were using before. Ideally we'd use a one-register addressing mode for either the adc or the store, maybe indexing the other two arrays relative to ans, by subtracting the pointers on input.

But then we'd need a separate LEA to increment by 8, because we still need to avoid destroying CF. Still, on Haswell and later, indexed stores can't use the AGU on port 7, and on Sandybridge/Ivybridge they un-laminate into 2 uops. So for Intel SnB-family, avoiding an indexed store here would be good, because we need 2x load + 1x store per iteration. See Micro fusion and addressing modes

Earlier Intel CPUs (Core2 / Nehalem) will have partial-flag stalls on the above loop, so the above issues are irrelevant for them.

AMD CPUs are probably fine with the above loop. Agner Fog's optimization and microarch guides don't mention any serious problems.

Unrolling a bit wouldn't hurt, though, for AMD or Intel.

– buhao (edited by Peter Cordes)
  • You need a `"memory"` clobber or dummy `"m"` inputs and outputs; the compiler does *not* assume that the memory pointed-to by your register inputs is also an input or output, so can move reads or writes to those arrays across the asm statement. Also, `xor %eax,%eax` would zero EAX *and* set CF=0 (`clc`) more efficiently than the `mov $0,%eax` alone. – Peter Cordes May 27 '19 at 17:15
  • The `inc` + `lea` to scale by 8 is inefficient; use `lea 8(%%rcx), %%rcx` to add 8 without affecting flags. – Peter Cordes May 27 '19 at 17:17
  • @PeterCordes thank you for your insight; I will update my answer to reflect. – buhao May 27 '19 at 18:24
  • Another trick to save loop overhead is indexing from the end of the array, with `mov $-3, %rax` before the loop. Inside the loop you have `inc %rax` / `jnz`, and you use *scaled* index addressing modes like `(%0, %rax, 8)`. You can bake in the "end of the array" thing too with a displacement, or ask the compiler for pointers to the ends of the arrays as inputs. I'll make an edit on your answer because that's easier than explaining it. – Peter Cordes May 27 '19 at 18:36
  • I probably should have put my edit into a separate answer; there are no easy optimal solutions that are good on every microarchitecture so there's quite a lot to say. – Peter Cordes May 27 '19 at 19:21