Fastest way to set a Carry Flag

Question

I'm writing a loop that sums two arrays. My goal is to avoid explicit carry checks such as c = a + b; carry = (c < a). The problem is that I lose CF when the loop test runs its cmp instruction.

Currently I am using JE and STC to test the previously saved state of CF and set it again. But the jump takes roughly 7 cycles, which is a lot for what I want.

   //This one is working
   asm(
        "cmp $0,%0;"
        "je 0f;"
        "stc;"
    "0:"   
        "adcq %2, %1;"
        "setc %0"

    : "+r" (carry), "+r" (anum)
    : "r" (bnum)
   );

I already tried using SAHF (2 + 2 (mov) cycles), but that did not work.

   //Does not work
   asm(
        "mov %0, %%ah;"
        "sahf;"
        "adcq %2, %1;"
        "setc %0"

        : "+r" (carry), "+r" (anum)
        : "r" (bnum)
   );

Does anyone know a quicker way to set CF? Like a direct move or something similar.

Are you doing 32 bits on a 64 bit machine? or 64 bit adds on a 64 bit machine? — Ira Baxter, Feb 09 '16 at 18:58
I am using a 64bit machine. For record, the `anum` and `bnum` are 8 bytes long. — Hélder Gonçalves, Feb 09 '16 at 20:12

score 3 · Accepted Answer · edited May 23 '17 at 12:23

Looping without clobbering CF will be faster. See that link for some better asm loops.

Don't try to write just the adc with inline asm inside a C loop. It's impossible for that to be optimal, because you can't ask gcc not to clobber flags. Trying to learn asm with GNU C inline asm is harder than writing a stand-alone function, esp. in this case where you are trying to preserve the carry flag.

You could use setnc %[carry] to save and subb $1, %[carry] to restore. (Or cmpb $1, %[carry] I guess.) Or as Stephen points out, save with setc and restore with negb %[carry], since neg sets CF whenever its operand is non-zero.

0 - 1 produces a carry, but 1 - 1 doesn't.
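
The arithmetic behind those two restore idioms can be modeled in portable C. This is only a sketch under stated assumptions: the function names `cf_after_sub1` and `cf_after_neg` are made up for illustration, and `__builtin_sub_overflow` (a GCC/Clang builtin) stands in for the borrow that `subb` would leave in CF.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `subb $1, %[carry]` after a `setnc` save.
   saved_nc is 1 when CF was clear and 0 when CF was set.
   The borrow out of (saved_nc - 1) is the CF the CPU would produce. */
int cf_after_sub1(uint8_t saved_nc)
{
    assert(saved_nc <= 1);
    uint8_t diff;
    /* Returns true when the exact result does not fit in uint8_t,
       i.e. when the subtraction borrows. */
    return __builtin_sub_overflow(saved_nc, (uint8_t)1, &diff);
}

/* Model of `negb %[carry]` after a `setc` save.
   neg sets CF exactly when its operand is non-zero. */
int cf_after_neg(uint8_t saved_c)
{
    assert(saved_c <= 1);
    return saved_c != 0;
}
```

With a `setnc` save, a stored 0 means CF was set, and 0 - 1 borrows, so CF comes back set. With a `setc` save, a stored 1 means CF was set, and negating a non-zero value sets CF again.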

Use a uint8_t variable to hold the carry, since you will never add it directly to %[anum]. This avoids any chance of partial-register slowdowns. e.g.

uint8_t carry = 0;
int64_t anum, bnum;

for (...) {
    asm ( "negb   %[carry]\n\t"
          "adc    %[bnum], %[anum]\n\t"
          "setc   %[carry]\n\t"
          : [carry] "+&r" (carry), [anum] "+r" (anum)
          : [bnum] "rme" (bnum)
          : // no clobbers
        );
}

You could also provide an alternate constraint pattern for register source, reg/mem dest. I used an x86 "e" constraint instead of "i", because 64bit mode still only allows 32bit sign-extended immediates. gcc will have to get larger compile-time constants into a register on its own. Carry is early-clobbered, so even if it and bnum were both 1 to start with, gcc couldn't use the same register for both inputs.

This is still terrible, and increases the length of the loop-carried dependency chain from 2c to 4c (Intel pre-Broadwell), or from 1c to 3c (Intel BDW/Skylake, and AMD).

So your loop runs at 1/3rd speed because you're using a kludge instead of writing the whole loop in asm.
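
For comparison, here is a minimal sketch of what "the whole loop in asm" could look like, assuming x86-64 and GCC/Clang extended asm. The name `add_limbs` and the operand names are hypothetical, and the plain C branch is only a fallback for other targets. The loop counter is updated with `dec` and the pointers with `lea`, neither of which writes CF, so the carry survives from one `adc` to the next. On some older Intel microarchitectures the `dec`/`adc` pairing can hit the partial-flags hazard, so treat this as an illustration rather than a tuned loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a[i] += b[i] over n 64-bit limbs, least significant limb first.
   Returns the carry out of the most significant limb (0 or 1). */
uint8_t add_limbs(uint64_t *a, const uint64_t *b, size_t n)
{
    assert(n > 0);                     /* the asm loop runs at least once */
    uint8_t carry;
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t tmp;
    __asm__ volatile(
        "clc\n\t"                      /* start with CF = 0 */
        "1:\n\t"
        "movq (%[b]), %[tmp]\n\t"      /* mov does not touch flags */
        "adcq %[tmp], (%[a])\n\t"      /* a[i] += b[i] + CF */
        "leaq 8(%[a]), %[a]\n\t"       /* lea does not touch flags */
        "leaq 8(%[b]), %[b]\n\t"
        "decq %[n]\n\t"                /* dec writes ZF but leaves CF alone */
        "jnz 1b\n\t"
        "setc %[carry]"                /* CF still holds the last adc's carry */
        : [a] "+r"(a), [b] "+r"(b), [n] "+r"(n),
          [tmp] "=&r"(tmp), [carry] "=r"(carry)
        :
        : "cc", "memory");
#else
    /* Portable fallback: explicit carry checks in plain C. */
    carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        uint8_t c1 = s < b[i];         /* carry out of a[i] + b[i] */
        s += carry;
        uint8_t c2 = s < carry;        /* carry out of adding the carry-in */
        a[i] = s;
        carry = c1 | c2;
    }
#endif
    return carry;
}
```

Because the compiler never gets a chance to insert flag-clobbering instructions between the `adc`s, no save/restore of CF is needed inside the loop.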

A previous version of this answer suggested adding the carry directly, instead of restoring it into CF. This approach has a fatal flaw: it mixed up the incoming carry into this iteration with the outgoing carry going to the next iteration.
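
To make the incoming/outgoing distinction concrete, here is a plain C model of a single `adc` step. It is a sketch, not the code from the earlier revision: `adc_u64` is a hypothetical helper, and it tracks the carry out of both partial additions so that neither is lost.

```c
#include <assert.h>
#include <stdint.h>

/* One limb of a multi-word add: returns a + b + carry_in (mod 2^64)
   and stores the outgoing carry (0 or 1) in *carry_out. */
uint64_t adc_u64(uint64_t a, uint64_t b, uint8_t carry_in, uint8_t *carry_out)
{
    assert(carry_in <= 1);
    uint64_t t = a + carry_in;   /* wraps to 0 when a is all-ones and carry_in is 1 */
    uint8_t c1 = t < carry_in;   /* carry out of the first addition */
    uint64_t sum = t + b;
    uint8_t c2 = sum < b;        /* carry out of the second addition */
    *carry_out = c1 | c2;        /* at most one of the two can be set */
    return sum;
}
```

The case to watch is `a == UINT64_MAX` with a carry-in of 1: `a + carry_in` wraps to 0, and if only the flags of the following `add b, a` are kept, that carry-out disappears.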

Also, sahf is Store AH into Flags, and lahf is Load AH from Flags (both operate on the whole low 8 bits of flags). Pair those instructions; don't use sahf on a 0 or 1 that you got from setc.

Read the insn set reference manual for any insns that don't seem to be doing what you expect. See https://stackoverflow.com/tags/x86/info

Unfortunately `add carry, a; add b, a` does not set the carry flag correctly in the case where there is carry-in and `a` is `0xffff....ffff`. — Stephen Canon, Feb 09 '16 at 18:30
I already have an implementation of the loop that doesn't clobber CF and uses only registers, but it only worked on my computer. When I ran it on another one the result was a segmentation fault, because the registers used were different. Note: the compilers were also different (gcc & icc). Do you know if this always happens when switching microarchitecture (e.g. Sandy Bridge & Haswell), or is it also because of the compiler? — Hélder Gonçalves, Feb 09 '16 at 19:07
For the second example, this works: `asm("mov %0, %%ah;" "sahf;" : "+r" (aux));` `asm("adcq %2, %1;" "setc %0" : "+r" (aux), "+r" (anum) : "r" (bnum));` But unfortunately, this happens: `movb %dl, -0x3d(%rbp); movb -0x3d(%rbp), %dl` — Hélder Gonçalves, Feb 09 '16 at 19:08
@HélderGonçalves: If you write your constraints wrong, you will have problems. This is why inline asm is the hardest way to learn asm. Also, if you're going to use `sahf`/`lahf`, you should use an `int` with an `"a"` constraint to force it to be in `eax`. Always avoid writing `mov` instructions when you can get gcc to put stuff where you need it. Often it can do so without extra instructions. — Peter Cordes, Feb 09 '16 at 19:46
@PeterCordes: That doesn't fix the problem; now you end up off by one, and still don't set carry correctly. There's a few ways to fix this, one simple one is to add `-1` to carry, then do `adc b, a`. — Stephen Canon, Feb 09 '16 at 21:00
You can also restore `carry` to `CF` by negating it, subject to the partial flags update hazard on some µarches. — Stephen Canon, Feb 09 '16 at 21:08
@StephenCanon: I see now, I was turning an outgoing carry from `a+carry` into an incoming carry. Thanks again. — Peter Cordes, Feb 09 '16 at 21:41

score 0 · Answer 2 · answered Feb 10 '16 at 03:35

If the array size is known at compile time, you could do something like this:

#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define str(s) #s
#define xstr(s) str(s)

#define ARRAYSIZE 4

asm(".macro AddArray2 p1, p2, from, to\n\t"
    "movq (\\from*8)(\\p2), %rax\n\t"
    "adcq %rax, (\\from*8)(\\p1)\n\t"
    ".if \\to-\\from\n\t"
    "   AddArray2 \\p1, \\p2, \"(\\from+1)\", \\to\n\t"
    ".endif\n\t"
    ".endm\n");

asm(".macro AddArray p1, p2, p3\n\t"
    "movq (\\p2), %rax\n\t"
    "addq %rax, (\\p1)\n\t"
    ".if \\p3-1\n\t"
    "   AddArray2 \\p1, \\p2, 1, (\\p3-1)\n\t"
    ".endif\n\t"
    ".endm");

int main()
{
   unsigned char carry;

   // assert(ARRAYSIZE > 0);

   // Create the arrays
   uint64_t *anum = (uint64_t *)malloc(ARRAYSIZE * sizeof(uint64_t));
   uint64_t *bnum = (uint64_t *)malloc(ARRAYSIZE * sizeof(uint64_t));

   // Put some data in
   memset(anum, 0xff, ARRAYSIZE * sizeof(uint64_t));
   memset(bnum, 0, ARRAYSIZE * sizeof(uint64_t));
   bnum[0] = 1;

   // Print the arrays before the add
   printf("anum: ");
   for (int x=0; x < ARRAYSIZE; x++)
   {
      printf("%" PRIx64 " ", anum[x]);
   }
   printf("\nbnum: ");
   for (int x=0; x < ARRAYSIZE; x++)
   {
      printf("%" PRIx64 " ", bnum[x]);
   }
   printf("\n");

   // Add the arrays
   asm ("AddArray %[anum], %[bnum], " xstr(ARRAYSIZE) "\n\t"
        "setc %[carry]" // Get the flags from the final add

       : [carry] "=q"(carry)
       : [anum] "r" (anum), [bnum] "r" (bnum)
       : "rax", "cc", "memory"
   );

   // Print the result
   printf("Result: ");
   for (int x=0; x < ARRAYSIZE; x++)
   {
      printf("%" PRIx64 " ", anum[x]);
   }
   printf(": %d\n", carry);
}

This gives code like this:

mov    (%rsi),%rax
add    %rax,(%rbx)
mov    0x8(%rsi),%rax
adc    %rax,0x8(%rbx)
mov    0x10(%rsi),%rax
adc    %rax,0x10(%rbx)
mov    0x18(%rsi),%rax
adc    %rax,0x18(%rbx)
setb   %bpl

Since adding 1 to all f's will completely overflow everything, the output from the code above is:

anum: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
bnum: 1 0 0 0
Result: 0 0 0 0 : 1

As written, ARRAYSIZE can be up to about 100 elements (due to gnu's macro depth nesting limits). Seems like it should be enough...
