2

Yeah thanks that works. @PeterCordes. Also __int128 works. But one more thing as you said using the intrinsics of multiprecision arithmetic that is _addcarry_u64 in C, using the header file immintrin.h I have the following code

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <immintrin.h>

unsigned char _addcarry_u64(unsigned char c_in, uint64_t src1, uint64_t src2,uint64_t *sum);

int main()
{
    unsigned char carry;
    uint64_t sum;
    long long int c1=0,c2=0;
    uint64_t a=0x0234BDFA12CD4379,b=0xA8DB4567ACE92B38;
    carry = _addcarry_u64(0,a,b,&sum);
    printf("sum is %lx and carry value is %u n",sum,carry);
    return 0;
}

Can you please point me out the error? I'm getting undefined reference to _addcarry_u64. Some quick google doesn't answer the problem , if any other header file to be used or it is not compatible with gcc and why so

Initially I had this code for adding two 64 bit numbers:

static __inline int is_digit_lessthan_ct(digit_t x, digit_t y)
{ // Is x < y?
    return ( int)((x ^ ((x ^ y) | ((x - y) ^ y))) >> (RADIX-1)); 
}


#define ADDC(carryIn, addend1, addend2, carryOut, sumOut) \
       { digit_t tempReg = (addend1) + (int)(carryIn);    \
                (sumOut) = (addend2) + tempReg;           \
              (carryOut) = (is_digit_lessthan_ct(tempReg, (int)(carryIn)) | is_digit_lessthan_ct((sumOut), tempReg)); \
 }

Now I got to know that the speed of this implementation can be improved using assembly language. So I am trying to do something similar however I cannot access or return the carry. Here is my code :

#include<stdio.h>
#include<stdlib.h>
#include<stdint.h>
uint64_t add32(uint64_t a,uint64_t b)
{
    uint64_t d=0,carry=0;
    __asm__("mov %1,%%rax\n\t"
            "adc %2,%%rax\n\t"
            "mov %%rax,%0\n\t"
            :"=r"(d)
            :"r"(a),"r"(b)
            :"%rax"
           );
    return d;
}
int main()
{
    uint64_t a=0xA234BDFA12CD4379,b=0xA8DB4567ACE92B38;
    printf("Sum = %lx \n",add32(a,b));
    return 0;
}

The result of this addition should be 14B100361BFB66EB1, where the initial 1 in msb is the carry. I want to save that carry in another register. I tried jc, but I'm getting some or the other error. Even setc gave me error, may be because I'm not sure of the syntax. So can anyone tell me how to save the carry in another register or return it by modifying this code?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    Have you looked at the [docs](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#FlagOutputOperands)? – David Wohlferd Oct 12 '17 at 04:28
  • 1
    On newer verions of GCC you can use the [`"=@ccc"(carry)`](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html) output constraint to get the value of the carry flag and store it into the `carry` variable. Alternatively you can use the `setc` instruction in the extended assembly template to set the value of an output constraint to the value of the carry flag. – Michael Petch Oct 12 '17 at 04:29
  • Or you can `asm goto` and `jc` to a `return 1;` or fall through to a `return 0;` This can compile to good code if you're using it in an `if()`. – Peter Cordes Oct 12 '17 at 04:54
  • BTW, if the first or last instruction in your inline asm is a `mov`, you're usually doing it wrong. Tell the compiler you want to produce the output in the same register as one of the inputs, or use a `"+r"` read-write operand. See https://stackoverflow.com/tags/inline-assembly/info. – Peter Cordes Oct 12 '17 at 05:27
  • You only need `adc` if you want to support a carry *in* as well as a carry *out. What you're doing makes *no* sense. You're on x86-64, so the compiler will easily do 64-bit integer math for you. Use `add_carryout` from my answer. – Peter Cordes Oct 13 '17 at 00:50
  • There's no way your `add32` can return the carry-out as part of the `uint64_t` it returns. And no way a single `%lx` conversion can print a number that doesn't fit in 64 bits. Either print the carry separately or use `unsigned __int128`. – Peter Cordes Oct 13 '17 at 00:51
  • suppose I change add32 declaration as void add32(uint64_t a, uint64_t b, uint64_t *sum, uint64_t *carry) then I don't need to return. But as I said I cannot save the carry part or neither can I access the carry value. What is the way to print it or access within the function ? – Tanushree Banerjee Oct 13 '17 at 01:11
  • My answer already answers that. See `add_carryout()`. `sum = x + y; carry = (sum < x);` Pure C and compiles to good asm with gcc and clang, so that answers your question. – Peter Cordes Oct 13 '17 at 01:42
  • 1
    @TanushreeBanerjee : when you added the `"=@ccc"(carry)` constraint did you also change the assembler template by adjusting the parameters as they are no longer %0, %1, and %2. At the very least %1 would become %2 and %2 would become %3. I can see you getting errors in the assembler template if some of your parameters were off by 1 by inserting the new constraint. – Michael Petch Oct 13 '17 at 01:45
  • Yeah thanks that works. @PeterCordes. Also __int128 works. But one more thing as you said using the intrinsics of multiprecision arithmetic that is _addcarry_u64 in C, using the header file immintrin.h I have the following code : – Tanushree Banerjee Oct 13 '17 at 03:13

2 Answers2

4

As usual, inline asm is not strictly necessary. https://gcc.gnu.org/wiki/DontUseInlineAsm. But currently compilers kinda suck for actual extended-precision addition, so you might want asm for this.

There's an Intel intrinsic for adc: _addcarry_u64. But gcc and clang may make slow code., unfortunately. In GNU C on a 64-bit platform, you could just use unsigned __int128.


Compilers usually manage to make pretty good code when checking for carry-out from addition using the idiom that carry_out = (x+y) < x, where < is an unsigned compare. For example:

struct long_carry { unsigned long res; unsigned carry; };

struct long_carry add_carryout(unsigned long x, unsigned long y) {
    unsigned long retval = x + y;
    unsigned carry = (retval < x);
    return (struct long_carry){ retval, carry };
}

gcc7.2 -O3 emits this (and clang emits similar code):

    mov     rax, rdi        # because we need return value in a different register
    xor     edx, edx        # set up for setc
    add     rax, rsi        # generate carry
    setc    dl              # save carry.
    ret                     # return with rax=sum, edx=carry  (SysV ABI struct packing)

There's no way you can do better than this with inline asm; this function already looks optimal for modern CPUs. (Well I guess if mov wasn't zero latency, doing the add first would shorten the latency to carry being ready. But on Intel CPUs, it's supposed to be better to overwrite mov-elimination results right away, so it's better to mov first and then add.)


Clang will even use adc to use the carry-out from an add as the carry-in to another add, but only for the first limb. Perhaps because: Update: this function is broken: carry_out = (x+y) < x doesn't work when there's carry-in. With carry_out = (x+y+c_in) < x, y+c_in can wrap to zero and give you (x+0) < x (false) even though there was carry.

Notice that clang's cmp/adc reg,0 exactly implements the behaviour of the C, which isn't the same as another adc there.

Anyway, gcc doesn't even use adc the first time, when it is safe. (So use unsigned __int128 for code that doesn't suck, and asm for integers even wider than that).

// BROKEN with carry_in=1 and y=~0U
static
unsigned adc_buggy(unsigned long *sum, unsigned long x, unsigned long y, unsigned carry_in) {
    *sum = x + y + carry_in;
    unsigned carry = (*sum < x);
    return carry;
}

// *x += *y
void add256(unsigned long *x, unsigned long *y) {
    unsigned carry;
    carry = adc(x, x[0], y[0], 0);
    carry = adc(x+1, x[1], y[1], carry);
    carry = adc(x+2, x[2], y[2], carry);
    carry = adc(x+3, x[3], y[3], carry);
}

    mov     rax, qword ptr [rsi]
    add     rax, qword ptr [rdi]
    mov     qword ptr [rdi], rax

    mov     rax, qword ptr [rdi + 8]
    mov     r8, qword ptr [rdi + 16]   # hoisted
    mov     rdx, qword ptr [rsi + 8]
    adc     rdx, rax                   # ok, no memory operand but still adc
    mov     qword ptr [rdi + 8], rdx

    mov     rcx, qword ptr [rsi + 16]   # r8 was loaded earlier
    add     rcx, r8
    cmp     rdx, rax                    # manually check the previous result for carry.  /facepalm
    adc     rcx, 0

    ...

This sucks, so if you want extended-precision addition, you still need asm. But for getting the carry-out into a C variable, you don't.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • @TanushreeBanerjee: You mean for 32-bit mode? Compilers already use `adc` fairly optimally for `unsigned long long` (`uint64_t`) in 32-bit code. Edit your question with what exactly you're trying to speed up that the compiler doesn't already do optimally. – Peter Cordes Oct 12 '17 at 20:23
  • I have edited the question. So right now can you suggest me,what should be the modification and do you think it will speed up implementation?Say when it is done for 1 million times, will there be difference? – Tanushree Banerjee Oct 13 '17 at 00:39
  • Btw I also tried as @MichaelPetch said using "=@c(carry)", but since the type is defined uint64_t this gives negative weird results. When I changed the datatype of carry to int, it gives me errors in mov instruction. I tried with movl, movw, movx and nothing worked. – Tanushree Banerjee Oct 13 '17 at 00:48
  • Yes MichaelPetch and @Petercordes those work. Right now I was also looking at using intrinsics of Intel, but I'm not sure if I'm doing wrong in syntaxs, can you advise me please? – Tanushree Banerjee Oct 13 '17 at 03:21
  • 1
    @TanushreeBanerjee: My answer already linked a question that uses them correctly. Why don't you just **read my answer**; it's already answered everything you've asked in comments after I posted it. – Peter Cordes Oct 13 '17 at 03:23
1

for C:

bool add(bool const c, uint64_t a, uint64_t const b, uint64_t* const sum)
{
  *sum = a += b + c;
  return c ? a <= b : a < b;
  // return (((a <= b) ^ (a < b)) & c) ^ (a < b);
  // return ((a == b) & c) ^ (a < b);
  // return (a < b) || (c && (a == b));
}

Everything is working according to your specs.

BTW: You can find the reasoning behind the magic in the book "Modern Computer Arithmetic (Cambridge Monographs on Applied and Computational Mathematics Book 18) 1st Edition, Richard P. Brent (Author), Paul Zimmermann (Author)".

user1095108
  • 14,119
  • 9
  • 58
  • 116
  • 1
    Unfortunately compilers don't seem to optimize this into an add-carry instruction, even when inlining, which rather takes the fun out of it. See https://godbolt.org/z/e1ochqqE5. – Nate Eldredge Nov 05 '22 at 19:45
  • there may be some optimization and who knows what the future may bring. – user1095108 Nov 07 '22 at 16:41