CUDA - PTX carry propagation

Question

I want to add two 32-bit unsigned integers in CUDA PTX and I also want to take care of the carry propagation. I am using the code below to do that, but the result is not as expected.
Acording to the documentation, the add.cc.u32 d, a, b performs integer addition and writes the carry-out value into the condition code register, that is CC.CF.
On the other hand, addc.cc.u32 d, a, b performs integer addition with carry-in and writes the carry-out value into the condition code register. The semantics of this instruction would be
d = a + b + CC.CF. I also tryed addc.u32 d, a, b with no difference.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>
#include "device_launch_parameters.h"
#include <cuda.h>

typedef unsigned int u32;
#define TRY_CUDA_CALL(x) \
do \
  { \
    cudaError_t err; \
    err = x; \
    if(err != cudaSuccess) \
  { \
    printf("Error %08X: %s at %s in line %d\n", err, cudaGetErrorString(err), __FILE__, __LINE__); \
    exit(err); \
  } \
} while(0)


__device__ u32
__uaddo(u32 a, u32 b) {
    u32 res;
    asm("add.cc.u32 %0, %1, %2; /* inline */ \n\t" 
        : "=r" (res) : "r" (a) , "r" (b));
    return res;
}

__device__ u32
__uaddc(u32 a, u32 b) {
    u32 res;
    asm("addc.cc.u32 %0, %1, %2; /* inline */ \n\t" 
        : "=r" (res) : "r" (a) , "r" (b));
    return res;
}

__global__ void testing(u32* s)
{
    u32 a, b;

    a = 0xffffffff;
    b = 0x2;
    
    s[0] = __uaddo(a,b);
    s[0] = __uaddc(0,0);

}

int main()
{
    u32 *s_dev;
    u32 *s;
    s = (u32*)malloc(sizeof(u32));
    TRY_CUDA_CALL(cudaMalloc((void**)&s_dev, sizeof(u32)));
    testing<<<1,1>>>(s_dev);
    TRY_CUDA_CALL( cudaMemcpy(s, s_dev, sizeof(u32), cudaMemcpyDeviceToHost) );
    
    printf("s = %d;\n",s[0]);
    
    
    return 1;
}

As far as I know, you get a carry if the result doesn't fit in the variable, which happens here and an overflow if the sign bit is corrupted, but I'm working with unsigned values.
The code above tries to add 0xFFFFFFFF to 0x2 and of course the result won't fit on 32-bit, so why I don't get a 1 after __uaddc(0,0) call?

EDIT

Nvidia Geforce GT 520mx
Windows 7 Ultimate, 64-bit
Visual Studio 2012
CUDA 7.0

See [this answer](http://stackoverflow.com/a/6220499/780717) for a worked example of how to use carry-propagation in PTX for multi-word arithmetic. — njuffa, Mar 26 '16 at 17:34
I used `add_uint128` from your answer and the carry propagation was working, but what is wrong with my aproach? The succesion `add.cc.u32` and `addc.cc.u32` is the same as I can see. — Dani Grosu, Mar 26 '16 at 18:27
The succesion is the same, but I'm using different calls. I think the register `CC.CF` should not change. — Dani Grosu, Mar 26 '16 at 18:34
Flags are ephemeral. Unless you re-use the carry flag in the *same* `asm` statement, there is no guarantee that it will still be available in a subsequent `asm` statement. If you must use multiple `asm` statements, you will need to "export" the carry flag setting into a C variable, and "import" it into the following `asm` statement. — njuffa, Mar 26 '16 at 19:31
You are saying it is possible that another process or thread to change the flag before I use the carry-in instruction? It seems right then. — Dani Grosu, Mar 26 '16 at 20:28
Not another process or thread. Other *instructions* generated from other source code lines can be interspersed with code from the `asm()` statements. So there can be no expectation that any flag setting, including the carry flag, would survive between two separate `asm()` statements, even if those statements appear directly next to each other in the source code. With separate `asm()` statements, there is no visible data dependency through the carry flag as far as the compiler is concerned, only variables bound by `asm()` statements establish such data-dependency. — njuffa, Mar 26 '16 at 20:48

score 3 · Accepted Answer · edited May 23 '17 at 11:58

3

The only data dependencies affecting an asm() statement are those that are explicitly expressed by the variable bindings. Note that you can bind register operands, but not the condition codes. Since in this code the result of __uaddo(a, b) is immediately being overwritten, the compiler determines that it does not contribute to the observable results, is therefore "dead code" and can be eliminated. This is easily checked by examining the generated machine code (SASS) for a release build with cuobjdump --dump-sass.

If we had slightly different code that does not allow the compiler to eliminate the code for __uaddo() outright, there would still be the issue that the compiler can schedule any instructions it likes between the code generated for __uaddo() and __uaddc(), and such instructions could destroy any setting of the carry flag due to __uaddo().

As a consequence, if one plans to use the carry flag for multi-word arithmetic, both carry-generating and carry-consuming instructions must occur in the same asm() statement. A worked example can be found in this answer that shows how to add 128-bit operands. Alternatively, if two separate asm() statements must be used, one could export the carry flag setting from the earlier one into a C variable, then import it into the subsequent asm() statement from there. I can't think of many situations where this would be practical, as the performance advantage of using the carry flag is likely lost.

edited May 23 '17 at 11:58

Community

1
1

answered Mar 26 '16 at 21:13

njuffa

23,970
4
78
130

Would adding the [volatile keyword](http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#incorrect-optimization) to the asm statements help? The docs say "To ensure that the asm is not deleted or moved, you should use the volatile keyword". – Frepa Jul 29 '16 at 13:51
1

As far as I am aware, the `volatile` keyword, when used with an `asm()` statement just controls what happens with the code *inside* that `asm()` statement, it does not control what happens *in between* two separate `asm()` statements. Therefore the use of `volatile` cannot ensure the survival of a carry flag setting between two separate `asm()` statements. – njuffa Jul 29 '16 at 19:12
@njuffa, I'm trying to write tests that include operations with carry, and as a result, some of my tests return incorrect output as a result of the carry. Is there any workaround or a way to reset the carry in this case? Thanks – Saeed Masoomi Mar 25 '22 at 17:07
@SaeedMasoomi Carry will work fine inside of an `asm()` statement, so there should be no problem. If you need to transport the value of the carry flag between `asm()` statements, you can save its value in an ordinary C++ `int` variable. – njuffa Mar 25 '22 at 19:14
@njuffa, Thanks but my issue occurs when I call a statement after a statement that has carry, for example, calling 3+2 will result in 6. (the carry come from another test), So is there any way to reset the carry to zero, I don't need carry in here – Saeed Masoomi Mar 25 '22 at 19:18
@SaeedMasoomi I have no idea what you mean. You can add without taking in a previous carry by using instruction `add` instead of `addc`. If you have a question that is on-topic for Stackoverflow, consider asking it. Entering into discussion in a comment trail to an existing question/answer is not the proper way for this site. – njuffa Mar 25 '22 at 20:03

score 1 · Answer 2 · answered Mar 26 '16 at 21:10

So, as @njuffa already said, other instructions from other source code can modify the CC.CF register between the two calls and there is no guarantee for getting the expected value of the register.
As a possible solution the __add32 function can be used:

__device__ uint2 __add32 (u32 a, u32 b)
{
    uint2 res;
    asm ("add.cc.u32      %0, %2, %3;\n\t"
         "addc.u32        %1, 0, 0;\n\t"
         : "=r"(res.x), "=r"(res.y)
         : "r"(a), "r"(b));
    return res;
}

The res.y will have the possible carry and res.x the result of addition.

CUDA - PTX carry propagation

2 Answers2