When we write inline PTX assembly in our regular C/C++ CUDA code, e.g.:

__device__ __inline__ uint32_t bfind(uint32_t val)
{
    uint32_t ret;
    asm ("bfind.u32 %0, %1;" : "=r"(ret): "r"(val));
    return ret;
}

we can add the volatile keyword after asm, e.g.:

__device__ __inline__ uint32_t bfind(uint32_t val)
{
    uint32_t ret;
    asm volatile ("bfind.u32 %0, %1;" : "=r"(ret): "r"(val));
    return ret;
}

The CUDA documentation on inline PTX assembly says:

The compiler assumes that an asm() statement has no side effects except to change the output operands. To ensure that the asm is not deleted or moved during generation of PTX, you should use the volatile keyword

I don't understand what that means. So,

  • Why would my asm() be deleted? Or rather, if the compiler notices it has no effect, why should I mind it being deleted?
  • Why is it a problem if my asm() is moved during the generation of PTX? That's part of the optimization process, isn't it?
  • How would one characterize the compiler's behavior more exactly when facing non-volatile and volatile asm() instructions respectively?
einpoklum

1 Answer


Why would my asm() be deleted? Or rather, if the compiler notices it has no effect, why should I mind it being deleted?

If the compiler detects that your inline PTX doesn't change any state beyond thread-local scope, it is free to delete it as an optimization. Generally speaking, that is exactly what you want to happen. But sometimes it isn't: your intentions and the compiler's optimization strategy won't always align in the way you want or expect. Caveat emptor and all that.
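As a hedged illustration (the function name and the deliberately unused result are purely for demonstration), the non-volatile statement below may be dropped entirely, while the volatile one should survive into the emitted PTX:

#include <cstdint>

__device__ void bfind_demo(uint32_t val)
{
    uint32_t ret;

    // Result is never used: the compiler may delete this asm() outright.
    asm ("bfind.u32 %0, %1;" : "=r"(ret) : "r"(val));

    // volatile: the instruction should be kept in the generated PTX,
    // even though ret is still unused.
    asm volatile ("bfind.u32 %0, %1;" : "=r"(ret) : "r"(val));
}

Compiling with nvcc --ptx and looking for the bfind instructions in the output is a simple way to see the difference for yourself.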

Why is it a problem if my asm() is moved during the generation of PTX? That's part of the optimization process, isn't it?

It is not a problem, and it is part of the optimization process; but sometimes you might want to circumvent it. Imagine you are crafting micro-benchmarks and the compiler decides to reorder the carefully designed sequence of instructions you coded in inline PTX (the classic case is instructions being moved to the wrong place in the emitted code, so that timing sections or carefully designed memory transaction patterns get broken). The results wouldn't be what you intended. I would imagine that could be pretty frustrating.
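A minimal micro-benchmark sketch along those lines, assuming a hypothetical timed_bfind() wrapper and using the %clock special register, might look like this; the volatile qualifiers are what keep the two clock reads pinned around the instruction being measured:

#include <cstdint>

__device__ uint32_t timed_bfind(uint32_t val, uint32_t *result)
{
    uint32_t start, stop, ret;

    // Read the SM clock before and after the measured instruction.
    // Without volatile, the compiler could hoist, sink or merge these
    // reads, and the measurement would no longer bracket the bfind.
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(start));
    asm volatile ("bfind.u32 %0, %1;" : "=r"(ret) : "r"(val));
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(stop));

    *result = ret;
    return stop - start;   // elapsed clock cycles
}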

How would one characterize the compiler's behavior more exactly when facing non-volatile and volatile asm() instructions respectively?

As with standard CUDA kernel code, volatile ensures that the compiler emits the given inline PTX operation in its output, rather than leaving it exposed to being optimized away or reordered by code analysis.

talonmies