When writing PTX in a separate file, a kernel parameter can be loaded into a register with:
.reg .u32 test;
ld.param.u32 test, [test_param];
However, for inline PTX, the application note "Using Inline PTX Assembly in CUDA" (version 01) describes a syntax in which loading a parameter is tied to another operation. It gives this example:
asm("add.s32 %0, %1, %2;" : "=r"(i) : "r"(j), "r"(k));
This generates:
ld.s32 r1, [j];
ld.s32 r2, [k];
add.s32 r3, r1, r2;
st.s32 [i], r3;
In many cases, the load and the operation need to be separated. For instance, one might want to load the parameter into a register before a loop and then reuse and modify that register inside the loop. The only way I have found to do this is an extra mov instruction that copies the parameter from the register into which it was implicitly loaded to another register I can use later.
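To make the workaround concrete, here is a minimal sketch of what I mean. The kernel name `accumulate`, its parameters, and the loop body are made up for illustration; only the two inline `asm` statements matter. The first one is the extra mov, and the second one reuses and modifies the same register through a read-write ("+r") operand.

```cuda
#include <cstdio>

// Hypothetical kernel, for illustration only: copies the parameter into a
// register once, then reuses and modifies that register inside a loop.
__global__ void accumulate(int start, int iters, int *out)
{
    int acc;

    // The extra mov: "start" is bound to an input register (%1) by the
    // compiler, and is then copied into the register backing "acc" (%0).
    asm("mov.s32 %0, %1;" : "=r"(acc) : "r"(start));

    for (int n = 0; n < iters; ++n) {
        // "+r" marks acc as both read and written, so the same value is
        // carried from one iteration to the next.
        asm("add.s32 %0, %0, %1;" : "+r"(acc) : "r"(n));
    }

    *out = acc;
}

int main()
{
    int *d_out = nullptr;
    int h_out = 0;

    cudaMalloc(&d_out, sizeof(int));
    accumulate<<<1, 1>>>(10, 4, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d\n", h_out);  // 10 + (0 + 1 + 2 + 3) = 16

    cudaFree(d_out);
    return 0;
}
```

Compiling with nvcc and its -ptx option shows the PTX actually emitted for the kernel, which is where the mov in question can be inspected.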
Is there a way to avoid this additional mov instruction when moving from PTX in a separate file to inline PTX?