How to write LOP3 based instructions for Maxwell and up NVIDIA Architecture?

Question

The Maxwell architecture introduced a new PTX instruction called LOP3 which, according to the NVIDIA blog:

"Can save instructions when performing complex logic operations on multiple inputs."

At GTC 2016, some CUDA developers managed to accelerate the atan2f function for the Tegra X1 processor (Maxwell) with such instructions.

However, the function below, defined within a .cu file, fails to compile because __SET_LT and __LOP3_0xe2 are undefined.

Do I have to define them in a .ptx file instead? If so, how?

float atan2f(const float dy, const float dx) 
{
 float flag, z = 0.0f;
 __SET_LT(flag, fabsf(dy), fabsf(dx));

 uint32_t m, t1 = 0x80000000; 
 float t2 = float(M_PI) / 2.0f;

 __LOP3_0x2e(m, __float_as_int(dx), t1, __float_as_int(t2));
 float w = flag * __int_as_float(m) + float(M_PI)/2.0f; 

 float Offset = copysignf(w, dy);
 float t = fminf(fabsf(dx), fabsf(dy)) / fmaxf(fabsf(dx), fabsf(dy));

 uint32_t r, b = __float_as_int(flag) << 2;
 uint32_t mask = __float_as_int(dx) ^ __float_as_int(dy) ^ (~b);
 __LOP3_0xe2(r, mask, t1, __float_as_int(t));

 const float p = fabsf(__int_as_float(r)) - 1.0f;
 return ((-0.0663f*(-p) + 0.311f) * (-p) + float(float(M_PI)/4.0)) * (*(float *)&r) + Offset;
}

Edit:

The final macro definitions are:

#define __SET_LT(D, A, B) asm("set.lt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __SET_GT(D, A, B) asm("set.gt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __LOP3_0x2e(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0x2e;" : "=r"(D) : "r"(A), "r"(B), "r"(C))
#define __LOP3_0xe2(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0xe2;" : "=r"(D) : "r"(A), "r"(B), "r"(C))

Not sure where you came up with `__SET_LT` and `__LOP3_0xe2`. If you grabbed those from some online source I don't think you've identified it in your question. Anyway the most direct way to invoke a specific PTX instruction from CUDA C/C++ source code would be to use [inline PTX](http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#using-inline-ptx-assembly-in-cuda). — Robert Crovella, May 10 '16 at 22:22
http://on-demand.gputechconf.com/gtc/2016/presentation/s6108-max-lv-pedestrian-detection-tegra-x1.pdf — , May 10 '16 at 22:30
Thank you for the link, but I cannot find any examples with LOP3 — , May 10 '16 at 22:33
So are you asking a general question (how could I craft inline PTX to use a `lop3.b32` PTX instruction?), or are you asking the specific question of what exactly you need to do to make `__LOP3_0xe2` and the other functions work for the specific example that accelerates `atan2f`? — Robert Crovella, May 11 '16 at 00:52
Atan2f is only a use case. I just would like to know how to convert standard instructions to lop3 in general. I could then update my question with working code after your answer. — , May 11 '16 at 09:06

Robert Crovella · Accepted Answer · 2020-01-18T19:39:25.213

The lop3.b32 PTX instruction can perform a more-or-less arbitrary boolean (logical) operation on 3 variables A, B, and C.

In order to set the actual operation to be performed, we must provide a "lookup-table" immediate argument (immLut, an 8-bit quantity). As indicated in the documentation, a method to compute the necessary immLut argument for a given operation F(A,B,C) is to substitute the values 0xF0 for A, 0xCC for B, and 0xAA for C in the actual desired equation. For example, suppose we want to compute:

F = (A || B) && (!C)   ((A or B) and (not-C))

Then we would compute the immLut argument as:

immLut = (0xF0 | 0xCC) & (~0xAA)

Note that the specified equation for F is a boolean equation, treating the arguments A, B, and C as boolean values, and producing a true/false result (F). However, the equation to compute immLut is a bitwise logical operation.

For the above example, immLut would have a computed value of 0x54.
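The substitution can also be left to the compiler rather than done by hand. The following is a minimal host-side sketch (plain C++ that should also compile inside a .cu file); the names `make_imm_lut` and `AorBAndNotC` are illustrative assumptions, not part of any CUDA API.

```cpp
#include <cstdint>

// Hypothetical helper: evaluates a bitwise expression on the three
// constants 0xF0 (A), 0xCC (B), 0xAA (C) and keeps the low 8 bits.
template <typename F>
constexpr std::uint8_t make_imm_lut(F f) {
    return static_cast<std::uint8_t>(f(0xF0u, 0xCCu, 0xAAu));
}

// The example operation F = (A or B) and (not C), written bitwise.
struct AorBAndNotC {
    constexpr unsigned operator()(unsigned a, unsigned b, unsigned c) const {
        return (a | b) & ~c;
    }
};

// Compile-time check that the table matches the hand-computed value.
constexpr std::uint8_t kLutAorBAndNotC = make_imm_lut(AorBAndNotC{});
static_assert(kLutAorBAndNotC == 0x54, "immLut for (A|B)&~C should be 0x54");
```

If the expression is changed, the `static_assert` documents the expected constant and fails the build when the two drift apart.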

If it's desired to use a PTX instruction in ordinary CUDA C/C++ code, probably the most common (and arguably easiest) method would be to use inline PTX. Inline PTX is documented, and there are other questions discussing how to use it (such as this one), so I'll not repeat that here.

Here is a worked example of the above example case. Note that this particular PTX instruction is only available on compute capability 5.0 (cc5.0) and higher architectures, so be sure to compile for at least that level of target.

$ cat t1149.cu
#include <stdio.h>

const unsigned char A_or_B_and_notC=((0xF0|0xCC)&(~0xAA));

__device__ int my_LOP_0x54(int A, int B, int C){
  int temp;
  asm("lop3.b32 %0, %1, %2, %3, 0x54;" : "=r"(temp) : "r"(A), "r"(B), "r"(C));
  return temp;
}

__global__ void testkernel(){

  printf("A=true, B=false, C=true,   F=%d\n", my_LOP_0x54(true, false, true));
  printf("A=true, B=false, C=false,  F=%d\n", my_LOP_0x54(true, false, false));
  printf("A=false, B=false, C=false, F=%d\n", my_LOP_0x54(false, false, false));
}


int main(){

  printf("0x%x\n", A_or_B_and_notC);
  testkernel<<<1,1>>>();
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_50 -o t1149 t1149.cu
$ ./t1149
0x54
A=true, B=false, C=true,   F=0
A=true, B=false, C=false,  F=1
A=false, B=false, C=false, F=0
$

Since immLut is an immediate constant in PTX code, I know of no way using inline PTX to pass this as a function parameter, even if templating is used. Based on your provided link, it seems that the authors of that presentation also used a separately defined function for each specific immediate value (presumably 0xE2 and 0x2E in their case). Also, note that I have chosen to write my function so that it returns the result of the operation as the function return value. The authors of the presentation you linked appear to pass the result back via a function parameter. Either method should be workable. (In fact, it appears they have written their __LOP3... helpers as function-like macros rather than ordinary functions.)
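The two constants 0xE2 and 0x2E can be decoded with the same substitution rule. The sketch below is one reading of their truth tables, not something taken from the slides: 0xE2 matches (A & B) | (~B & C), a bitwise select that takes bits from A where B is set and from C elsewhere, and 0x2E matches (~A & B) | (~B & C), the same select with A inverted. The function names are hypothetical, and the code is host-side C++ that models the bit pattern only.

```cpp
#include <cstdint>

// Substituting 0xF0, 0xCC, 0xAA for A, B, C reproduces both constants.
static_assert((((0xF0u & 0xCCu) | (~0xCCu & 0xAAu)) & 0xFFu) == 0xE2u,
              "0xE2 == (A & B) | (~B & C)");
static_assert((((~0xF0u & 0xCCu) | (~0xCCu & 0xAAu)) & 0xFFu) == 0x2Eu,
              "0x2E == (~A & B) | (~B & C)");

// Host model of immLut 0xE2: bits of a where b is 1, bits of c where b is 0.
inline std::uint32_t select_0xe2(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return (a & b) | (~b & c);
}

// Host model of immLut 0x2E: inverted bits of a where b is 1, bits of c elsewhere.
inline std::uint32_t select_0x2e(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return (~a & b) | (~b & c);
}
```

With B = 0x80000000, as in the question's code, the first form copies the sign bit of A onto the magnitude bits of C, and the second copies the flipped sign bit of A.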

Also see here for a method of understanding how the 8-bit truth table (immLut) works for LOP3 at the source code level.
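As a rough illustration of the table lookup, here is a host-side reference model in plain C++. It assumes the bit ordering implied by the 0xF0/0xCC/0xAA construction: for each of the 32 bit positions, the bits of A, B and C form a 3-bit index (A is the most significant) that selects one bit of immLut. The name `lop3_reference` is an assumption for this sketch, not an NVIDIA API, and the function does not execute the actual instruction.

```cpp
#include <cstdint>

// Host-side model of lop3.b32: per-bit lookup into the 8-entry truth table.
inline std::uint32_t lop3_reference(std::uint32_t a, std::uint32_t b,
                                    std::uint32_t c, std::uint8_t imm_lut) {
    std::uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        // Build the table index from the current bit of each operand.
        const unsigned index = (((a >> bit) & 1u) << 2) |
                               (((b >> bit) & 1u) << 1) |
                               ((c >> bit) & 1u);
        // Copy the selected table entry into the same bit of the result.
        result |= static_cast<std::uint32_t>((imm_lut >> index) & 1u) << bit;
    }
    return result;
}
```

Feeding the three constants back in, `lop3_reference(0xF0, 0xCC, 0xAA, lut)` returns `lut` itself for any 8-bit table whose bit 0 is clear (such as 0x54), which is why substituting those values into the desired expression yields the table.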
