
PTX has a prmt instruction with many variants. This question regards the default one, which, if formatted as a C/C++ function, would look like this:

uint32_t prmt(uint32_t a, uint32_t b, uint32_t byte_selectors);

and this is what it does (adapted from the official docs):

In the generic form (no mode specified), byte_selectors consists of four 4-bit selection values. The bytes in the two source parameters a and b are numbered from 0 to 7: {b, a} = {{b7, b6, b5, b4}, {b3, b2, b1, b0}}. For each byte in the function's output, a 4-bit selection value is defined.

The 3 lsbs of the selection value specify which of the 8 source bytes should be moved into the target position. The msb determines whether the byte value is copied as-is, or whether the sign (the msb of that byte) is replicated over all 8 bits of the target position (sign-extension of the byte value): msb=0 means copy the literal value; msb=1 means replicate the sign.
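
To make the semantics concrete, here is a plain-C model of the above (my own sketch of the documented behavior; prmt_generic is just an illustrative name, not a real API):

#include <stdint.h>

uint32_t prmt_generic (uint32_t a, uint32_t b, uint32_t byte_selectors)
{
    uint8_t src[8];  // source bytes 0..7 = {b, a}
    for (int i = 0; i < 4; i++) {
        src[i]     = (uint8_t)(a >> (8 * i));
        src[i + 4] = (uint8_t)(b >> (8 * i));
    }
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t sel = (byte_selectors >> (4 * i)) & 0xf;
        uint8_t byte = src[sel & 7];           // 3 lsbs pick one of 8 source bytes
        if (sel & 8)                           // msb set: replicate the byte's sign
            byte = (byte & 0x80) ? 0xff : 0x00;
        r |= (uint32_t)byte << (8 * i);
    }
    return r;
}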

My question: When is this kind of operation useful? What kind of computation can make use of it?

einpoklum
  • [this](https://devtalk.nvidia.com/default/topic/1070921/cuda-programming-and-performance/mov-confusion/post/5425781/#5425781) may be of interest – Robert Crovella Feb 17 '20 at 14:43
  • @RobertCrovella: But the extraction is something you can achieve with non-default modes of PRMT, IIANM. The default mode has this weird sign-replication option. I'm sure there's some sort of use case motivating these specific semantics. Maybe something to do with graphics? – einpoklum Feb 17 '20 at 15:16
  • Does this answer your question? [Reading from an unaligned uint8_t recast as a uint32_t array - not getting all values](https://stackoverflow.com/questions/40194012/reading-from-an-unaligned-uint8-t-recast-as-a-uint32-t-array-not-getting-all-v) – tera Feb 17 '20 at 18:45
  • Actually I meant a different post, but I can't immediately find it: if I remember correctly, I've posted some example code for atomic operations on bytes a couple of years ago, probably on the Nvidia forums and not here. I used the default prmt mode with byte selectors in there. I've also occasionally used it in other places. – tera Feb 17 '20 at 18:55
  • @njuffa: Surprisingly, `__byte_perm()` doesn't use the default mode :-( – einpoklum Feb 17 '20 at 19:24
  • @njuffa: That's what I meant. That is, in this question. Permuting bytes is pretty obviously useful in some cases, it's this "sometimes copy, sometimes replicate sign" that I was wondering about. – einpoklum Feb 17 '20 at 19:40
  • @njuffa: Oh, and by the way... "inline PTX assembly must be used" <- or [this](https://github.com/eyalroz/cuda-kat/blob/development/src/kat/on_device/ptx/miscellany.cuh#L152). I'm working on unit-testing that stuff right now, which is how I noticed the `__byte_perm()` behavior. – einpoklum Feb 17 '20 at 19:44

1 Answer

The PTX instruction prmt exposes the functionality of the machine instruction PRMT. The default mode of the prmt instruction is used when none of the special modes .f4e, .b4e, .rc8, .ecl, .ecr, .rc16 is specified.

The default mode has two per-byte sub-modes, controlled by the most significant bit of the 4-bit selector field for each of the four destination bytes. The commonly used sub-mode has the msb of the selector field set to zero, which means the destination byte is copied verbatim from the specified source byte. This sub-mode is exposed via the device function intrinsic __byte_perm(), and is generally used to extract, insert, and permute bytes, or to perform bit shifts by multiples of 8. Example usage can be seen in this answer.
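
For illustration, here are two typical device-side uses of __byte_perm() (the helper names are mine; the selector values follow the semantics described in the question):

__device__ uint32_t bswap32 (uint32_t x)
{
    // dst bytes 0,1,2,3 <- src bytes 3,2,1,0: reverses the byte order
    return __byte_perm (x, 0, 0x0123);
}

__device__ uint32_t extract_middle (uint32_t lo, uint32_t hi)
{
    // dst bytes <- src bytes 2,3,4,5: same as ((uint64_t)hi << 32 | lo) >> 16
    return __byte_perm (lo, hi, 0x5432);
}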

The other sub-mode is special: instead of copying the entire source byte, it replicates the most significant bit (the sign bit) of the specified source byte across all eight bits of the destination byte. To select this sub-mode, the msb of the selector field must be set to one. There is no intrinsic for it, so programmers must use PTX inline assembly to access this functionality.
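
As a minimal illustration (msb_to_byte_mask is a made-up wrapper name; the full example below embeds the same idea in masked_sign_to_byte_mask()), selector 0xba98 sets the msb in all four selector fields while picking bytes 3, 2, 1, 0 of a:

__device__ uint32_t msb_to_byte_mask (uint32_t a)
{
    uint32_t r;
    // selector fields 0xb,0xa,0x9,0x8: replicate the sign of source bytes 3,2,1,0
    asm ("prmt.b32 %0,%1,0,0xba98;" : "=r"(r) : "r"(a));
    return r; // e.g. 0x807f0100 -> 0xff000000
}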

I did not design the GPU hardware, so cannot speak to why that sub-mode was implemented. It is generally useful when the msb of each byte serves as a boolean value that one needs to convert into a mask for the whole byte. This in turn is generally useful for byte-wise processing within a 32-bit register. Note that CUDA includes a good many device function intrinsics for such processing, and disassembly will confirm that the msb replication sub-mode of the prmt default mode is used for many of those.

A fully worked example, an emulation of the x86 paddsb operation (byte-wise addition with signed saturation), is shown below. Note the use of prmt with msb replication inside masked_sign_to_byte_mask().

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#if (__CUDACC__)
#define __HOST__ __host__
#define __DEVICE__ __device__
#else // __CUDACC__
#define __HOST__
#define __DEVICE__
#endif // __CUDACC__

#define MSB_MASK (0x80808080U)  // mask for msb of each byte

// r = (a ^ b) & ~c
__HOST__ __DEVICE__ uint32_t lop3_14 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0x14;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = (a ^ b) & ~c;
#endif // __CUDA_ARCH__
    return r;
}

// r = (a ^ b) & c
__HOST__ __DEVICE__ uint32_t lop3_28 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0x28;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = (a ^ b) & c;
#endif // __CUDA_ARCH__
    return r;
}

// r = a ^ (~b & c)
__HOST__ __DEVICE__ uint32_t lop3_d2 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0xd2;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = a ^ (~b & c);
#endif // __CUDA_ARCH__ 
    return r;
}

// r = (a & c) | (b & ~c)
__HOST__ __DEVICE__ uint32_t lop3_f4 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0xf4;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = (a & c) | (b & ~c);
#endif // __CUDA_ARCH__
    return r;
} 

__HOST__ __DEVICE__ uint32_t masked_sign_to_byte_mask (uint32_t a)
{
#if (__CUDA_ARCH__ >= 200)
    asm ("prmt.b32 %0,%0,0,0xba98;" : "+r"(a)); // convert MSBs to masks
#else
    a = a & MSB_MASK;
    a = a + a - (a >> 7); // extend MSBs to full byte to create mask
#endif
    return a;
}

__HOST__ __DEVICE__ uint32_t masked_select (uint32_t a, uint32_t b, uint32_t m)
{
#if (__CUDA_ARCH__ >= 500) 
    return lop3_f4 (a, b, m);
#elif 0
    return (((a)&(m))|((b)&(~(m))));
#else
    return((((a)^(b))&(m))^(b));
#endif
}

/* 
   my_paddsb() performs byte-wise addition with signed saturation. In the 
   case of overflow, positive results are clamped at 127, while negative 
   results are clamped at -128.
*/
__HOST__ __DEVICE__ uint32_t my_paddsb (uint32_t a, uint32_t b)
{
    uint32_t sum, res, ofl, sga, msk;
    res = (a & ~MSB_MASK) + (b & ~MSB_MASK);
    sum = a ^ b;
    ofl = lop3_14 (res, a, sum); // ofl = (res ^ a) & ~sum
    sga = masked_sign_to_byte_mask (a);  // sign(a)-mask
    msk = masked_sign_to_byte_mask (ofl);// overflow-mask
    res = lop3_d2 (res, ~MSB_MASK, sum); // res = res ^ (MSB_MASK & sum)
    sga = lop3_28 (sga, ~MSB_MASK, msk); // sga = (sga ^ ~MSB_MASK) & msk
    res = masked_select (sga, res, msk); // res = (sga & msk) | (res & ~msk)
    return res;
}

__global__ void kernel (uint32_t a, uint32_t b)
{
    printf ("GPU: %08x\n", my_paddsb (a, b));
}

int main (void)
{
    uint32_t a = 0x12ef70a0;
    uint32_t b = 0x34cd6090;
    kernel<<<1,1>>>(a, b);
    cudaDeviceSynchronize();
    printf ("CPU: %08x\n", my_paddsb (a, b));
    return EXIT_SUCCESS;
}
njuffa
  • That looks so baroque... is this really faster than the straightforward version? 8-| – einpoklum Feb 17 '20 at 21:12
  • @einpoklum Give it a try, time it and let us know which version is faster. From memory, on a Pascal-class GPU, `my_paddsb` is twice as fast as a functionally equivalent straightforward solution, but then you may be thinking of a straightforward solution that is different from the one I tried. Simply calling `__vaddss4()` is of course also straightforward for device code, but I was trying to have completely portable code for host and device. – njuffa Feb 17 '20 at 21:20