Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.
Nvidia's GPUs have differing microarchitectures, and the changes between them are not always incremental (unlike, say, the addition of instructions to x86-64 by successive AVX extensions). All of them, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation; in that respect, PTX is roughly a parallel to SPIR-V, the representation associated with the OpenCL standard. Continuing down the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.
Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:
__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}
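The index computation in the kernel can be illustrated outside of CUDA. The following Python snippet (a hedged sketch, not part of any CUDA toolchain) mimics how each thread derives its global element index from its thread and block coordinates, and why the bounds check is needed when the grid has more threads than elements:

```python
def square_kernel(array, block_dim, grid_dim):
    """Simulate the CUDA grid: every (block, thread) pair runs the kernel body once."""
    length = len(array)
    for block_idx in range(grid_dim):              # blockIdx.x
        for thread_idx in range(block_dim):        # threadIdx.x
            # pos = threadIdx.x + blockIdx.x * blockDim.x
            pos = thread_idx + block_idx * block_dim
            if pos < length:                       # guard: surplus threads do nothing
                array[pos] = array[pos] * array[pos]
    return array

# 10 elements, blocks of 4 threads -> 3 blocks; the last two threads are idle
print(square_kernel(list(range(10)), block_dim=4, grid_dim=3))
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

On a GPU the iterations run concurrently rather than in a loop, but the per-thread arithmetic and the guard are exactly those in the kernel above.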
Resulting PTX (after name demangling):
.visible .entry square(int*, int)(
    .param .u64 square(int*, int)_param_0,
    .param .u32 square(int*, int)_param_1
)
{
    ld.param.u64 %rd1, [square(int*, int)_param_0];
    ld.param.u32 %r2, [square(int*, int)_param_1];
    mov.u32 %r3, %tid.x;
    mov.u32 %r4, %ntid.x;
    mov.u32 %r5, %ctaid.x;
    mad.lo.s32 %r1, %r4, %r5, %r3;
    setp.ge.s32 %p1, %r1, %r2;
    @%p1 bra BB0_2;
    cvta.to.global.u64 %rd2, %rd1;
    mul.wide.s32 %rd3, %r1, 4;
    add.s64 %rd4, %rd2, %rd3;
    ld.global.u32 %r6, [%rd4];
    mul.lo.s32 %r7, %r6, %r6;
    st.global.u32 [%rd4], %r7;
BB0_2:
    ret;
}
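The body above computes the thread's index with a single multiply-add (`mad.lo.s32`), tests it against the length (`setp.ge.s32`, with the predicated `@%p1 bra` skipping the store for out-of-range threads), and widens the 32-bit index into a 64-bit byte offset (`mul.wide.s32` by 4, the size of an `int`). As a hedged illustration, that arithmetic can be replayed in Python (the helper names and the 32-bit mask are ours, not PTX-defined):

```python
MASK32 = 0xFFFFFFFF

def mad_lo_s32(a, b, c):
    # mad.lo.s32: multiply, keep the low 32 bits of the product, then add
    return ((a * b) + c) & MASK32

def mul_wide_s32(a, b):
    # mul.wide.s32: 32-bit operands, full (wide) product; here index * sizeof(int)
    return a * b

# Replaying the address computation for thread 3 of block 2, blockDim.x = 4,
# with a 16-element array:
r1 = mad_lo_s32(4, 2, 3)      # %r1 = %ntid.x * %ctaid.x + %tid.x = 11
p1 = r1 >= 16                 # setp.ge.s32: predicate false -> the store executes
offset = mul_wide_s32(r1, 4)  # byte offset of array[11] from the base pointer
print(r1, p1, offset)         # 11 False 44
```

The masking in `mad_lo_s32` matters only when the product overflows 32 bits; for indices that fit comfortably in range, as here, it is the identity.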
For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the Nvidia PTX ISA Reference.