Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing micro-architectures, and the changes between them are not always incremental (like the simple addition of instructions to an existing ISA with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. In that respect, PTX is somewhat of a parallel to SPIR, the OpenCL-standard-related representation. Continuing down the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        .reg .pred          %p<2>;
        .reg .b32           %r<8>;
        .reg .b64           %rd<5>;

        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the NVIDIA PTX Reference.
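To reproduce output like the above, PTX can be emitted directly by nvcc or extracted from an already-built binary, and the mangled kernel names piped through a demangler (file names here are placeholders):

nvcc -ptx square.cu -o square.ptx      # compile CUDA C++ to PTX
cuobjdump -ptx a.out | c++filt         # dump PTX embedded in a binary, demangled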

164 questions
51
votes
1 answer

CUDA: How to use -arch and -code and SM vs COMPUTE

I am still not sure how to properly specify the architectures for code generation when building with nvcc. I am aware that there is machine code as well as PTX code embedded in my binary and that this can be controlled via the controller switches…
bweber
  • 3,772
  • 3
  • 32
  • 57
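In short: -arch names the virtual (compute_XX) architecture the PTX is generated for, while -code names what actually gets embedded in the binary, which may be real machine code (sm_XX), PTX (compute_XX), or both. A minimal sketch, using sm_52 purely as an example:

nvcc -arch=compute_52 -code=sm_52,compute_52 square.cu
# the equivalent -gencode spelling:
nvcc -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 square.cu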
43
votes
2 answers

What is the purpose of using multiple "arch" flags in Nvidia's NVCC compiler?

I've recently gotten my head around how NVCC compiles CUDA device code for different compute architectures. From my understanding, when using NVCC's -gencode option, "arch" is the minimum compute architecture required by the programmer's…
James Paul Turner
  • 791
  • 3
  • 8
  • 23
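Multiple -gencode flags simply embed several code variants into one fat binary: the runtime picks native SASS where one of the listed sm_XX targets matches, and falls back to JIT-compiling the embedded PTX on newer GPUs. A sketch targeting two generations plus a PTX fallback:

nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_52,code=compute_52 kernel.cu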
16
votes
1 answer

Funnel shift - what is it?

When reading through the CUDA 5.0 Programming Guide I stumbled on a feature called "funnel shift", which is present in compute capability 3.5 devices but not 3.0. It contains an annotation "see reference manual", but when I search for the "funnel shift"…
CygnusX1
  • 20,968
  • 5
  • 65
  • 109
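A funnel shift concatenates two 32-bit operands into a 64-bit value and shifts a 32-bit window out of it (the PTX shf.l / shf.r instructions, compute capability 3.5 and up). CUDA exposes it through intrinsics; one classic use, sketched below, is a cheap rotate:

__device__ unsigned rotl32(unsigned x, unsigned n) {
    // Funnel-shifting x:x left by n yields a 32-bit rotate;
    // this compiles to shf.l.wrap.b32 on sm_35 and later.
    return __funnelshift_l(x, x, n);
}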
15
votes
3 answers

PTX - what is a CTA?

I'm studying PTX and I don't understand how a CTA (cooperative thread array) is different from a CUDA block. Are they the same thing? It seems to me that for now (I'm just at the beginning of the PTX document) they're just the same
Marco A.
  • 43,032
  • 26
  • 132
  • 246
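In practice, yes: CTA (cooperative thread array) is the PTX and hardware term for what the CUDA programming model calls a thread block, which is why %ctaid in the example above maps to blockIdx. A quick inline-PTX sketch confirming the correspondence:

__device__ unsigned cta_id_x() {
    unsigned id;
    asm("mov.u32 %0, %%ctaid.x;" : "=r"(id));
    return id;  // the same value as blockIdx.x
}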
14
votes
1 answer

What's the difference between PTX and CUBIN w.r.t. the NVCC compiler?

I have CUDA 4.0 installed, and a device with Compute Capability 2.0 (a GTX 460 card). What is the difference between the 'cubin' and the 'ptx' file? I think the cubin is native code for the GPU, so it is micro-architecture specific, and the ptx…
user973764
  • 141
  • 1
  • 1
  • 3
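The one-line distinction: a cubin holds SASS machine code for one specific GPU generation, while PTX is the virtual ISA that the driver can JIT-compile for hardware that did not exist when the program was built. Both can be produced explicitly (a sketch; sm_20 matches the compute capability stated in the question):

nvcc -ptx kernel.cu -o kernel.ptx                  # virtual ISA, JIT-able on later GPUs
nvcc -cubin -arch=sm_20 kernel.cu -o kernel.cubin  # native code for sm_20 only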
13
votes
2 answers

What's the most efficient way to calculate the warp id / lane id in a 1-D grid?

In CUDA, each thread knows its block index in the grid and its thread index within the block. But two important values do not seem to be explicitly available to it: its index as a lane within its warp (its "lane id"), and the index of the warp of which it…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
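For a 1-D grid the usual answers are either plain arithmetic on the built-in indices or reading the %laneid special register via inline PTX (%warpid also exists but is not guaranteed stable over a thread's lifetime, so code generally relies on the arithmetic). A sketch, assuming blockDim.x is a multiple of warpSize:

__device__ void warp_and_lane(unsigned &warp, unsigned &lane) {
    unsigned tid = threadIdx.x + blockIdx.x * blockDim.x;
    warp = tid / warpSize;            // global warp index
    lane = threadIdx.x % warpSize;    // lane within the warp
    // alternatively: asm("mov.u32 %0, %%laneid;" : "=r"(lane));
}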
13
votes
3 answers

CUDA disable L1 cache only for one variable

Is there any way on CUDA 2.0 devices to disable L1 cache only for one specific variable? I know that one can disable L1 cache at compile time adding the flag -Xptxas -dlcm=cg to nvcc for all memory operations. However, I want to disable cache only…
zeus2
  • 309
  • 2
  • 11
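Per-access control (rather than the per-compilation-unit -dlcm flag) is possible by issuing the load with the .cg cache operator through inline PTX, which on Fermi caches in L2 only. A sketch:

__device__ float load_bypassing_l1(const float *p) {
    float v;
    // ld.global.cg: cache globally (L2 only), skipping L1
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}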
10
votes
6 answers

How to compile PTX code

I need to modify the PTX code and compile it directly. The reason is that I want to have some specific instructions right after each other, and it is difficult to write CUDA code that results in my target PTX code, so I need to modify the PTX code…
user2998135
  • 111
  • 1
  • 1
  • 3
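Hand-edited PTX can be assembled with ptxas and loaded through the driver API, so it never needs to round-trip through nvcc. A sketch (file and kernel names are placeholders):

// offline first:  ptxas -arch=sm_52 kernel.ptx -o kernel.cubin
#include <cuda.h>

void load_module() {                   // assumes cuInit and a current context
    CUmodule mod;
    CUfunction fn;
    cuModuleLoad(&mod, "kernel.cubin");          // or cuModuleLoadData(&mod, ptxText)
    cuModuleGetFunction(&fn, mod, "my_kernel");
}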
10
votes
3 answers

How to output C/C++ annotated PTX in CUDA 4.1/4.2/5.0

Does anybody know how to get PTX assembler annotated with C/C++ code with the new LLVM back-end? I could easily get it with CUDA 4.0 or earlier, but NVCC rejects all my flags after upgrading the CUDA toolkit to version 4.2.
Andrey Kamaev
  • 29,582
  • 6
  • 94
  • 88
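With the LLVM-based toolchain the interleaving has to be requested explicitly, and --source-in-ptx only takes effect together with line information. A sketch of the invocation:

nvcc -ptx -lineinfo --source-in-ptx kernel.cu -o kernel.ptx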
9
votes
1 answer

Can I prefetch specific data to a specific cache level in a CUDA kernel?

I understand that Fermi GPUs support prefetching to L1 or L2 cache. However, in the CUDA reference manual I cannot find anything about it. Does CUDA allow my kernel code to prefetch specific data to a specific level of cache?
dalibocai
  • 2,289
  • 5
  • 29
  • 45
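CUDA C of that era has no prefetch intrinsic, but PTX does provide prefetch instructions that inline assembly can emit (sm_20 and later). A sketch:

__device__ void prefetch_to_l1(const void *p) {
    // prefetch.global.L1 / prefetch.global.L2 hint the target cache level
    asm volatile("prefetch.global.L1 [%0];" :: "l"(p));
}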
8
votes
1 answer

Confusion with CUDA PTX code and register memory

While I was trying to manage my kernel resources I decided to look into PTX, but there are a couple of things that I do not understand. Here is a very simple kernel I wrote: __global__ void foo(float* out, float* in, uint32_t n) { uint32_t idx…
AstrOne
  • 3,569
  • 7
  • 32
  • 54
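For resource questions like this one, keep in mind that PTX uses an unbounded set of virtual registers; the physical register count is only decided by ptxas, and is easier to read from its statistics than from the PTX itself. A sketch:

nvcc -Xptxas -v kernel.cu    # reports register, shared, constant and spill usage per kernel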
8
votes
1 answer

CUDA device stack and synchronization; SSY instruction

Edit: this question is a re-done version of the original, so the first several responses may no longer be relevant. I'm curious about what impact a device function call with forced no-inlining has on synchronization within a device function. I have…
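For experiments like this one, inlining can be suppressed per function so the call (and its effect on synchronization and SSY/reconvergence handling) survives into the generated code. A sketch:

__device__ __noinline__ int helper(int x) {
    return x * 2;   // kept as a real call rather than being inlined away
}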
7
votes
1 answer

Should I look into PTX to optimize my kernel? If so, how?

Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further? One example: I read that one can find out from the PTX code whether automatic loop unrolling worked. If this is not the case, one would have to unroll the…
Framester
  • 33,341
  • 51
  • 130
  • 192
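Loop unrolling is one of the easiest things to verify in PTX: an unrolled loop appears as straight-line repeated bodies instead of a backward branch. It can also be forced rather than left to the compiler (a sketch):

__global__ void scale(float *a) {
    #pragma unroll 4                 // shows up in the PTX as four copies of the body
    for (int i = 0; i < 4; ++i)
        a[threadIdx.x * 4 + i] *= 2.0f;
}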
7
votes
2 answers

Is it possible to put assembly instructions into CUDA code?

I want to use assembly code in CUDA C code in order to reduce expensive operations, as we do using asm in C programming. Is this possible?
superscalar
  • 635
  • 2
  • 9
  • 15
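Yes, via the asm() statement, with the caveat that what you write is PTX (the virtual ISA), not native machine code. A minimal sketch:

__device__ int add_via_ptx(int x, int y) {
    int z;
    asm("add.s32 %0, %1, %2;" : "=r"(z) : "r"(x), "r"(y));
    return z;
}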
7
votes
1 answer

How to generate, compile and run CUDA kernels at runtime

Well, I have quite a delicate question :) Let's start with what I have: data, a large array, copied to the GPU; and a program, generated by the CPU (host), which needs to be evaluated for every element in that array. The program changes very frequently, can be…
teejay
  • 2,353
  • 2
  • 27
  • 36
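In current toolkits this workflow is served by NVRTC (runtime compilation of CUDA C++ source to PTX) combined with the driver API, which JIT-compiles the PTX and launches it; when the question was asked, the usual route was emitting PTX text oneself and loading it with cuModuleLoadData. A hedged sketch of the NVRTC path (error handling omitted; the kernel name is a placeholder):

#include <nvrtc.h>
#include <cuda.h>
#include <string>

void build_and_load(const char *src) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "generated.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);       // CUDA C++ -> PTX
    size_t n;
    nvrtcGetPTXSize(prog, &n);
    std::string ptx(n, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);

    CUmodule mod;
    CUfunction fn;
    cuModuleLoadData(&mod, ptx.c_str());         // driver JITs the PTX to SASS
    cuModuleGetFunction(&fn, mod, "my_kernel");
}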