Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing micro-architectures, and the changes between them are not always incremental (like the simple addition of instructions to an existing ISA with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. In that respect, PTX is somewhat of a parallel to SPIR, the OpenCL-standard-related representation. Continuing down the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        .reg .pred          %p<2>;
        .reg .b32           %r<8>;
        .reg .b64           %rd<5>;

        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the NVIDIA PTX Reference.
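To reproduce output like the above, PTX can be emitted directly by nvcc or extracted from an already-built binary, and the mangled kernel names piped through a demangler (file names here are placeholders):

nvcc -ptx square.cu -o square.ptx      # compile CUDA C++ to PTX
cuobjdump -ptx a.out | c++filt         # dump PTX embedded in a binary, demangled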

164 questions
51
votes
1 answer

CUDA: How to use -arch and -code and SM vs COMPUTE

I am still not sure how to properly specify the architectures for code generation when building with nvcc. I am aware that there is machine code as well as PTX code embedded in my binary and that this can be controlled via the controller switches…
bweber
  • 3,772
  • 3
  • 32
  • 57
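In short: -arch names the virtual (compute_XX) architecture the PTX is generated for, while -code names what actually gets embedded in the binary, which may be real machine code (sm_XX), PTX (compute_XX), or both. A minimal sketch, using sm_52 purely as an example:

nvcc -arch=compute_52 -code=sm_52,compute_52 square.cu
# the equivalent -gencode spelling:
nvcc -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 square.cu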
43
votes
2 answers

What is the purpose of using multiple "arch" flags in Nvidia's NVCC compiler?

I've recently gotten my head around how NVCC compiles CUDA device code for different compute architectures. From my understanding, when using NVCC's -gencode option, "arch" is the minimum compute architecture required by the programmer's…
James Paul Turner
  • 791
  • 3
  • 8
  • 23
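Multiple -gencode flags simply embed several code variants into one fat binary: the runtime picks native SASS where one of the listed sm_XX targets matches, and falls back to JIT-compiling the embedded PTX on newer GPUs. A sketch targeting two generations plus a PTX fallback:

nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_52,code=compute_52 kernel.cu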
16
votes
1 answer

Funnel shift - what is it?

When reading through the CUDA 5.0 Programming Guide I stumbled on a feature called "funnel shift", which is present in compute capability 3.5 devices but not 3.0. It contains an annotation "see reference manual", but when I search for the "funnel shift"…
CygnusX1
  • 20,968
  • 5
  • 65
  • 109
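A funnel shift concatenates two 32-bit operands into a 64-bit value and shifts a 32-bit window out of it (the PTX shf.l / shf.r instructions, compute capability 3.5 and up). CUDA exposes it through intrinsics; one classic use, sketched below, is a cheap rotate:

__device__ unsigned rotl32(unsigned x, unsigned n) {
    // Funnel-shifting x:x left by n yields a 32-bit rotate;
    // this compiles to shf.l.wrap.b32 on sm_35 and later.
    return __funnelshift_l(x, x, n);
}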
15
votes
3 answers

PTX - what is a CTA?

I'm studying PTX and I don't understand how a CTA (cooperative thread array) is different from a CUDA block. Are they the same thing? It seems to me that for now (I'm just at the beginning of the PTX document) they're just the same
Marco A.
  • 43,032
  • 26
  • 132
  • 246
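In practice, yes: CTA (cooperative thread array) is the PTX and hardware term for what the CUDA programming model calls a thread block, which is why %ctaid in the example above maps to blockIdx. A quick inline-PTX sketch confirming the correspondence:

__device__ unsigned cta_id_x() {
    unsigned id;
    asm("mov.u32 %0, %%ctaid.x;" : "=r"(id));
    return id;  // the same value as blockIdx.x
}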
14
votes
1 answer

What's the difference between PTX and CUBIN w.r.t. the NVCC compiler?

I have CUDA 4.0 installed, and a device with Compute Capability 2.0 (a GTX 460 card). What is the difference between the 'cubin' and the 'ptx' file? I think the cubin is native code for the GPU, so it is micro-architecture specific, and the ptx…
user973764
  • 141
  • 1
  • 1
  • 3
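The one-line distinction: a cubin holds SASS machine code for one specific GPU generation, while PTX is the virtual ISA that the driver can JIT-compile for hardware that did not exist when the program was built. Both can be produced explicitly (a sketch; sm_20 matches the compute capability stated in the question):

nvcc -ptx kernel.cu -o kernel.ptx                  # virtual ISA, JIT-able on later GPUs
nvcc -cubin -arch=sm_20 kernel.cu -o kernel.cubin  # native code for sm_20 only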
13
votes
2 answers

What's the most efficient way to calculate the warp id / lane id in a 1-D grid?

In CUDA, each thread knows its block index in the grid and its thread index within the block. But two important values do not seem to be explicitly available to it: its index as a lane within its warp (its "lane id"), and the index of the warp of which it…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
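For a 1-D grid the usual answers are either plain arithmetic on the built-in indices or reading the %laneid special register via inline PTX (%warpid also exists but is not guaranteed stable over a thread's lifetime, so code generally relies on the arithmetic). A sketch, assuming blockDim.x is a multiple of warpSize:

__device__ void warp_and_lane(unsigned &warp, unsigned &lane) {
    unsigned tid = threadIdx.x + blockIdx.x * blockDim.x;
    warp = tid / warpSize;            // global warp index
    lane = threadIdx.x % warpSize;    // lane within the warp
    // alternatively: asm("mov.u32 %0, %%laneid;" : "=r"(lane));
}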
13
votes
3 answers

CUDA disable L1 cache only for one variable

Is there any way on CUDA 2.0 devices to disable L1 cache only for one specific variable? I know that one can disable L1 cache at compile time adding the flag -Xptxas -dlcm=cg to nvcc for all memory operations. However, I want to disable cache only…
zeus2
  • 309
  • 2
  • 11
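Per-access control (rather than the per-compilation-unit -dlcm flag) is possible by issuing the load with the .cg cache operator through inline PTX, which on Fermi caches in L2 only. A sketch:

__device__ float load_bypassing_l1(const float *p) {
    float v;
    // ld.global.cg: cache globally (L2 only), skipping L1
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}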
10
votes
6 answers

How to compile PTX code

I need to modify the PTX code and compile it directly. The reason is that I want to have some specific instructions right after each other, and it is difficult to write CUDA code that results in my target PTX code, so I need to modify the PTX code…
user2998135
  • 111
  • 1
  • 1
  • 3
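Hand-edited PTX can be assembled with ptxas and loaded through the driver API, so it never needs to round-trip through nvcc. A sketch (file and kernel names are placeholders):

// offline first:  ptxas -arch=sm_52 kernel.ptx -o kernel.cubin
#include <cuda.h>

void load_module() {                   // assumes cuInit and a current context
    CUmodule mod;
    CUfunction fn;
    cuModuleLoad(&mod, "kernel.cubin");          // or cuModuleLoadData(&mod, ptxText)
    cuModuleGetFunction(&fn, mod, "my_kernel");
}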
10
votes
3 answers

How to output C/C++ annotated PTX in CUDA 4.1/4.2/5.0

Does anybody know how to get PTX assembler annotated with C/C++ code with the new LLVM back-end? I could easily get it with CUDA 4.0 or earlier, but NVCC rejects all my flags after upgrading the CUDA toolkit to version 4.2.
Andrey Kamaev
  • 29,582
  • 6
  • 94
  • 88
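With the LLVM-based toolchain the interleaving has to be requested explicitly, and --source-in-ptx only takes effect together with line information. A sketch of the invocation:

nvcc -ptx -lineinfo --source-in-ptx kernel.cu -o kernel.ptx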
9
votes
1 answer

Can I prefetch specific data to a specific cache level in a CUDA kernel?

I understand that Fermi GPUs support prefetching to L1 or L2 cache. However, in the CUDA reference manual I cannot find anything about it. Does CUDA allow my kernel code to prefetch specific data to a specific level of cache?
dalibocai
  • 2,289
  • 5
  • 29
  • 45
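CUDA C of that era has no prefetch intrinsic, but PTX does provide prefetch instructions that inline assembly can emit (sm_20 and later). A sketch:

__device__ void prefetch_to_l1(const void *p) {
    // prefetch.global.L1 / prefetch.global.L2 hint the target cache level
    asm volatile("prefetch.global.L1 [%0];" :: "l"(p));
}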
8
votes
1 answer

Confusion with CUDA PTX code and register memory

While I was trying to manage my kernel resources I decided to look into PTX, but there are a couple of things that I do not understand. Here is a very simple kernel I wrote: __global__ void foo(float* out, float* in, uint32_t n) { uint32_t idx…
AstrOne
  • 3,569
  • 7
  • 32
  • 54
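For resource questions like this one, keep in mind that PTX uses an unbounded set of virtual registers; the physical register count is only decided by ptxas, and is easier to read from its statistics than from the PTX itself. A sketch:

nvcc -Xptxas -v kernel.cu    # reports register, shared, constant and spill usage per kernel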
8
votes
1 answer

CUDA device stack and synchronization; SSY instruction

Edit: this question is a re-done version of the original, so the first several responses may no longer be relevant. I'm curious about what impact a device function call with forced no-inlining has on synchronization within a device function. I have…
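For experiments like this one, inlining can be suppressed per function so the call (and its effect on synchronization and SSY/reconvergence handling) survives into the generated code. A sketch:

__device__ __noinline__ int helper(int x) {
    return x * 2;   // kept as a real call rather than being inlined away
}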
7
votes
1 answer

Should I look into PTX to optimize my kernel? If so, how?

Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further? One example: I read that one can find out from the PTX code whether automatic loop unrolling worked. If this is not the case, one would have to unroll the…
Framester
  • 33,341
  • 51
  • 130
  • 192
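Loop unrolling is one of the easiest things to verify in PTX: an unrolled loop appears as straight-line repeated bodies instead of a backward branch. It can also be forced rather than left to the compiler (a sketch):

__global__ void scale(float *a) {
    #pragma unroll 4                 // shows up in the PTX as four copies of the body
    for (int i = 0; i < 4; ++i)
        a[threadIdx.x * 4 + i] *= 2.0f;
}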
7
votes
2 answers

Is it possible to put assembly instructions into CUDA code?

I want to use assembly code in CUDA C code in order to reduce expensive operations, as we do using asm in C programming. Is this possible?
superscalar
  • 635
  • 2
  • 9
  • 15
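Yes, via the asm() statement, with the caveat that what you write is PTX (the virtual ISA), not native machine code. A minimal sketch:

__device__ int add_via_ptx(int x, int y) {
    int z;
    asm("add.s32 %0, %1, %2;" : "=r"(z) : "r"(x), "r"(y));
    return z;
}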
7
votes
1 answer

How to generate, compile and run CUDA kernels at runtime

Well, I have quite a delicate question :) Let's start with what I have: data, a large array, copied to the GPU; and a program, generated by the CPU (host), which needs to be evaluated for every element in that array. The program changes very frequently, can be…
teejay
  • 2,353
  • 2
  • 27
  • 36
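In current toolkits this workflow is served by NVRTC (runtime compilation of CUDA C++ source to PTX) combined with the driver API, which JIT-compiles the PTX and launches it; when the question was asked, the usual route was emitting PTX text oneself and loading it with cuModuleLoadData. A hedged sketch of the NVRTC path (error handling omitted; the kernel name is a placeholder):

#include <nvrtc.h>
#include <cuda.h>
#include <string>

void build_and_load(const char *src) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "generated.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);       // CUDA C++ -> PTX
    size_t n;
    nvrtcGetPTXSize(prog, &n);
    std::string ptx(n, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);

    CUmodule mod;
    CUfunction fn;
    cuModuleLoadData(&mod, ptx.c_str());         // driver JITs the PTX to SASS
    cuModuleGetFunction(&fn, mod, "my_kernel");
}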