
In CUDA, each thread knows its block index in the grid and thread index within the block. But two important values do not seem to be explicitly available to it:

  • Its index as a lane within its warp (its "lane id")
  • The index, within the block, of the warp to which it belongs (its "warp id")

Assuming the block is 1-dimensional (a.k.a. linear, i.e. blockDim.y and blockDim.z are both 1), one can obviously obtain these as follows:

enum : unsigned { warp_size = 32 };
auto lane_id = threadIdx.x % warp_size;
auto warp_id = threadIdx.x / warp_size;

and if you don't trust the compiler to optimize that, you could rewrite it as:

enum : unsigned { warp_size = 32, log_warp_size = 5 };
auto lane_id = threadIdx.x & (warp_size - 1);
auto warp_id = threadIdx.x >> log_warp_size;
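For completeness - if one needs a warp index that is unique across the whole (1-D) grid rather than within the block, the same arithmetic extends naturally (a sketch; the helper name is mine, and it assumes blockDim.x is a multiple of warp_size):

enum : unsigned { warp_size = 32 };

__device__ unsigned global_warp_id()
{
    // linear global thread index, divided into consecutive groups of 32
    return (blockIdx.x * blockDim.x + threadIdx.x) / warp_size;
}

The lane id is unaffected by scope, since warps are always carved out of consecutive threads within a single block.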

Is that the most efficient way to obtain them, though? It still seems wasteful for every thread to have to compute this.

(inspired by this question.)

einpoklum
  • The CUDA Programming Guide uses `threadIdx.x & 0x1f` a few times in its examples to get the lane id, which is equivalent to `threadIdx.x % 32` – Azmisov Feb 08 '22 at 06:45

2 Answers


The naive computation is currently the most efficient.

Note: This answer has been heavily edited.

It is very tempting to try and avoid the computation altogether - as these two values seem to already be available if you look under the hood.

You see, nVIDIA GPUs have special registers which your (compiled) code can read to access various kinds of useful information. One such register holds threadIdx.x; another holds blockDim.x; another - the clock tick count; and so on. C++ as a language does not have these exposed, obviously; and, in fact, neither does CUDA. However, the intermediate representation into which CUDA code is compiled, named PTX, does expose these special registers (since PTX 1.3, i.e. with CUDA versions >= 2.1).
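Just to illustrate the mechanism before getting to the two registers we care about: here is how one of those special registers - %clock, the cycle counter - can be read (a sketch; in practice you would just use the clock() intrinsic for this one). Note that a literal '%' has to be doubled inside the inline-PTX string:

__forceinline__ __device__ unsigned clock_via_ptx()
{
    unsigned ret;
    // reads the predefined, read-only %clock special register,
    // a per-SM cycle counter (lower 32 bits)
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(ret));
    return ret;
}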

Two of these special registers are %warpid and %laneid. Now, CUDA supports inlining PTX code within CUDA code with the asm keyword - just like it can be used for host-side code to emit CPU assembly instructions directly. With this mechanism one can use these special registers:

__forceinline__ __device__ unsigned lane_id()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%laneid;" : "=r"(ret));
    return ret;
}

__forceinline__ __device__ unsigned warp_id()
{
    // caution: this is *not* equal to threadIdx.x / 32 - see below
    unsigned ret;
    asm volatile ("mov.u32 %0, %%warpid;" : "=r"(ret));
    return ret;
}

... but there are two problems here.

The first problem - as @Patwie suggests - is that %warpid does not give you what you actually want: it's not the index of the warp within its thread block, but rather the warp's slot on the physical SM it happens to be running on (an SM can hold many resident warps at a time), and those two are not the same. So don't use %warpid.

As for %laneid, it does give you the correct value, but it will almost surely hurt your performance: even though it's a "register", it's not like the regular registers in your register file, with 1-cycle access latency. It's a special register, which the actual hardware retrieves using an S2R instruction, which can exhibit long latency. Since you almost certainly already have the value of threadIdx.x in a regular register, it is faster to apply a bitmask to that value than to retrieve %laneid.
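If you want to verify the latency difference on your own hardware, a rough micro-benchmark along these lines can be used (a sketch of mine, not a definitive measurement - the compiler may still batch or reorder the S2R instructions, so inspect the generated SASS before trusting the numbers):

#include <cstdio>

__global__ void compare_lane_id_costs()
{
    unsigned sink = 0;

    long long t0 = clock64();
    #pragma unroll 1
    for (int i = 0; i < 1024; i++) {
        unsigned lid;
        asm volatile ("mov.u32 %0, %%laneid;" : "=r"(lid)); // an S2R per iteration
        sink += lid;
    }
    long long t1 = clock64();

    #pragma unroll 1
    for (int i = 0; i < 1024; i++) {
        unsigned lid = threadIdx.x & 31u;  // plain integer ALU work
        asm volatile ("" : "+r"(lid));     // discourage hoisting it out of the loop
        sink += lid;
    }
    long long t2 = clock64();

    if (threadIdx.x == 0) {
        printf("S2R loop: %lld cycles, bitmask loop: %lld cycles (sink: %u)\n",
               t1 - t0, t2 - t1, sink);
    }
}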


Bottom line: Just compute the warp ID and lane ID from the thread ID. We can't get around this - for now.
einpoklum
  • Cool trick! Is that supported in all CUDA versions and architectures? – dari Jun 02 '17 at 21:02
  • 2
    In the link you provided it is stated: " PTX ISA Notes: Introduced in PTX ISA version 1.3." and "Target ISA Notes: Supported on all target architectures." And from the [release notes](http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release-notes) you will get that PTX 1.3 was introduced with CUDA 2.1. – BlameTheBits Jun 02 '17 at 21:13
  • @Shadow: Thanks for that. – einpoklum Jun 02 '17 at 21:22
  • 2
    Have you benchmarked this? When I last tried this (on a compute capability 6.1 device IIRC) it turned out to be _slower_ than using `threadIdx.x >> 5` and `threadIdx.x & 31`, but I didn't investigate any further. It probably depends at least on the amount of register pressure the kernel is facing. – tera Jun 02 '17 at 21:55
  • @tera: Frankly, it has not occurred to me to benchmark this; how could it possibly be slower? Anyway, if you have the code you used for benchmarking, do post a link to it. – einpoklum Jun 02 '17 at 21:57
  • I've not benchmarked it by itself either - I've just inserted basically your code (but without the `volatile`) into code I've been working on, but that I can't post. – tera Jun 02 '17 at 22:01
  • 1
    I believe that accessing the special registers has some non-negligible penalty associated. I remember that (back in the days of CUDA 2.x and compute capability 1.3, so this might be rather dated info) the compiler used to cache special register contents rather heavily, which made me wary of using them without benchmarking the effect. – tera Jun 02 '17 at 22:06
  • @tera: Respecting your input, I've added a caveat to my answer. But, really, it boggles the mind to think that nVIDIA would provide a special register which is better not used at all. Also see my edit re what I see in the PTX. – einpoklum Jun 02 '17 at 22:13
  • 3
    An Nvidia employee has made some [interesting comments](https://devtalk.nvidia.com/default/topic/1011523/cuda-programming-and-performance/how-costly-is-the-s2r-instruction-reading-a-special-register-/post/5165296/#5165296) over at the Nvidia forum. – tera Jul 25 '17 at 19:28
  • "As for %lane_id, it does give you the correct value, but it's misleadingly non-performant: Even though it's a "register", it's not like the regular registers in your register file, with 1-cycle access latency." This is incredibly misleading. Computing laneid yourself requires using threadDim.x. Look at the disassembly for that -- it resolves to an S2R instruction, with SR_TID.X. Computing laneid will require up to three of these calls (for each dimension). – Benjie Nov 09 '21 at 21:35
  • @Benjie: Any non-trivial kernel already uses threadDim.x, i.e. already pays the price for getting it. But - I'll edit to clarify this point. – einpoklum Nov 09 '21 at 22:10

The other answer is very dangerous! Compute the lane-id and warp-id yourself.

#include <cuda.h>
#include <cstdio> // for device-side printf

inline __device__ unsigned get_lane_id() {
  unsigned ret;
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(ret));
  return ret;
}

inline __device__ unsigned get_warp_id() {
  unsigned ret;
  asm volatile("mov.u32 %0, %%warpid;" : "=r"(ret));
  return ret;
}

__global__ void kernel() {
  const int actual_warpid = get_warp_id();
  const int actual_laneid = get_lane_id();
  const int expected_warpid = threadIdx.x / 32;
  const int expected_laneid = threadIdx.x % 32;
  if (expected_laneid == 0) { // only lane 0 prints, which is why the lane output below is all zeros
    printf("[warp:] actual: %i  expected: %i\n", actual_warpid,
           expected_warpid);
    printf("[lane:] actual: %i  expected: %i\n", actual_laneid,
           expected_laneid);
  }
}

int main(int argc, char const *argv[]) {
  dim3 grid(8, 7, 1);
  dim3 block(4 * 32, 1);

  kernel<<<grid, block>>>();
  cudaDeviceSynchronize();
  return 0;
}

which gives something like

[warp:] actual: 4  expected: 3
[warp:] actual: 10  expected: 0
[warp:] actual: 1  expected: 1
[warp:] actual: 12  expected: 1
[warp:] actual: 4  expected: 3
[warp:] actual: 0  expected: 0
[warp:] actual: 13  expected: 2
[warp:] actual: 12  expected: 1
[warp:] actual: 6  expected: 1
[warp:] actual: 6  expected: 1
[warp:] actual: 13  expected: 2
[warp:] actual: 10  expected: 0
[warp:] actual: 1  expected: 1
...
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0

See also the PTX docs on %warpid:

A predefined, read-only special register that returns the thread's warp identifier. The warp identifier provides a unique warp number within a CTA but not across CTAs within a grid. The warp identifier will be the same for all threads within a single warp.

Note that %warpid is volatile and returns the location of a thread at the moment when read, but its value may change during execution, e.g., due to rescheduling of threads following preemption.

Hence, %warpid is the hardware scheduler's warp slot, with no guarantee that it matches the virtual warp id (which counts up from 0 within the block).

The docs make this clear:

For this reason, %ctaid and %tid should be used to compute a virtual warp index if such a value is needed in kernel code; %warpid is intended mainly to enable profiling and diagnostic code to sample and log information such as work place mapping and load distribution.
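A sketch of that recommended computation, generalized to a block of up to three dimensions (the function names here are mine):

enum : unsigned { warp_size = 32 };

// Linearize the intra-block thread index, then derive the "virtual"
// warp and lane indices from it, as the PTX docs advise:
__device__ unsigned linear_thread_id_in_block()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

__device__ unsigned virtual_warp_id() { return linear_thread_id_in_block() / warp_size; }
__device__ unsigned virtual_lane_id() { return linear_thread_id_in_block() % warp_size; }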

If you think "OK, then let's just use CUB for this": this even affects cub::WarpId(), whose documentation says:

Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block.
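For example, a kernel along these lines (a sketch, assuming CUB's cub/util_ptx.cuh header is available) should exhibit the same kind of mismatch as the kernel above:

#include <cub/util_ptx.cuh>
#include <cstdio>

__global__ void cub_warp_id_demo()
{
    // cub::WarpId() reads %warpid under the hood, so it carries the same
    // caveat: unique per warp, but not a zero-based rank within the block.
    if (cub::LaneId() == 0) {
        printf("cub::WarpId(): %u  threadIdx.x / 32: %u\n",
               cub::WarpId(), threadIdx.x / 32);
    }
}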

EDIT: Using %laneid seems to be safe.

Patwie
  • I would like to delete my answer, based on previous comments and your answer. So, would you mind editing yours to be a warning against using the `%warpid` register? Also, do you have any qualms regarding the use of `%laneid`? – einpoklum Oct 11 '18 at 08:47
  • Feel free to edit my answer. For the `%laneid` I edited the code. I was just doing bug hunting in my code taking your answer as the ultimate wisdom and trying to find the bug in my own lines. – Patwie Oct 11 '18 at 09:41
  • Your code is now wrong: 1. You're assigning %laneid to lane_id. 2. You're only checking the lane id with lane 0. – einpoklum Oct 11 '18 at 09:44
  • I changed the code. It seems that the lane_id is not affected. – Patwie Oct 11 '18 at 09:45
  • Patwie: So lane id is fine; but you're only printing it from lane 0. – einpoklum Oct 11 '18 at 09:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/181665/discussion-between-patwie-and-einpoklum). – Patwie Oct 11 '18 at 10:37