
It's easiest to explain via cub::LaneId() or a function like the following:

inline __device__ unsigned get_lane_id() {
  unsigned ret;
  // Read the %laneid special register; "%" must be escaped as "%%" in inline PTX
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(ret));
  return ret;
}

Versus computing the lane ID as threadIdx.x & 31.

Do these 2 approaches produce the same value in a 1D grid?

The __ballot_sync() documentation speaks of lane IDs for its mask parameter, and as I understand it, the return value has one bit set per lane ID whose predicate was nonzero. So would the following asserts never fail?

int nWarps = /*...*/;
bool condition = /*...*/;
if(threadIdx.x < nWarps) {
  assert(__activemask() == ((1u<<nWarps)-1));
  uint32_t res = __ballot_sync(__activemask(), condition);
  assert(bool(res & (1<<threadIdx.x)) == condition);
}
Serge Rogatch

2 Answers


From the PTX ISA documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-laneid

%laneid A predefined, read-only special register that returns the thread's lane within the warp. The lane identifier ranges from zero to WARP_SZ-1.

This register always contains the correct value, whereas threadIdx.x & 31 assumes that the warp size is 32. For all GPU generations to date the warp size has been 32, so on both old and current architectures the two computations give identical results. There is, however, no guarantee that this will always be the case.

On your question regarding the assertions: with independent thread scheduling, there is no guarantee that all threads in a warp execute __activemask() at the same time. I think the assertions may fail.

Quoting from the programming guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling-7-x

Note that threads within a warp can diverge even within a single code path. As a result, __activemask() and __ballot(1) may return only a subset of the threads on the current code path.

talonmies
Abator Abetor
  • It might be worth noting that AMD wave size is 64. (But, since very few of us are running CUDA on AMD hardware...) – 3Dave Oct 18 '21 at 14:15

Do these 2 approaches produce the same value in a 1D grid?

Yes (while CUDA's warp size is 32). See also this question:

What's the most efficient way to calculate the warp id / lane id in a 1-D grid?

But I'd write it this way:

enum { warp_size = 32 };

// ...

inline __device__ unsigned lane_id() {
    constexpr auto lane_id_mask = warp_size - 1;
    return threadIdx.x & lane_id_mask;
}

and if you want to be extra-pedantic, you could always static-assert to ensure the warp size is a power of 2 :-P

So would the following asserts never fail?

That code looks weird. Why would you left-shift by the thread ID or the number of warps? I don't see why those asserts shouldn't fail.

einpoklum