1

I have read the subgroup wiki descibing gl_SubgroupInvocationID and the compute shader wiki describing gl_LocalInvocationID.

  • gl_LocalInvocationID means the shader invocation ID in a local workgroup
  • gl_SubgroupInvocationID means the shader invocation ID in a subgroup

Is there exist determined relation between gl_LocalInvocationID and gl_SubgroupInvocationID?

Suppose subgroup size is 16, given a local workgroup with 64 invocation. Is the following statement true?

The invocations with gl_LocalInvocationID range [0, 15], [16, 31], [32, 47] or [48, 63] have the gl_SubgroupInvocationID range [0, 15]. In other word, is it true that gl_LocalInvocationID % 16 == gl_SubgroupInvocationID?

Yakov Galka
  • 70,775
  • 16
  • 139
  • 220
wangsy
  • 159
  • 11

3 Answers3

4

The numerical relationship between subgroup IDs and local IDs is unspecified. Neither Vulkan nor OpenGL defines any such thing, so implementations are allowed to do more or less whatever they want.

Nicol Bolas
  • 449,505
  • 63
  • 781
  • 982
  • Yes, I didn't find any information about the numerical relation between local ID and subgroup ID neither. However, after a little test, I found the relation between local ID nad subgroup ID is determined. I will post my finding in the following answer. – wangsy Jun 05 '22 at 09:03
  • I am observing a very strange (for me at least) behavior, which looks like not only arrangement of subgroup indices can change at runtime, but subgroup invocations can even spread multiple workgroups despite workgroup being twice larger than a subgroup. – YaaZ Oct 28 '22 at 18:18
  • @YaaZ: What do you mean by "spread multiple workgroups"? – Nicol Bolas Oct 28 '22 at 18:23
  • I use Nvidia Quadro GPU and it looks like driver uses some heuristics, trying different patterns each time application starts. It checks around 4-5 different patterns and then stops at simple row major order. Interesting thing is that in some of these patterns, gl_WorkGroupID is different for different invocations of the same subgroup (that's what I meant by subgroup spreading multiple workgroups). And one more detail is that I see this with 4x4x4 workgroup, if I use 16x4x1, it tries less patterns and when using 64x1x1, gl_SubgroupInvocationID is always sequential with gl_LocalInvocationID.x – YaaZ Oct 28 '22 at 18:30
  • I guess spread is a bad wording, I meant it can *span* multiple workgroups. – YaaZ Oct 28 '22 at 18:35
  • @YaaZ: "*gl_WorkGroupID is different for different invocations of the same subgroup*" So, how do you define "same subgroup"? `gl_SubgroupID` is the ID of a subgroup *within* a workgroup. So invocations from different work groups almost certainly will have the same `gl_SubgroupID`s. That doesn't mean that they are in the "same subgroup"; they don't share subgroup data. – Nicol Bolas Oct 28 '22 at 18:51
  • @NicolBolas well I meant "subgroup" as what all `subgroup*` functions work on. And one of the tests that I did shown `subgroupAllEqual(gl_WorkGroupID)` returning false. For a few patterns only, of course, but still weird. – YaaZ Oct 28 '22 at 19:04
  • Ok trying different patterns sounds logic to me. For multidimensional workgroups it can help improving data locality depending on how memory access is organized in shader. But mixing subgroup invocations between workgroups is crazy, maybe I misinterpreted something, maybe not. BTW is there any clear statement that all subgroup invocations must be in a single workgroup? I see that subgroup is a bunch of invocations running on a single compute unit, yes we have `gl_SubgroupID`, but what stops us from splitting subgroup invocations between workgroups in which of these it will have the same id? – YaaZ Oct 28 '22 at 20:46
  • 1
    @YaaZ: From the standard, "For shaders that have defined workgroups, each invocation in a subgroup must be in the same local workgroup." – Nicol Bolas Oct 28 '22 at 21:08
  • Well then Nvidia violates the spec. I just tried the debug printf, there's a GLSL snippet and the output: https://pastebin.com/tUzzys1W Summary: two halves of the subgroup invocations ended up in different workgroups. – YaaZ Nov 06 '22 at 00:37
0

It is not specified and varies between hardware vendors. There are no deqp tests for driver certification that check it.

My experiments shows that NVidia tends to have row-major order and Intel has column-major.

I'm also not aware of any restriction on width and height of such subgroup (in terms of localInvocation indexes). Let's say you have a 16 size subgroup, you usually get 4x4 square. I guess that implementation is allowed to also give you a 8x2 subgroup.

Dorian
  • 377
  • 5
  • 18
-2

The relevant documents only have described the relation of gl_GlobalInvocationID and gl_LocalInvocationID, such as in https://www.khronos.org/opengl/wiki/Compute_Shader .

gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID;

I guess gl_LocalInvocationID and gl_SubgroupInvocationID should have similar relation. As far as I know, the scheduling unit of GPU is subgroup/Warp instead of an invocation/thread. Partitioning local workgroup into subgroups should happen before scheduling. At this time, the GPU should regard all invocations of a subgroup as the same. The divergence within a subgroup should happen in real execution caused by dynamic branching. Therefore, pre-determined relation between gl_LocalInvocationID and gl_SubgroupInvocationID may not have any impact on the subsequent scheduling. But if without determined relation, there is so much limitation for utilizing subgroup feature. In my opinion, there is no reason for the implementation without determined relation between these two.

I did the following test on RTX 3080Ti,

layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

void main()
{
    uint localInvocationIndex = gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y + 
        gl_LocalInvocationID.y * gl_WorkGroupSize.x + 
        gl_LocalInvocationID.x;
    if((localInvocationIndex% gl_SubgroupSize) != gl_SubgroupInvocationID){
        debugPrintfEXT("(%u, %u, %u)\n", gl_GlobalInvocationID.x, gl_LocalInvocationID.x, gl_SubgroupInvocationID);
    }
}

No log information is displayed. And after I replace != with ==, many log informations are displayed. The code below is the same

layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

void main()
{
    uint localInvocationIndex = gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y + 
        gl_LocalInvocationID.y * gl_WorkGroupSize.x + 
        gl_LocalInvocationID.x;
    uint subgroupInvocationIndex = gl_SubgroupID * gl_SubgroupSize + gl_SubgroupInvocationID;
    if(localInvocationIndex != subgroupInvocationIndex){
        debugPrintfEXT("(%u, %u, %u)\n", gl_GlobalInvocationID.x, gl_LocalInvocationID.x, gl_SubgroupInvocationID);
    }
}

Thus, in my test, the relation between gl_LocalInvocationID and gl_SubgroupInvocationID is

gl_SubgroupID * gl_SubgroupSize + gl_SubgroupInvocationID ==
gl_LocalInvocationIndex 

And also

gl_LocalInvocationIndex % gl_SubgroupSize == gl_SubgroupInvocationID 

Although there is still no official documentation explaining this, thus no guarantee for the above conclusions. However, I found some post related to CUDA. CUDA lane ID vs threadIdx.x based computation ThreadIdx and LaneId described in the post should be equivalent to gl_LocalInvocationIndex and gl_SubgroupInvocationID.

wangsy
  • 159
  • 11
  • 1
    "*I think gl_LocalInvocationID and gl_SubgroupInvocationID should have similar relation.*" What "should" be true is irrelevant; what matters is what is *guaranteed* to be true. And this post does not cite any authority about what the guarantees are. All you're doing is showing what some implementation happens to do in your specific case. And you don't even say *which* GPU you're using. – Nicol Bolas Jun 05 '22 at 13:17
  • @NicolBolas I have made some changes to notify others all conclusions are my guesses, and remove the anwser marker – wangsy Jun 06 '22 at 02:18