The relevant documents only have described the relation of gl_GlobalInvocationID
and gl_LocalInvocationID
, such as in https://www.khronos.org/opengl/wiki/Compute_Shader .
gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID;
I guess gl_LocalInvocationID
and gl_SubgroupInvocationID
should have similar relation. As far as I know, the scheduling unit of GPU is subgroup/Warp instead of an invocation/thread. Partitioning local workgroup into subgroups should happen before scheduling. At this time, the GPU should regard all invocations of a subgroup as the same. The divergence within a subgroup should happen in real execution caused by dynamic branching. Therefore, pre-determined relation between gl_LocalInvocationID
and gl_SubgroupInvocationID
may not have any impact on the subsequent scheduling. But if without determined relation, there is so much limitation for utilizing subgroup feature.
In my opinion, there is no reason for the implementation without determined relation between these two.
I did the following test on RTX 3080Ti,
layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;
void main()
{
uint localInvocationIndex = gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
gl_LocalInvocationID.y * gl_WorkGroupSize.x +
gl_LocalInvocationID.x;
if((localInvocationIndex% gl_SubgroupSize) != gl_SubgroupInvocationID){
debugPrintfEXT("(%u, %u, %u)\n", gl_GlobalInvocationID.x, gl_LocalInvocationID.x, gl_SubgroupInvocationID);
}
}
No log information is displayed. And after I replace !=
with ==
, many log informations are displayed. The code below is the same
layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;
void main()
{
uint localInvocationIndex = gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
gl_LocalInvocationID.y * gl_WorkGroupSize.x +
gl_LocalInvocationID.x;
uint subgroupInvocationIndex = gl_SubgroupID * gl_SubgroupSize + gl_SubgroupInvocationID;
if(localInvocationIndex != subgroupInvocationIndex){
debugPrintfEXT("(%u, %u, %u)\n", gl_GlobalInvocationID.x, gl_LocalInvocationID.x, gl_SubgroupInvocationID);
}
}
Thus, in my test, the relation between gl_LocalInvocationID
and gl_SubgroupInvocationID
is
gl_SubgroupID * gl_SubgroupSize + gl_SubgroupInvocationID ==
gl_LocalInvocationIndex
And also
gl_LocalInvocationIndex % gl_SubgroupSize == gl_SubgroupInvocationID
Although there is still no official documentation explaining this, thus no guarantee for the above conclusions. However, I found some post related to CUDA. CUDA lane ID vs threadIdx.x based computation
ThreadIdx and LaneId described in the post should be equivalent to gl_LocalInvocationIndex and gl_SubgroupInvocationID.