Confused about thread_position_in_grid

Question

I'm working on a compute shader in Metal on macOS. I'm trying to do some very basic things to learn how they work. I'm seeing some output I don't understand. I thought I would start by trying to generate a simple 2D gradient. The red channel would increase from 0 to 1 along the width and the green channel would increase from 0 to 1 along the height. So I wrote this kernel:

kernel void myKernel(texture2d<half, access::write> outTexture [[ texture(MBKT_OutputTexture) ]],
                     uint2  gid  [[thread_position_in_grid]])
{
    half4  color = half4((float)gid.x / 480.0, (float)gid.y / 360.0, 0.0, 1.0);

    outTexture.write(color, gid);
}

And what I get is an increase from 0 to 0.5 at the halfway point, and a solid 0.5 for the rest of the image.

If I invert the 2 values so the kernel calculates this:

half4  color = half4(1.0 - (float)gid.x / 480.0, 1.0 - (float)gid.y / 360.0, 0.0, 1.0);

the results are even stranger. I would expect it to be 1.0 on the left and bottom and go down to 0.5 in the middle, but that is not what I get.

What is going on here? In the first case, it's like everything past the mid point has a value of 0.5. In the second case it's like the left/bottom edge is 0.5 and the middle is 1.0, then flips back to 0.0 one pixel later.

Oddly, if I use the thread_position_in_grid to pull values out of buffers, it works correctly. For example, I can compute a Mandelbrot set and the results are correct. But I'm confused by what happens with the simple kernel above. Can anyone explain this to me?

Here's my compute kernel setup code in MTKViewDelegate. This is based on the "Hello Compute" sample code from Apple:

    _metalView = metalView;
    _device = metalView.device;
    _commandQueue = [_device newCommandQueue];

    _metalView.colorPixelFormat = MTLPixelFormatBGRA8Unorm_sRGB;

    // Load all the shader files with a .metal file extension in the project
    id<MTLLibrary> defaultLibrary = [_device newDefaultLibrary];

    // Load the kernel function from the library
    id<MTLFunction> kernelFunction = [defaultLibrary newFunctionWithName:@"myKernel"];

    // Create a compute pipeline state
    NSError*    error   = nil;
    _computePipelineState = [_device newComputePipelineStateWithFunction:kernelFunction
                                                                   error:&error];

    if(!_computePipelineState)
    {
        NSLog(@"Failed to create compute pipeline state, error %@", error);
        return nil;
    }

And here's the code where I create the output texture and the thread groups:

MTLTextureDescriptor*   outputTextureDescriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm_sRGB
                                                                                                     width:_viewportSize.x
                                                                                                    height:_viewportSize.y
                                                                                                 mipmapped:NO];
_outputTexture = [_device newTextureWithDescriptor:outputTextureDescriptor];

// Set the compute kernel's threadgroup size of 16x16
_threadgroupSize = MTLSizeMake(16, 16, 1);

// Calculate the number of rows and columns of threadgroups given the width of the input image
// Ensure that you cover the entire image (or more) so you process every pixel
_threadgroupCount.width  = (_viewportSize.x + _threadgroupSize.width - 1) / _threadgroupSize.width;
_threadgroupCount.height = (_viewportSize.y + _threadgroupSize.height - 1) / _threadgroupSize.height;

// Since we're only dealing with a 2D data set, set depth to 1
_threadgroupCount.depth = 1;

In my tests, the _viewportSize is 480 x 360.
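
To make the arithmetic above concrete, here is a minimal CPU-side sketch in C++ (not Metal code) that works through the round-up formula for a 480 x 360 viewport with 16 x 16 threadgroups. The function names are hypothetical and exist only for illustration. It suggests that the dispatch launches 368 rows of threads for a 360-row texture, so a bounds check in the kernel is probably worth having.

```cpp
#include <cassert>
#include <cstdint>

// Round-up integer division: the same formula the setup code uses to
// fill in _threadgroupCount.width and _threadgroupCount.height.
inline std::uint32_t threadgroupsNeeded(std::uint32_t pixels, std::uint32_t groupSize) {
    assert(groupSize > 0);  // a zero-sized threadgroup would divide by zero
    return (pixels + groupSize - 1) / groupSize;
}

// Threads launched along one axis when whole threadgroups are dispatched.
inline std::uint32_t threadsLaunched(std::uint32_t pixels, std::uint32_t groupSize) {
    return threadgroupsNeeded(pixels, groupSize) * groupSize;
}

// Hypothetical guard: true only when (x, y) addresses a texel inside the texture.
inline bool insideTexture(std::uint32_t x, std::uint32_t y,
                          std::uint32_t width, std::uint32_t height) {
    return x < width && y < height;
}
```

With these numbers, threadgroupsNeeded(360, 16) is 23 and threadsLaunched(360, 16) is 368, so gid.y can reach 367 while the last valid row is 359. In the kernel, the equivalent guard would be an early return when gid.x or gid.y is not less than the texture's width or height.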

I've done an additional test suggested by @Egor_Shkorov in the comments. Instead of hard-coding 480 and 360, I used the threads_per_grid variable:

kernel void myKernel(
                             texture2d<half, access::write> outTexture [[ texture(MBKT_OutputTexture) ]],
                             uint2  gid  [[thread_position_in_grid]],
                             uint2 tpg [[threads_per_grid]])
{

    half4  color = half4((float)gid.x / tpg.x, (float)gid.y / tpg.y, 0.0, 1.0);

    outTexture.write(color, gid);
}

That improves things, making the gradient stretch all the way in each direction, but it still only goes from 0 to 0.5 instead of to 1 in each direction.

You need to show the app code which dispatches this compute shader, especially the thread group size and thread (group) counts. Also, how is the texture created. (And show the declaration of `outTexture` in your compute shader. Always show real code, since edited code is likely not representative of your actual issue.) — Ken Thomases, Apr 24 '19 at 04:21
OK, I've added them above. Let me know if anything is missing. — user1118321, Apr 24 '19 at 05:00
I would suggest using `threads_per_grid` instead of hard-coded values and then check if output is the same. — JustSomeGuy, Apr 24 '19 at 08:44
Interesting! That causes the various tiles to line up correctly so I get a smooth gradient from left to right and top to bottom, but instead of getting a gradient from 0 to 1 in each direction, it's still a gradient from 0 to 0.5. Why? — user1118321, Apr 24 '19 at 15:47
What if you do `half4 color = half4((float)gid.x / (float)outTexture.get_width(), (float)gid.y / (float)outTexture.get_height(), 0.0, 1.0);`. Also, you should check that `gid.x` and `gid.y` are never larger than the width/height of the output texture, otherwise you end up writing outside the texture memory and bad things will happen. (Note that 360 is not an integer multiple of 16.) — Matthijs Hollemans, Apr 24 '19 at 16:25
That does not change anything, unfortunately. What's interesting, though, is if I run the program on a non-Retina machine, I see the results I expect. Clearly this has something to do with the scale factor, but when the view is set to 480 x 360 in Interface Builder, I actually allocate a 960x720 texture (on the Retina machine), so I don't see why using the width and height of the texture doesn't work. — user1118321, Apr 24 '19 at 17:22
Are you sure you're displaying the texture properly? Perhaps it's being filled properly by the compute shader and you're just showing a corner of it. Perhaps check in Xcode's GPU frame debugger. — Ken Thomases, Apr 24 '19 at 18:25
Oh, good point! However, I can't get the GPU frame debugger to work. I've set the menus in the scheme, but the buttons are never available (always grayed out). I'll see if I can determine that some other way. — user1118321, Apr 24 '19 at 18:27
One thing I'm curious about is why the origin appears in the lower left of your images (and mine) but the Metal docs don't show this placement for the origin in any of the coordinate systems. I would expect the black part of the image to be in the upper left. — johnbakers, Dec 28 '19 at 13:28

score 0 · Answer 1 · answered Feb 07 '21 at 13:06

A very similar thing happened to me. The values of thread_position_in_grid seemed to be limited to a small range instead of spanning the whole grid (maybe only the size of a threadgroup). In short, it's probably because you are calling

_commandEncoder.dispatchThreads(threadGroupCount, threadsPerThreadgroup: threadGroupSize)

instead of

_commandEncoder.dispatchThreadgroups(threadGroupCount, threadsPerThreadgroup: threadGroupSize)

I noticed that the thread_position_in_grid attribute yields different values under these two functions. I'm not sure whether this is intended behavior, since I can't find a relevant description in the documentation, and I would have expected this attribute to refer to the position in the whole grid. Additionally, Metal decides the number of threadgroups when you use dispatchThreads() and can create non-uniform threadgroups, which may have something to do with the issue.

From the documentation for dispatchThreads(_:threadsPerThreadgroup:):

Use this method only if the device supports non-uniform threadgroup sizes. See Metal Feature Set Tables. This method encodes a dispatch call that specifies an arbitrary number of threads in a grid (threadsPerGrid). Metal calculates the number of threadgroups needed, providing partial threadgroups if necessary. When the compute command is encoded, any necessary references to parameters or resources previously set on the encoder are recorded as part of the command. After encoding a command, you can safely change the encoding state to set up parameters needed to encode other commands.
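
The difference between the two calls can be sketched on the CPU. The following C++ is an illustration only, not Metal API code: Size is a stand-in for MTLSize and the function names are hypothetical. It assumes, following the description quoted above, that dispatchThreads treats its first argument as the total number of threads in the grid (threadsPerGrid), while dispatchThreadgroups treats it as a count of threadgroups.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for MTLSize (assumption: only the three extents matter here).
struct Size {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t depth;
};

// dispatchThreadgroups: the first argument counts threadgroups, so the grid
// extent is (threadgroup count) * (threads per threadgroup) on each axis.
inline Size gridForDispatchThreadgroups(Size threadgroupsPerGrid, Size threadsPerThreadgroup) {
    return Size{threadgroupsPerGrid.width * threadsPerThreadgroup.width,
                threadgroupsPerGrid.height * threadsPerThreadgroup.height,
                threadgroupsPerGrid.depth * threadsPerThreadgroup.depth};
}

// dispatchThreads: the first argument is already the total thread count, so
// the threadgroup size does not change the grid extent.
inline Size gridForDispatchThreads(Size threadsPerGrid, Size /*threadsPerThreadgroup*/) {
    return threadsPerGrid;
}

// Largest thread_position_in_grid value along an axis with the given extent.
inline std::uint32_t maxThreadPosition(std::uint32_t gridExtent) {
    assert(gridExtent > 0);  // an empty grid has no thread positions
    return gridExtent - 1;
}
```

Passing a threadgroup count of 30 x 23 with 16 x 16 threadgroups gives a 480 x 368 grid through dispatchThreadgroups, but only a 30 x 23 grid through dispatchThreads, so thread_position_in_grid.x would never exceed 29. That would be consistent with the small range of values described above.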
