Metal compute shaders threadgroup & threadExecutionWidth

Question

Can someone explain in simple terms what a threadgroup conceptually is in Metal compute shaders, along with related terms such as SIMD group and threadExecutionWidth (wavefront)? I read the docs but am more confused. For instance, if I have a 1024x1024 image, how many threadgroups can I have, how can I map a thread to each pixel, how many can run concurrently, etc.? I can't find a WWDC video describing compute shaders and these concepts.

Matthijs Hollemans · Accepted Answer · 2019-04-03T17:01:12.330

7

A threadgroup is a group of threads that work together to solve a certain (sub)problem. You can have a maximum of 512 or 1024 threads in a threadgroup (depending on the device you're using).

The threadExecutionWidth is the size of the SIMD groups used. It's typically 32, meaning each SIMD group has 32 threads in it. For optimal performance, the number of threads in your threadgroup should be a multiple of threadExecutionWidth. (This is indeed what others call the wavefront or warp.)
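As a rough sketch of that rule, the helper below picks a 2D threadgroup shape whose width equals the SIMD width and whose total size stays within the limit. It is plain Swift with no Metal import; `threadgroupShape` is a hypothetical name, and the two parameters stand in for `threadExecutionWidth` and `maxTotalThreadsPerThreadgroup`, which you would read from your `MTLComputePipelineState` at runtime rather than hard-code.

```swift
// Hypothetical helper (not part of Metal): choose a threadgroup shape whose
// total thread count is a multiple of the SIMD width.
// On a real device, pass pipeline.threadExecutionWidth and
// pipeline.maxTotalThreadsPerThreadgroup instead of the literals used below.
func threadgroupShape(executionWidth: Int, maxTotalThreads: Int) -> (width: Int, height: Int) {
    // One row of the threadgroup is exactly one SIMD group wide.
    let width = executionWidth
    // Stack as many rows as the limit allows (integer division rounds down).
    let height = max(1, maxTotalThreads / executionWidth)
    return (width, height)
}

let shape = threadgroupShape(executionWidth: 32, maxTotalThreads: 512)
print(shape.width, shape.height, shape.width * shape.height) // prints "32 16 512"
```

With the example values this yields a 32x16 threadgroup, i.e. 16 SIMD groups of 32 threads each.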

If you have a 1024x1024 image and you want one thread to process one pixel, and the maximum threadgroup size is 512, then you can create a grid of 1024x1024 threads that consists of 32x64 threadgroups of size 32x16 (i.e. 512).

But really, you can divide up the threads however you want. You could also have a grid of 2x1024 threadgroups of size 512x1, or whatever.
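The arithmetic behind those two layouts can be sketched as follows. This is plain Swift with hypothetical helper names (`threadgroupCount`, `pixelCoordinate`), not Metal API; it only illustrates how the number of threadgroups follows from the image size and the chosen threadgroup shape, and how a thread's position within its threadgroup maps to a pixel.

```swift
// Hypothetical helper: how many threadgroups are needed to cover the image.
// Rounds up so that images whose size is not a multiple of the threadgroup
// size are still fully covered.
func threadgroupCount(imageWidth: Int, imageHeight: Int,
                      groupWidth: Int, groupHeight: Int) -> (x: Int, y: Int) {
    let x = (imageWidth + groupWidth - 1) / groupWidth
    let y = (imageHeight + groupHeight - 1) / groupHeight
    return (x, y)
}

// Hypothetical helper: the pixel a thread handles, given which threadgroup it
// is in and where it sits inside that threadgroup.
func pixelCoordinate(groupX: Int, groupY: Int, localX: Int, localY: Int,
                     groupWidth: Int, groupHeight: Int) -> (x: Int, y: Int) {
    return (groupX * groupWidth + localX, groupY * groupHeight + localY)
}

let tiled = threadgroupCount(imageWidth: 1024, imageHeight: 1024, groupWidth: 32, groupHeight: 16)
print(tiled.x, tiled.y) // prints "32 64"

let rows = threadgroupCount(imageWidth: 1024, imageHeight: 1024, groupWidth: 512, groupHeight: 1)
print(rows.x, rows.y) // prints "2 1024"

let pixel = pixelCoordinate(groupX: 1, groupY: 2, localX: 3, localY: 4, groupWidth: 32, groupHeight: 16)
print(pixel.x, pixel.y) // prints "35 36"
```

In an actual kernel you would normally not compute the pixel coordinate yourself: a parameter marked `[[thread_position_in_grid]]` already holds it. When the counts are rounded up for an image that is not a multiple of the threadgroup size, some threads fall outside the image, so the kernel should return early for those positions.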

edited Apr 03 '19 at 17:01

answered Apr 30 '18 at 08:32


You didn't explain the significance of SIMD group. Any WWDC video covering these topics? Also if maxTotalThreadsPerThreadgroup is 512, and the image is 1024x1024, can we safely assume that pixels will be processed in chunks of 512 serially? In other words, the next group of 512 pixels will not be processed unless the previous 512 pixels have been processed? – Deepak Sharma Apr 30 '18 at 09:06
Also, how do we define the shape of the grid -- 1x512, 2x256, or 32x16, etc.? I am unable to find any extensive tutorial or WWDC video describing these details; perhaps someone can point me to one. – Deepak Sharma Apr 30 '18 at 09:11
4

The GPU hardware is split up into several SIMD groups. If threadExecutionWidth is 32 and the maxThreadsPerThreadgroup is 512, that means there are 512/32=16 of these SIMD groups in the hardware and each of these SIMD groups can run 32 threads at a time. The GPU will decide which group of 32 threads to schedule in which SIMD group -- as a developer you have no control over this. Actual hardware details are unpublished by Apple, so exactly how the GPU works is mostly guesswork. – Matthijs Hollemans Apr 30 '18 at 09:25
2

As for "can we safely assume that pixels will be processed in chunks of 512 serially?": first off, *you* as the developer determine what this grid of threads looks like and what each thread should do. The GPU doesn't care, it just launches the threads you asked for. Second, the GPU can launch these threads in any order, but it will always do so in groups of threadExecutionWidth because it must always run an entire SIMD group at a time. Even if you use 1 thread, it still runs the entire SIMD group and just throws away the results from the other 31 threads. – Matthijs Hollemans Apr 30 '18 at 09:27
Deep in [Metal-Feature-Set-Tables.pdf](https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf) (footnote 3 on page 9) they say that 512 or 1024 is the "theoretical maximum", and that you should grab `MTLComputePipelineState.maxTotalThreadsPerThreadgroup` at runtime to know the actual maximum. – whlteXbread Jun 29 '23 at 22:38
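To make the point about whole SIMD groups concrete, here is a small sketch in plain Swift. `simdUsage` is a hypothetical helper, and it only models the behaviour described in the comments above (threads are always run in units of threadExecutionWidth); it says nothing about how the hardware actually schedules work, which is unpublished.

```swift
// Hypothetical helper: how many SIMD groups a threadgroup occupies and how
// many lanes in the last group do no useful work, assuming threads are
// always run in whole groups of `executionWidth`.
func simdUsage(threads: Int, executionWidth: Int) -> (simdGroups: Int, idleLanes: Int) {
    // Round up: a partially filled SIMD group still runs as a full one.
    let groups = (threads + executionWidth - 1) / executionWidth
    return (groups, groups * executionWidth - threads)
}

let single = simdUsage(threads: 1, executionWidth: 32)
print(single.simdGroups, single.idleLanes) // prints "1 31"

let full = simdUsage(threads: 512, executionWidth: 32)
print(full.simdGroups, full.idleLanes) // prints "16 0"

let uneven = simdUsage(threads: 500, executionWidth: 32)
print(uneven.simdGroups, uneven.idleLanes) // prints "16 12"
```

This is why a threadgroup size that is a multiple of threadExecutionWidth wastes no lanes, while any other size leaves part of the last SIMD group idle.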
