So I'm exploring WebGPU and figured it would be an interesting exercise to implement a basic neural network in it. Having little understanding of either GPU shader programming or neural networks, and with my only references for WebGPU (w3.org/TR/webgpu and w3.org/TR/WGSL) being highly technical, it has been really interesting indeed.
Anyway, somehow I've muddled my way to a point where I can actually perform feed-forward and back-propagation correctly on a small network, and it's blazingly fast compared to my JS CPU implementation, even though I'm sure I'm severely underutilizing the hardware.
I've come to a point where I want to try bigger networks, but I'm at a bit of a loss when it comes to workgroups and synchronizing execution. To keep it simple, I'll focus my question on the feed-forward operation:
Currently, I'm dispatching exactly the number of threads that corresponds to the widest layer in the neural network. The idea is that each thread computes the value of a single neuron in the current layer, hits a barrier, and then every thread moves on to the next layer together, and so on.
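To make this concrete, here is a minimal sketch of what that single-dispatch version looks like, simplified so every layer has the same width and everything fits in one workgroup. The buffer names, the weight layout, and the sigmoid are made up for illustration, and I've put the activations in workgroup memory here so that workgroupBarrier() is unambiguously the right call; my real code keeps them in storage buffers.

```wgsl
// Sketch only: names and layout are illustrative, every layer is assumed to be
// N neurons wide, and the whole network runs inside a single workgroup.

const N = 256u;       // neurons per layer
const LAYERS = 4u;    // layers after the input layer

@group(0) @binding(0) var<storage, read> weights : array<f32>;           // LAYERS * N * N
@group(0) @binding(1) var<storage, read> biases : array<f32>;            // LAYERS * N
@group(0) @binding(2) var<storage, read_write> activations : array<f32>; // N, holds the input going in

// Ping-pong arrays for the previous / current layer's values.
var<workgroup> prev : array<f32, N>;
var<workgroup> next : array<f32, N>;

@compute @workgroup_size(N)
fn feed_forward(@builtin(local_invocation_id) lid : vec3<u32>) {
  let i = lid.x;                 // this thread owns neuron i of every layer
  prev[i] = activations[i];      // load the input layer
  workgroupBarrier();

  for (var li = 0u; li < LAYERS; li++) {
    var sum = biases[li * N + i];
    for (var j = 0u; j < N; j++) {
      sum += weights[(li * N + i) * N + j] * prev[j];
    }
    next[i] = 1.0 / (1.0 + exp(-sum));   // sigmoid

    // Everyone must finish writing `next` before anyone reads it as `prev`.
    workgroupBarrier();
    prev[i] = next[i];
    workgroupBarrier();
  }

  activations[i] = prev[i];      // write the output layer back out
}
```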
The problem is, I only know of two ways to set a barrier - either workgroupBarrier(), or ending execution and dispatching a new pile of threads for the next layer.
The problem with the first one is that it only works within a workgroup, and workgroups can only get so big before performance starts to suffer: from what I understand, a workgroup has to run on a single CU because of the need to share memory. If I could make a 256x256 workgroup, it would just get cut into chunks that the one CU has to chew through while the rest of the hardware sits idle. So the width of my networks is limited by how many threads a single CU can handle, which is pretty lame.
The problem with the second one is pretty obvious - a separate dispatch is just slow, much slower than a barrier in my testing.
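For comparison, the per-dispatch variant I tested has roughly this shape (again, the struct and buffer names are made up): the shader computes just one layer, and the JS side records one dispatch per layer, re-binding the previous layer's output as `prev` each time. As far as I can tell, the writes from one dispatch are visible to the next, which is why this version is guaranteed correct; it's just slow.

```wgsl
// Sketch only: computes ONE layer per dispatch; the host records one dispatch
// per layer, ceil(out_size / 64) workgroups along x, rebinding buffers each time.

struct LayerDims {
  in_size : u32,    // width of the previous layer
  out_size : u32,   // width of the layer being computed
};

@group(0) @binding(0) var<uniform> dims : LayerDims;
@group(0) @binding(1) var<storage, read> weights : array<f32>;      // out_size * in_size
@group(0) @binding(2) var<storage, read> biases : array<f32>;       // out_size
@group(0) @binding(3) var<storage, read> prev : array<f32>;         // in_size
@group(0) @binding(4) var<storage, read_write> next : array<f32>;   // out_size

@compute @workgroup_size(64)
fn feed_forward_layer(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i >= dims.out_size) {
    return;   // padding threads in the last workgroup
  }
  var sum = biases[i];
  for (var j = 0u; j < dims.in_size; j++) {
    sum += weights[i * dims.in_size + j] * prev[j];
  }
  next[i] = 1.0 / (1.0 + exp(-sum));   // sigmoid
}
```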
As it is right now, I'm not using workgroup shared memory at all; all I want to do is dispatch an arbitrary number of threads and have a global barrier. As far as I understand, though, WebGPU doesn't have a global barrier... except maybe storageBarrier?
Even after reading the two sentences on w3.org describing it, I still have no clue what it actually does, but I think it has something to do with memory access synchronization rather than being a global barrier. I did test it and the results come out correct; however, even if I remove all barriers from my code the results still come out correct, a perk of the SIMT execution style of GPUs, I guess. But I don't need "probably correct", I need guaranteed correct, so I need a global barrier. Is storageBarrier the thing? If not, what is it?
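For reference, the storageBarrier version I tested has this shape: the same made-up layout as the first sketch, but the activations ping-pong between the two halves of a storage buffer instead of workgroup memory. My best guess from the spec is that storageBarrier() orders storage-buffer accesses within a workgroup rather than acting as a device-wide barrier, which is why I don't trust it across workgroups.

```wgsl
// Sketch only: same illustrative layout as above, but activations live in a
// storage buffer (two ping-pong halves) and storageBarrier() sits between layers.

const N = 256u;
const LAYERS = 4u;

@group(0) @binding(0) var<storage, read> weights : array<f32>;    // LAYERS * N * N
@group(0) @binding(1) var<storage, read> biases : array<f32>;     // LAYERS * N
@group(0) @binding(2) var<storage, read_write> acts : array<f32>; // 2 * N, input in the first half

@compute @workgroup_size(N)
fn feed_forward(@builtin(local_invocation_id) lid : vec3<u32>) {
  let i = lid.x;
  for (var li = 0u; li < LAYERS; li++) {
    let src = (li % 2u) * N;          // half holding the previous layer's values
    let dst = ((li + 1u) % 2u) * N;   // half this layer writes into
    var sum = biases[li * N + i];
    for (var j = 0u; j < N; j++) {
      sum += weights[(li * N + i) * N + j] * acts[src + j];
    }
    acts[dst + i] = 1.0 / (1.0 + exp(-sum));
    storageBarrier();   // results come out right with this, but is it guaranteed?
  }
}
```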
Bonus question - why are there 3 dimensions to workgroups and dispatches? Why not just have one?