Cellular automata on GPU with WGSL

Question

I am writing a physics simulation that works like a cellular automaton. Each step depends on the previous one; more precisely, each cell needs its own state and the state of its direct neighbors to compute its new state. I am using two buffers that alternate roles at each step (multiple reads, single write).

I am using WGSL (WebGPU). For the moment, I issue one dispatch per step (a whole-grid update, in other words t+1) to ensure synchronization between steps, but this results in quite slow performance. (EDIT: because I was not making proper use of workgroups.)

Because I suspected that communication between the CPU and GPU was the limiting factor (SPOILER ALERT: no, it is not), I tried to perform the steps in a loop directly in the shader, but I am unable to synchronize all workgroups between steps.

I tried using storageBarrier and workgroupBarrier, which does not work (synchronization does not occur). Nonetheless, if I run only two successive steps with one barrier between them, performance doubles, which suggests I am losing most of the time in the dispatch. And the result is almost perfect (meaning some synchronization did not happen, but it did not affect the result much).

EDIT: the previous paragraph is a misunderstanding; the result of my test was misleading.

I read that it is impossible to synchronize all workgroups in a single dispatch with the current WGSL specification. But then I don't understand why workgroupBarrier and storageBarrier exist at all.

How can I force all workgroups to synchronize between each step of the cellular automaton?

More generally, I guess I am not the first person to write a cellular automaton on the GPU with this direct-neighbor dependency:

How do I write a fast cellular automaton on the GPU?

score 3 · Answer 1 · answered May 21 '23 at 10:37

I'm not sure exactly how you're going about writing your program. I'm guessing you're using compute, and maybe you're trying to read and write to the same buffer?

Cellular automata are usually coded using two buffers: one for the state from the last step (read-only) and one for the new state in the current step (write-only). Each invocation can read multiple values from the previous step and usually writes one value to the current buffer.

At the end of each step, you swap them. This way you should not need any barriers, and it can be implemented in either a graphics or a compute pipeline.
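To make the swap concrete, here is a minimal CPU-side sketch of the two-buffer scheme in TypeScript. It is only a model of the idea, not WebGPU code: the function names `step` and `simulate` are hypothetical, and Conway's Game of Life with wrap-around edges is an assumed stand-in for whatever rule the simulation actually uses.

```typescript
// CPU model of the two-buffer (ping-pong) scheme. The Game of Life rule and
// the wrap-around edges are assumptions chosen only to make the sketch concrete.

// One whole-grid update: every cell reads only from `src` and writes only
// its own slot in `dst`, so no cell can observe a half-updated neighbor.
function step(src: Uint8Array, dst: Uint8Array, width: number, height: number): void {
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let neighbors = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          if (dx === 0 && dy === 0) continue; // skip the cell itself
          const nx = (x + dx + width) % width;   // wrap horizontally
          const ny = (y + dy + height) % height; // wrap vertically
          neighbors += src[ny * width + nx];
        }
      }
      const alive = src[y * width + x] === 1;
      dst[y * width + x] = neighbors === 3 || (alive && neighbors === 2) ? 1 : 0;
    }
  }
}

// Runs `steps` updates, swapping the roles of the two buffers after each one.
// Returns the buffer that holds the final state; `initial` is not modified.
function simulate(initial: Uint8Array, width: number, height: number, steps: number): Uint8Array {
  let prev = Uint8Array.from(initial);
  let next = new Uint8Array(initial.length);
  for (let i = 0; i < steps; i++) {
    step(prev, next, width, height);
    [prev, next] = [next, prev]; // swap roles; no data is copied
  }
  return prev;
}
```

On the GPU, each call to `step` would correspond to one dispatch (or one draw), and the swap would typically amount to alternating which buffer is bound as the input and which as the output.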

answered May 21 '23 at 10:37

Axiverse


https://github.com/lemire/SIMDgameoflife/blob/master/include/basicautomata.h is an example of a CPU SIMD (AVX2) implementation. It fills a scratch buffer of neighbor counts and then updates the original states in place, which is pretty much the same thing for a single thread. It could easily be adapted to update a separate destination instead, which might also be more efficient for the CPU SIMD implementation, maybe avoiding a store/reload and a second pass over the data. – Peter Cordes May 21 '23 at 11:43
Using a scratch buffer then copying to the original buffer is just double buffering with an extra copy (so that you don't have to swap buffers the next time). – Axiverse May 30 '23 at 06:08
Yes. But note that it's not just copying in the final pass; each pass does part of the actual work. (Other than maybe the top and bottom rows, it should be easy to optimize to one pass, and that's what I was suggesting. I'd been hoping to find it as an example of actually using two buffers the efficient way, but it turned out it was doing this extra loading and storing.) – Peter Cordes May 30 '23 at 07:13
