Inter-block synchronization in CUDA

Question

I've spent a month searching for a solution to this problem: I cannot synchronize blocks in CUDA.

I've read a lot of posts about atomicAdd, cooperative groups, etc. I decided to use a global array in which each block writes to one element. After this write, one thread of the block waits (i.e., spins in a while loop) until all blocks have written to the global array.

When I use 3 blocks, my synchronization works well (because I have 3 SMs). But using 3 blocks gives me only 12% occupancy, so I need to use more blocks, and then they can't be synchronized. The problem is that a block resident on an SM waits for the other blocks, so the SM can't take on another block.

What can I do? How can I synchronize blocks when there are more blocks than SMs?

CUDA/GPU specification: CC 6.1, 3 SMs, Windows 10, VS2015, GeForce MX150 graphics card. Please help me with this problem. I have tried a lot of code, but none of it works.

I can, but only when the number of SMs and the number of blocks are equal. It doesn't make sense that there is no way. I need it. — pedram64, Dec 14 '18 at 09:59
It makes perfect sense. The architecture and programming model are basically incapable of this sort of synchronization. If that doesn't work for you, then you either need a different algorithm, or you need to use a different sort of parallel hardware. Just because you need something, or think it doesn't make sense, doesn't automatically make it possible — talonmies, Dec 14 '18 at 11:06
Can I use it for inter-block synchronization? I mean using a parent grid and child grids — pedram64, Dec 14 '18 at 13:38

score 5 · Accepted Answer · answered Dec 14 '18 at 14:41

The CUDA programming model provides two methods for inter-block synchronization:

1. (implicit) Use the kernel launch itself. Before the kernel launch, or after it completes, all blocks (in the launched kernel) are synchronized to a known state. This is conceptually true whether the kernel is launched from host code or as part of a CUDA Dynamic Parallelism launch.
2. (explicit) Use a grid sync in CUDA Cooperative Groups. This has a variety of requirements for support, which you are starting to explore in your other question. The simplest definition of support is whether the appropriate device property (cooperativeLaunch) is set. You can query the property programmatically, using cudaGetDeviceProperties().
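
As an illustration of the explicit method, here is a minimal sketch of a grid-wide barrier with Cooperative Groups. The kernel name twoPhase, the buffer names, and the sizes are hypothetical and not taken from the question. The sketch assumes the device reports cooperativeLaunch, and it sizes the grid from an occupancy query so that all blocks can be resident at once, which a cooperative launch requires. Depending on the toolkit version, compiling with relocatable device code (-rdc=true) may also be needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical two-phase kernel: phase 2 reads values that other blocks
// wrote in phase 1, so the whole grid must finish phase 1 first.
__global__ void twoPhase(int *data, int *out, int n)
{
    cg::grid_group grid = cg::this_grid();
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Grid-stride loop: a grid limited to co-resident blocks still covers n.
    for (int i = tid; i < n; i += stride)
        data[i] = i;

    grid.sync();  // every thread of every block waits here

    for (int i = tid; i < n; i += stride)
        out[i] = data[(i + 1) % n];
}

int main()
{
    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    if (!prop.cooperativeLaunch) {
        printf("Cooperative launch is not supported on this device.\n");
        return 0;
    }

    const int threads = 256;  // assumed block size
    int n = 1 << 20;          // assumed problem size
    int *data = nullptr, *out = nullptr;
    cudaMalloc(&data, n * sizeof(int));
    cudaMalloc(&out,  n * sizeof(int));

    // Launch no more blocks than can be resident simultaneously.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, twoPhase,
                                                  threads, 0);
    int blocks = blocksPerSM * prop.multiProcessorCount;

    void *args[] = { &data, &out, &n };
    cudaError_t err = cudaLaunchCooperativeKernel((void *)twoPhase,
                                                  dim3(blocks), dim3(threads),
                                                  args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // out[n-1] was read from data[0], which a different block may have written.
    int last = -1;
    cudaMemcpy(&last, out + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[n-1] = %d\n", last);

    cudaFree(data);
    cudaFree(out);
    return 0;
}
```

The occupancy query typically allows several blocks per SM, so a cooperative grid is not limited to one block per SM. It is limited to the number of blocks that can be resident at the same time.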

Robert Crovella

Thanks for your reply. But which approach performs better: 1) launching the kernel again, or 2) using dynamic parallelism? – pedram64 Dec 14 '18 at 15:49

Performance questions usually can't be answered in a vacuum. Stated another way: "it depends". A carefully crafted launch in cooperative groups might be better in many scenarios than using multiple kernel launches to achieve synchronization. However, it probably depends on the exact algorithm and your coding skills. Between multiple host launches and multiple device launches using dynamic parallelism, I don't think there is much to recommend one over the other. They should be pretty equivalent. – Robert Crovella Dec 14 '18 at 15:51
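
For comparison, here is a minimal sketch of the multiple-host-launch option mentioned above, where the boundary between two kernel launches acts as the barrier. The kernel names phase1 and phase2 and the sizes are hypothetical. It relies on the fact that kernels issued to the same stream run in order, so phase2 does not start until every block of phase1 has finished, regardless of how many blocks are launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical phase 1: every block writes its part of data.
__global__ void phase1(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Hypothetical phase 2: reads values that other blocks wrote in phase 1.
__global__ void phase2(const int *data, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[(i + 1) % n];
}

int main()
{
    const int n = 1 << 20;    // assumed problem size
    const int threads = 256;  // assumed block size
    const int blocks = (n + threads - 1) / threads;  // far more blocks than SMs

    int *data = nullptr, *out = nullptr;
    cudaMalloc(&data, n * sizeof(int));
    cudaMalloc(&out,  n * sizeof(int));

    // Both launches go to the default stream, so they execute in order.
    // The launch boundary is the inter-block synchronization point.
    phase1<<<blocks, threads>>>(data, n);
    phase2<<<blocks, threads>>>(data, out, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // out[n-1] was read from data[0], written by block 0 in phase 1.
    int last = -1;
    cudaMemcpy(&last, out + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[n-1] = %d\n", last);

    cudaFree(data);
    cudaFree(out);
    return 0;
}
```

This form places no limit on the number of blocks, at the cost of one launch per synchronization point and of losing any state held in registers or shared memory across the boundary.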
