
I've been learning about parallel/GPU programming a lot recently, and I've encountered a situation that's stumped me. What happens when two threads in a warp/wave attempt to write to the same exact location in shared memory? Specifically, I'm confused as to how this is handled, given that warp threads each execute the exact same instruction at the same time (to my understanding).

For instance, say you dispatch a shader that runs 32 threads, the size of a normal non-AMD warp. Assuming no dynamic branching (which, as I understand it, normally calls up a second warp to execute the branched code? I could be very wrong about that), what happens if we have every single thread try to write to a single location in shared memory?
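
To be clear, by dynamic branching I mean something like the following HLSL, where threads in the same warp take different paths (the kernel and buffer names here are hypothetical, just to illustrate the term):

RWStructuredBuffer<uint> results;

#pragma kernel DivergenceExample
[numthreads(32, 1, 1)]
void DivergenceExample (uint thread_id : SV_GroupIndex) {
    // Threads 0-15 take one path while threads 16-31 take the other,
    // so the threads of one warp no longer all execute the same instruction.
    if (thread_id < 16)
        results[thread_id] = 0;
    else
        results[thread_id] = 1;
}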

Though I believe my question applies to any kind of GPU code, here's a simple example in HLSL:

groupshared uint test_target;

#pragma kernel WarpWriteTest
[numthreads(32, 1, 1)]
void WarpWriteTest (uint thread_id : SV_GroupIndex) {
    // All 32 threads in the group store to the same groupshared location at once.
    test_target = thread_id;
}

I understand this is almost certainly implementation-specific, but I'm just curious what would generally happen in a situation like this. Obviously, you'd end up with an unpredictable value stored in test_target, but what I'm really curious about is what happens at the hardware level. Does the entire warp have to wait until every write is complete, at which point it will continue executing in lockstep (and would this result in noticeable latency)? Or is there some other mechanism in GPU shared memory/cache that I'm not understanding?
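
For contrast, if I wanted a defined final value, my understanding is that I'd have to request serialization explicitly with an atomic. Here's a minimal sketch of what I mean (the kernel name is mine, just for illustration; InterlockedMax and GroupMemoryBarrierWithGroupSync are standard HLSL intrinsics):

groupshared uint test_target;

#pragma kernel WarpWriteTestAtomic
[numthreads(32, 1, 1)]
void WarpWriteTestAtomic (uint thread_id : SV_GroupIndex) {
    // One thread initializes the groupshared value, then the whole group syncs.
    if (thread_id == 0)
        test_target = 0;
    GroupMemoryBarrierWithGroupSync();

    // The atomic serializes the conflicting writes, so the final value
    // is well-defined: test_target ends up as 31.
    InterlockedMax(test_target, thread_id);
}

But my question is about the plain, non-atomic store above, where no such serialization is requested.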

Let me clarify: I'm not asking what happens when multiple threads try to access a value in global memory/DRAM. I'd be curious to know, but my question is specifically concerned with the shared memory of a threadgroup. I also apologize if this information is readily available somewhere else; as anyone reading might know, GPU terminology in general can be very nebulous and non-standardized, so I've had difficulty even knowing what I should be looking for.

Thank you so much!

  • **(a)** The GPU computing fabric hosts many SIMD SMX devices, which are in principle latency-masking engines: they hide memory-access costs, best with warp-wide round-robin scheduling, less so when divergent threads disallow warp-wide switching. Just look at how long it takes an SMX to access shared memory (~70 ns at 1147 MHz on a Fermi streaming multiprocessor; latency maps in https://stackoverflow.com/a/33065382). **(b)** The rest is driven by the power, clock, and micro-electronic design compromises of implementing the CRCW-PRAM model in silicon (software locks/barriers simply block to avoid concurrent-write artifacts). – user3666197 Apr 02 '22 at 02:07

0 Answers