
Using this question as a basis, I implemented a pseudo-random number generator with a global state:

  __global uint global_random_state;

  void set_random_seed(uint seed){
    global_random_state = seed;
  }

  uint get_random_number(uint range){
    uint seed = global_random_state + get_global_id(0);
    /* xorshift-style mixing of the per-work-item seed */
    uint t = seed ^ (seed << 11);
    uint result = seed ^ (seed >> 19) ^ (t ^ (t >> 8));
    global_random_state = result; /* race condition? */
    return result % range;
  }

Since these functions will be used from multiple threads, there will be a race condition when writing to global_random_state.

This might actually help the system to be more unpredictable, so it seems like a good thing, but I'd like to know if there are any consequences to this that might not surface immediately. Are there any side-effects inside the GPU which might cause problems later on when the kernel is run?

Dávid Tóth

1 Answer


In theory you want atom_cmpxchg here for correctness (or the equivalent on your GPGPU platform). However, a grave note of warning: having the entire machine serialize through a single cacheline is going to fundamentally strangle your performance. Atomics on the same address must form a queue and wait. Atomics on different locations can parallelize (more details at the end).
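
A minimal sketch of what that could look like, assuming the state lives in a buffer passed to the kernel (program-scope __global variables need OpenCL 2.0) and that 32-bit global atomics are available (atomic_cmpxchg in OpenCL 1.1+, or atom_cmpxchg via the cl_khr_global_int32_base_atomics extension); the function and parameter names are illustrative, not from the original post:

  /* Sketch only: wrap the xorshift step in a compare-and-swap loop so the
     state update is atomic. */
  uint get_random_number_atomic(volatile __global uint *state, uint range){
    uint old, result;
    do {
      old = *state;                       /* read the current state            */
      uint t = old ^ (old << 11);         /* same mixing step as the question  */
      result = old ^ (old >> 19) ^ (t ^ (t >> 8));
      /* publish the new state only if nobody changed it in the meantime */
    } while (atomic_cmpxchg(state, old, result) != old);
    return result % range;
  }

Note that every work item in the dispatch contends on that single word, which is exactly the serialization the rest of this answer warns about.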

Generally, algorithms that leverage random variables on GPGPU keep their own copy of the random-number generator per work item. This enables each work item to cache and potentially reuse its own random values without glutting the bus with memory traffic on every new random number. Search for "OpenCL Monte Carlo Simulation" or "OpenCL Monte Carlo Example" for samples. CUDA has some nice examples too.
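
For illustration, a rough sketch of that per-work-item approach; the kernel, buffer names, and the xorshift32 mixer are assumptions for the example, not taken from any particular sample:

  /* Sketch: one private state word per work item, seeded with distinct
     nonzero values from the host; no shared writes and no atomics needed. */
  uint xorshift32(uint x){
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
  }

  __kernel void monte_carlo_step(__global uint *states, __global float *out){
    size_t gid = get_global_id(0);
    uint s = states[gid];                /* private copy of this item's state */
    float acc = 0.0f;
    for (int i = 0; i < 1024; ++i){
      s = xorshift32(s);
      acc += (float)s / 4294967296.0f;   /* uniform sample in [0,1)           */
    }
    out[gid] = acc / 1024.0f;
    states[gid] = s;                     /* write the state back once, at the end */
  }

The only global traffic for the generator itself is one load and one store per work item per kernel launch.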

Another option is to use a random generator that allows one to skip ahead, so that different work items move forward in the sequence by different amounts. This can be more compute-intensive, but the tradeoff is that you don't strain the memory hierarchy as much.
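
One way to get that property cheaply is a counter-based generator, where each value is a pure function of (seed, work item, sample index), so skipping ahead is just picking a different index. A hedged sketch, using a common 32-bit integer-mixing hash rather than any specific published generator:

  /* Sketch: stateless, counter-based random numbers. Nothing is stored or
     shared, so there is no race and no atomic. The constants come from
     well-known integer hash finalizers and are illustrative only. */
  uint hash_rng(uint seed, uint gid, uint i){
    uint x = seed ^ (gid * 0x9E3779B9u) ^ (i * 0x85EBCA6Bu);
    x ^= x >> 16;  x *= 0x7FEB352Du;
    x ^= x >> 15;  x *= 0x846CA68Bu;
    x ^= x >> 16;
    return x;
  }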

More gory details on atomics: (1) GPU cache atomics are designed to expect contiguous arrays, and atomic ALUs are per bank, (2) each dword in a cacheline will be processed by the same atomic ALU each time, and (3) neighboring cachelines will hash to different banks. So, if every clock you are doing atomics on contiguous cachelines of data, then the work should be perfectly spread out (or statistically so). Conversely, if one makes every work item atomically modify the same 32b, then the cache system cannot apply the same atomic ALU slot to all 16/32/64 lanes (whatever your system uses). It must break the operation up into 16/32/64 separate atomic operations and apply them iteratively (by (2) above). In a system where you have 512 ALUs to process atomics, you would be using 1 of those ALUs each clock (the same one). Spread the work out and you can use all 512 per clock.
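
To make the contrast concrete, a small illustrative pair of kernels (the names and array size are made up for the example):

  /* Every work item hammers the same 32b word: one atomic ALU does all the
     work, fully serialized. */
  __kernel void hot_word(volatile __global uint *counter){
    atomic_inc(counter);
  }

  /* Work items spread their atomics across contiguous cachelines, which hash
     to different banks, so many atomic ALUs can run in parallel. */
  __kernel void spread_out(volatile __global uint *counters){
    atomic_inc(&counters[get_global_id(0) % 4096]);
  }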

Tim
  • Sorry, can you please clear up the first part a bit? Is it straining the GPU only when the atom_cmpxchg is used, or is it straining the system with the current usage? – Dávid Tóth Feb 18 '22 at 08:00
  • Sure thing. I meant the proposed use (a single global dword for the whole system). – Tim Feb 18 '22 at 19:34
  • Thanks! It also somehow produced false-negative tests with Catch2 unit-test approximations, so it was a bad choice overall. I guess it had something to do with the cache lines of the GPU. – Dávid Tóth Feb 19 '22 at 18:24