
I'm trying to generate 32 64x64 bitmaps with a single CUDA kernel call. When rendering these images, I want to randomize the parameters of image generation both per-image and per-pixel. That is, some randomized decisions happen once and apply consistently to all pixels in an image, while other decisions are made independently for each pixel. I'm trying to figure out a cuRAND setup to enable this.

The approach I have so far uses two state arrays: one with 32 sequences (one per image) and another with 4096 sequences (one per pixel). I pass both into my kernel and compute each pixel value based on both. This sorta works, but I'm seeing some weird artifacts. I'm looking for advice on how to fix this, or suggestions for an alternative approach that would work better.

If I render the images using only the per-pixel noise, I would expect to get the same image of random static 32 times. What I actually get is different but highly correlated images of random static. Interestingly, the first several images are almost identical, and the later images (larger img_id) become more different.

If I render the images using only the per-image noise, I would expect each image to be a solid block of some random color. What I actually get is mostly images of a solid color, but sometimes the four quadrants of the image aren't the same. Again, the first images are all consistent, and the later images are more varied.

I suspect part of my problem is that each 64x64 image is actually composed of a 2x2 grid of blocks that are 32x32 threads each (my device supports at most 1024 threads per block). The cuRAND docs say "two different blocks can not operate on the same state safely," but I don't see any guidance on what to do about that.

Can anyone offer some insight into what's going wrong here? Any advice on how to fix this, or another approach that would work better?

Code snippet below:

__global__ void init_rngs(curandState* per_img_rng_state, curandState* per_pxl_rng_state) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int img_id = blockIdx.z;  // blockDim.z == 1, so blockIdx.z identifies the image
    int pxl_id = col * 64 + row;

    curand_init(42, img_id, 0, &per_img_rng_state[img_id]);
    curand_init(42, pxl_id, 0, &per_pxl_rng_state[pxl_id]);
}

__global__ void make_images(curandState* per_img_rng_state, curandState* per_pxl_rng_state, unsigned char* image) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int img_id = blockIdx.z;  // blockDim.z == 1, so blockIdx.z identifies the image
    int pxl_id = col * 64 + row;
    int idx = img_id * 64 * 64 + row * 64 + col;  // flat index into the image buffer

    unsigned int per_img_noise = curand(&per_img_rng_state[img_id]);
    unsigned int per_pxl_noise = curand(&per_pxl_rng_state[pxl_id]);

    // (The three writes below are alternative examples; in real code only one would be used.)

    // An example of logic mixing the two sources of noise:
    unsigned int density = per_img_noise;
    unsigned int value = per_img_noise ^ per_pxl_noise;
    image[idx] = (value >= density) ? 0x00 : 0xFF;

    // An example using only per-pixel noise:
    image[idx] = (per_pxl_noise & 1) ? 0x00 : 0xFF;

    // An example using only per-image noise (top 8 bits; 16777216 == 2^24):
    image[idx] = per_img_noise / 16777216;
}

void randomize_images() {
  curandState* per_img_rng_state = nullptr;
  curandState* per_pxl_rng_state = nullptr;
  unsigned char* image = nullptr;

  cudaMalloc(&image, 32*64*64);
  cudaMalloc(&per_img_rng_state, 32 * sizeof(curandState));
  cudaMalloc(&per_pxl_rng_state, 64 * 64 * sizeof(curandState));

  // Blocks are arranged 2x2x32, meaning 32 images made out of 4 blocks in a 2x2 grid.
  // Each block gets 32x32 threads, one per pixel in each quadrant of the image.
  init_rngs<<<dim3(2, 2, 32), dim3(32, 32)>>>(per_img_rng_state, per_pxl_rng_state);
  make_images<<<dim3(2, 2, 32), dim3(32, 32)>>>(per_img_rng_state, per_pxl_rng_state, image);
}

  • Well, you want to randomize "both...", but you didn't say anything about distributions. It seems to me that a Mersenne Twister on the CPU with per-pixel noise (all images "merged" into one linear sequence, i.e., the value after the last pixel of image n becomes the first pixel of image n+1) would be a lot simpler and faster to implement. Doing the same in CUDA can be better if you have strong timing requirements (e.g., 1 µs to compute everything, or displaying the result directly and bypassing the host). In any case, I recommend always validating an approach on the CPU before implementing it in CUDA. – Soleil Dec 25 '22 at 19:25
  • Could you publish an image to illustrate your weird artefacts? – Soleil Dec 25 '22 at 19:26
  • It would be trivial to implement this in a single thread on the CPU, yes. However, I need to run this many thousands of times as part of a genetic algorithm. More importantly, there's another much more complex kernel I haven't written yet that needs per-image and per-pixel randomness, and I'm trying to find a general technique for doing that. – Nate Gaylinn Dec 25 '22 at 20:08
  • I tried to capture and upload some images. Problem is, after a reboot, I'm seeing the glitches much less often (though they aren't completely gone), so it's hard to provide a clear example. That's another problem, actually, because I need deterministic pseudo-randomness. – Nate Gaylinn Dec 25 '22 at 20:22
  • Pseudo-random numbers are deterministic by definition; if you start with the same seed you will always get the same series. – Soleil Dec 26 '22 at 10:52
  • When we work with GPUs, we usually benchmark everything. Did you do that? I suspect that the device-to-host copy of images might completely wipe out the speed advantage of computing on the GPU, not to mention (depending on your implementation) the memory allocations, which take time; unless you plan a limited number of allocations (e.g., 10-100) and use them in batches, recycling them to limit allocation and free operations (both on device and host). I apologize if this is off-topic, but I had the feeling that not everything was taken into account. Anyway, there is now an answer to your OP. – Soleil Dec 26 '22 at 11:00
  • PRNs provide a deterministic sequence, yes, but if you populate your vector in a non-deterministic order, you're effectively shuffling that sequence. That's my problem, not how to call an RNG, but how to synchronize RNG calls across threads and blocks. – Nate Gaylinn Dec 26 '22 at 16:00
  • No, I haven't done any benchmarking or optimization. That's because I'm trying to determine a viable method for writing CUDA kernels that use randomness in a complex and particular way. Speed isn't the point right now. I'm asking about a toy example for the sake of making the question clear so I can learn something about how cuRAND works. Why am I insisting on doing this in CUDA before I've shown the need for speed? Because this is part of a much larger project which I intend to run on the GPU hundreds of thousands of times, only copying results to the host on the last iteration. – Nate Gaylinn Dec 26 '22 at 16:07
  • Can you publish a complete, self-contained code sample or project (e.g., through GitHub)? – Soleil Dec 26 '22 at 19:29
  • No, because it doesn't exist yet. It would also be an irrelevant distraction from the core question: what is a good model for sharing multiple RNGs of different scopes across threads in a CUDA kernel? In any case, I have some ideas to try today. Maybe I'll find an answer. – Nate Gaylinn Dec 27 '22 at 16:44

1 Answer


Okay, I think I've got this working, but if anyone sees a problem with this or has a better idea, please let me know.

The issue here is that cuRAND generally wants each thread to have its own unique sequence, and I'm explicitly trying to share sequences between many threads. Reading a shared state is fine, but each call to curand() also updates the position in the sequence, and when many threads try to update the same RNG state at once, you get race conditions. That's why the random noise I was generating had undesired correlations and was not deterministic.

I dealt with this problem by managing sequence positions manually. Whenever I'm trying to read a value from the RNG, I make a local copy of its state and I don't copy it back to the global state when I'm done. This prevents threads from interfering with each other's randomness. Then I have to manually advance the RNGs using skipahead().
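
As a minimal sketch (assuming one draw from each RNG per launch; the mixing logic is the first example from the question), the read side of make_images becomes:

#include <curand_kernel.h>

__global__ void make_images(curandState* per_img_rng_state,
                            curandState* per_pxl_rng_state,
                            unsigned char* image) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int img_id = blockIdx.z;
    int pxl_id = col * 64 + row;

    // Copy the shared states into thread-local storage. Drawing from the
    // local copies advances only the copies, so the global state arrays
    // are never written concurrently by different threads or blocks.
    curandState local_img_state = per_img_rng_state[img_id];
    curandState local_pxl_state = per_pxl_rng_state[pxl_id];

    unsigned int per_img_noise = curand(&local_img_state);
    unsigned int per_pxl_noise = curand(&local_pxl_state);

    // Same mixing logic as the first example in the question, with a flat index.
    image[img_id * 64 * 64 + row * 64 + col] =
        ((per_img_noise ^ per_pxl_noise) >= per_img_noise) ? 0x00 : 0xFF;

    // The local states are deliberately NOT written back; the global states
    // are advanced separately by the inc_*_rng kernels shown below.
}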

This seems to work well, though it has a few drawbacks. Most seriously, I have to keep track of how many random numbers I generate and make sure I skipahead() by the right number of positions, which is tricky and error-prone. The other drawback is efficiency: this solution requires more copies from global memory and frequent calls to skipahead(), and every call to curand() increments a local state object that just gets thrown away. Still, this seems better than generating randomness on the host, since it lets my program run entirely on the GPU without any host/device memory transfers.

This changes my kernel invocations to look more like this:

init_img_rng<<<1, 32>>>(img_rng);
init_pxl_rng<<<dim3(2, 2), dim3(32, 32)>>>(pxl_rng);
make_images<<<dim3(2, 2, 32), dim3(32, 32)>>>(img_rng, pxl_rng, image);
inc_img_rng<<<1, 32>>>(img_rng);
inc_pxl_rng<<<dim3(2, 2), dim3(32, 32)>>>(pxl_rng);
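
The inc_*_rng kernels aren't shown above; here's a rough sketch of what they might look like. The skip counts are assumptions that must match exactly what make_images consumed: one value from each per-image sequence, and (with the per-image offsets described below) positions 0..31 of each per-pixel sequence.

__global__ void inc_img_rng(curandState* img_rng) {
    // One thread per image RNG; make_images drew one value from each
    // per-image sequence, so advance each global state by one.
    skipahead(1ULL, &img_rng[threadIdx.x]);
}

__global__ void inc_pxl_rng(curandState* pxl_rng) {
    // One thread per pixel RNG; image img_id read this sequence at offset
    // img_id, so positions 0..31 are now used up.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    skipahead(32ULL, &pxl_rng[col * 64 + row]);
}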

Note that this also addresses some inefficiency in the original code, where I used the same block configuration for all my kernel calls even though the init_rngs kernel was processing less data than make_images. Now each kernel invocation is sized to match the data it's operating on.

In the original question, I generated per-pixel noise unique to each image by combining the per-image and per-pixel noise with XOR. I'm not sure that's really safe, so instead I decided to call skipahead(img_id, pxl_rng) before reading from the per-pixel RNG. That way, each image reads from the same RNG data but at a different sequence offset.
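
In kernel code, that might look like this (a sketch, using the thread-local copy from above):

// Each image reads the shared per-pixel sequence at its own offset, so all
// 32 images draw decorrelated values from the same 4096 states.
curandState local_pxl_state = per_pxl_rng_state[pxl_id];
skipahead((unsigned long long)img_id, &local_pxl_state);
unsigned int per_pxl_noise = curand(&local_pxl_state);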

  • Alternatively, I think it should work to allocate two curandState objects for every thread and just share sequence numbers between them when I want their values to be synchronized (see the sketch after these comments). The main drawback here is the memory footprint of 262,144 curandState objects (vs 4,128 in the above solution). That's more space than the images themselves take up, but it may be acceptable. I'd still have to keep track of how many random values I generate, since if one thread draws too many it would get out of sync. But that would mean tracking exceptions rather than tracking every random value I draw. – Nate Gaylinn Dec 27 '22 at 23:53
  • I think I found a way to avoid this issue by using fewer random numbers and populating them with fewer threads. What makes this hard is using many threads to populate the same vector of pseudo-random numbers with strict reproducibility requirements. That's not what curand is designed to do, so I had to find workarounds. It works much better when a single thread is generating many random numbers. In my case, that means less than full utilization and efficiency, but that's fine. It's probably still better than doing it all in one CPU thread and transferring the data between host and GPU. – Nate Gaylinn Jan 22 '23 at 19:30
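
For reference, a minimal sketch of the two-states-per-thread alternative from the first comment above (the kernel name and thread_id layout are illustrative, assuming the same 32-image, 64x64 setup):

__global__ void init_rngs_per_thread(curandState* img_states, curandState* pxl_states) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int img_id = blockIdx.z;
    int pxl_id = col * 64 + row;
    int thread_id = img_id * 64 * 64 + pxl_id;  // 32 * 4096 = 131,072 threads

    // Every thread owns two private states, so no state is ever shared or
    // raced. Threads that should produce identical values are seeded with
    // the same sequence number: all threads of one image share sequence
    // img_id, and all threads at one pixel position share sequence pxl_id.
    curand_init(42, img_id, 0, &img_states[thread_id]);
    curand_init(42, pxl_id, 0, &pxl_states[thread_id]);
}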