I'm trying to generate 32 bitmaps, each 64x64 pixels, with a single CUDA kernel call. When rendering these images, I want to randomize the parameters of image generation both per-image and per-pixel. That is, some randomized decisions happen once and apply consistently to all pixels in an image, while other decisions are made independently for each pixel. I'm trying to figure out a cuRAND setup to enable this.
The approach I have so far uses two state arrays: one with 32 sequences (one per image) and another with 4096 sequences (one per pixel). I pass both into my kernel and compute each pixel value based on both. This sorta works, but I'm seeing some weird artifacts. I'm looking for advice on how to fix this, or suggestions for an alternative approach that would work better.
If I render the images using only the per-pixel noise, I would expect to get the same image of random static 32 times. What I actually get is different but highly correlated images of random static. Interestingly, the first several images are almost identical, and the later images (larger img_id) become more different.
If I render the images using only the per-image noise, I would expect each image to be a solid block of some random color. What I actually get is mostly images of a solid color, but sometimes the four quadrants of the image aren't the same. Again, the first images are all consistent, and the later images are more varied.
I suspect part of my problem is that each 64x64 image is actually composed of a 2x2 grid of blocks that are 32x32 threads each (my device supports at most 1024 threads per block). The cuRAND docs say "two different blocks can not operate on the same state safely," but I don't see any guidance on what to do about that.
Can anyone offer some insight into what's going wrong here? Any advice on how to fix this, or another approach that would work better?
Code snippet below:
__global__ void init_rngs(curandState* per_img_rng_state, curandState* per_pxl_rng_state) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int img_id = blockIdx.z * blockDim.z;
    int pxl_id = col * 64 + row;
    curand_init(42, img_id, 0, &per_img_rng_state[img_id]);
    curand_init(42, pxl_id, 0, &per_pxl_rng_state[pxl_id]);
}
__global__ void make_images(curandState* per_img_rng_state, curandState* per_pxl_rng_state, unsigned char* image) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int img_id = blockIdx.z * blockDim.z;
    int pxl_id = col * 64 + row;
    // image is a flat buffer laid out as [img_id][row][col].
    int out = (img_id * 64 + row) * 64 + col;
    unsigned int per_img_noise = curand(&per_img_rng_state[img_id]);
    unsigned int per_pxl_noise = curand(&per_pxl_rng_state[pxl_id]);
    // An example of logic mixing the two sources of noise.
    unsigned int density = per_img_noise;
    unsigned int value = per_img_noise ^ per_pxl_noise;
    image[out] = (value >= density) ? 0x00 : 0xFF;
    // An example using only per-pixel noise:
    image[out] = (per_pxl_noise & 1) ? 0x00 : 0xFF;
    // An example using only per-image noise:
    image[out] = per_img_noise / 16777216;
}
void randomize_images() {
    curandState* per_img_rng_state = nullptr;
    curandState* per_pxl_rng_state = nullptr;
    unsigned char* image = nullptr;
    cudaMalloc(&image, 32 * 64 * 64);
    cudaMalloc(&per_img_rng_state, 32 * sizeof(curandState));
    cudaMalloc(&per_pxl_rng_state, 64 * 64 * sizeof(curandState));
    // Blocks are arranged 2x2x32, meaning 32 images made out of 4 blocks in a 2x2 grid.
    // Each block gets 32x32 threads, one per pixel in each quadrant of the image.
    dim3 grid(2, 2, 32);
    dim3 block(32, 32);
    init_rngs<<<grid, block>>>(per_img_rng_state, per_pxl_rng_state);
    make_images<<<grid, block>>>(per_img_rng_state, per_pxl_rng_state, image);
}
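For reference, here is one alternative I could imagine for the shared-state concern above. It is only a sketch, not a confirmed fix. It drops both state arrays and has every thread build private, thread-local states keyed by image and by pixel. The names make_images_stateless, render_stateless, kImages and kSide are made up for illustration. The choice of the Philox generator is an assumption on my part, picked because it is counter-based and its curand_init with a nonzero subsequence should be cheap. Putting pixel subsequences after the 32 image subsequences is likewise just a convention I chose to keep the two kinds of stream apart.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

constexpr int kImages = 32;  // number of images (hypothetical constant)
constexpr int kSide = 64;    // image width and height in pixels

// Hypothetical variant of make_images that keeps no RNG state in global memory.
__global__ void make_images_stateless(unsigned char* image, unsigned long long seed) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int img_id = blockIdx.z;
    int pxl_id = col * kSide + row;

    // Every thread of image img_id builds an identical private state, so they
    // all draw the same per-image value without sharing anything.
    curandStatePhilox4_32_10_t img_rng;
    curand_init(seed, img_id, 0, &img_rng);
    unsigned int per_img_noise = curand(&img_rng);

    // Assumption: subsequences kImages and up are reserved for pixels.
    // Keyed only by pxl_id, so the same pixel gets the same value in every image.
    curandStatePhilox4_32_10_t pxl_rng;
    curand_init(seed, kImages + pxl_id, 0, &pxl_rng);
    unsigned int per_pxl_noise = curand(&pxl_rng);

    // Same mixing logic as the first example in make_images.
    unsigned int density = per_img_noise;
    unsigned int value = per_img_noise ^ per_pxl_noise;
    image[(img_id * kSide + row) * kSide + col] = (value >= density) ? 0x00 : 0xFF;
}

// Renders all images and copies them to the host; returns an empty vector on error.
std::vector<unsigned char> render_stateless(unsigned long long seed) {
    const size_t bytes = size_t(kImages) * kSide * kSide;
    std::vector<unsigned char> host(bytes);
    unsigned char* device = nullptr;
    if (cudaMalloc(&device, bytes) != cudaSuccess) return {};
    make_images_stateless<<<dim3(2, 2, kImages), dim3(32, 32)>>>(device, seed);
    cudaError_t err = cudaMemcpy(host.data(), device, bytes, cudaMemcpyDeviceToHost);
    cudaFree(device);
    if (err != cudaSuccess) return {};
    return host;
}
```

Because no thread reads or writes a state that another thread touches, the block layout should no longer matter here. The trade-off is that each thread pays for two curand_init calls, and drawing more than one value per image would require every thread to repeat the same sequence of draws from its private per-image state.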