
Here's my problem: I have quite a big set of doubles (an array of 77,500 doubles) to be stored somewhere in CUDA. Now, I need a large number of threads to sequentially perform a bunch of operations on that array. Every thread will have to read the SAME element of the array, perform its work, store the result in shared memory, and then read the next element. Note that at each step every thread will simultaneously have to read (just read) the same memory location. So I wonder: is there any way to broadcast the same double to all threads with just one memory read? Reading it many times would be wasteful... Any ideas?

Matteo Monti

1 Answer


This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:

// choose some reasonable block size
const unsigned int block_size = 256;

__global__ void kernel(double *ptr)
{
  __shared__ double window[block_size];

  // cooperate with my block to load the first block_size elements
  window[threadIdx.x] = ptr[threadIdx.x];

  // wait until the window is full
  __syncthreads();

  // operate on the data
  ...
}

You can iteratively "slide" the window across the array block_size (or maybe some integer factor more) elements at a time to consume the whole thing. The same technique applies when you'd like to store the data back in a synchronized fashion.
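The sliding-window loop described above can be sketched as follows (this is my elaboration of the answer's idea, not code from it; the element count `n` and the second `__syncthreads()` are assumptions):

```cuda
// choose some reasonable block size
const unsigned int block_size = 256;

// Sketch: slide a shared-memory window across the whole array,
// block_size elements at a time. "n" is the array length (assumed
// parameter, e.g. 77500 for the question's array).
__global__ void kernel(const double *ptr, unsigned int n)
{
  __shared__ double window[block_size];

  for (unsigned int base = 0; base < n; base += block_size)
  {
    unsigned int idx = base + threadIdx.x;

    // cooperatively load the next block_size elements (guard the tail)
    if (idx < n)
      window[threadIdx.x] = ptr[idx];

    // wait until the window is full
    __syncthreads();

    // ... operate on window[0 .. min(block_size, n - base)) ...

    // wait until everyone is done with this window before the next
    // iteration overwrites it
    __syncthreads();
  }
}
```

Note the second `__syncthreads()` at the bottom of the loop: without it, a fast thread could start overwriting the window while a slower thread is still reading the previous contents.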

Jared Hoberock
    Also: broadcast from shared memory (i.e. having all the threads in the block read the same memory location) is a fast case. CUDA implementations of N-body problems make use of broadcast in conjunction with the idiom that Jared has described here. – ArchaeaSoftware Nov 01 '11 at 11:34
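To illustrate the broadcast the comment mentions: when every thread in a block reads the same shared-memory word, the hardware services the read as a single broadcast with no bank conflicts. A minimal sketch (the kernel name, `out` parameter, and the summation stand-in for the real per-element work are all illustrative, not from the original):

```cuda
const unsigned int block_size = 256;

__global__ void broadcast_example(const double *ptr, double *out)
{
  __shared__ double window[block_size];

  // cooperative load, as in the answer above
  window[threadIdx.x] = ptr[threadIdx.x];
  __syncthreads();

  double acc = 0.0;
  for (unsigned int j = 0; j < block_size; ++j)
  {
    // every thread reads window[j] for the SAME j: a shared-memory
    // broadcast, not a serialized access
    acc += window[j];  // stand-in for the real per-element computation
  }

  out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```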