
Can someone please help me with a very simple example of how to use shared memory? The example included in the CUDA C Programming Guide seems cluttered by irrelevant details.

For example, if I copy a large array to the device global memory and want to square each element, how can shared memory be used to speed this up? Or is it not useful in this case?

talonmies
Tudor

2 Answers


In the specific case you mention, shared memory is not useful, for the following reason: each data element is used only once. For shared memory to help, the data transferred into it must be used several times, with good access patterns. The reason is simple: reading straight from global memory requires one global memory read and zero shared memory reads; staging through shared memory first requires one global memory read and one shared memory read, which takes longer.

Here's a simple example, where each thread in the block computes its own value, squared, plus the squared average of its left and right neighbors (with wraparound at the block edges):

  __global__ void compute_it(float *data)
  {
     int tid = threadIdx.x;
     __shared__ float myblock[1024];
     float tmp;

     // load the thread's data element into shared memory
     myblock[tid] = data[tid];

     // ensure that all threads have loaded their values into
     // shared memory; otherwise, one thread might be computing
     // on uninitialized data.
     __syncthreads();

     // compute the average of this thread's left and right neighbors
     tmp = (myblock[tid > 0 ? tid - 1 : 1023] + myblock[tid < 1023 ? tid + 1 : 0]) * 0.5f;
     // square the previous result and add my value, squared
     tmp = tmp*tmp + myblock[tid] * myblock[tid];

     // write the result back to global memory
     data[tid] = tmp;
  }

Note that this is written to work with only one block; the extension to more blocks should be straightforward. It assumes block dimension (1024, 1, 1) and grid dimension (1, 1, 1).
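For completeness, a host-side launch matching those dimensions might look like the following (a sketch, not part of the original answer; error checking is omitted for brevity):

```cuda
#include <stdlib.h>

int main(void)
{
    const int N = 1024;  // must match the shared array size in the kernel
    float *h_data = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i)
        h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // grid (1, 1, 1), block (1024, 1, 1)
    compute_it<<<1, N>>>(d_data);

    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```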

Karu
Patrick87
  • Didn't you miss a synchronization barrier before starting to compute on the neighbors' data in shared memory? – pQB Nov 04 '11 at 15:34
  • My favorite image to use when explaining shared memory is a Formula 1 pit stop: www.youtube.com/watch?v=UUvagsM176o The operation is performed in a somewhat constrained environment that enables a bunch of people to work in parallel on the same car, so the task gets done much more quickly. – ArchaeaSoftware Nov 05 '11 at 13:22
  • Thanks for this very helpful example; there was an extra parenthesis at the end of the first line involving tmp, so I edited it. – John Powell Jan 22 '19 at 19:58

Think of shared memory as an explicitly managed cache: it's only useful if you need to access data more than once, either within the same thread or from different threads within the same block. If you're only accessing each datum once, shared memory isn't going to help you.
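As an illustration of that reuse, here is a sketch of the classic shared-memory tiled matrix multiply, C = A * B for square n-by-n matrices (illustrative only; the tile size and names are hypothetical, and n is assumed to be a multiple of TILE). Each element staged into shared memory is read TILE times, which is what makes the extra copy worthwhile:

```cuda
#define TILE 16

// Launch with block (TILE, TILE, 1) and grid (n/TILE, n/TILE, 1).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // each thread stages one element of each tile into shared memory
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // wait until both tiles are fully loaded

        // every element of As and Bs is reused TILE times here
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // don't overwrite tiles others are still reading
    }
    C[row * n + col] = acc;
}
```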

Paul R
  • So for example, in the array-squaring problem it would not help, but if I cache, say, A and B in a matrix multiplication, it works because they are reused several times? – Tudor Nov 04 '11 at 15:21
  • @Tudor: exactly - anything that would benefit from say L1 cache in a conventional app could potentially benefit from using shared memory in a CUDA application - so if you're just reading a value, squaring it and then writing it out, then there is no benefit in either case. – Paul R Nov 04 '11 at 15:23