17

Consider the following code:

__global__ void kernel(int *something) {
    extern __shared__ int shared_array[];     

    // Some operations on shared_array here.
}

Is it possible to initialize the whole shared_array to some value - e.g. 0 - without explicitly addressing each cell in some thread?

paleonix
  • 2,293
  • 1
  • 13
  • 29
fsh
  • 319
  • 1
  • 2
  • 8

3 Answers3

21

You can efficiently initialize shared arrays in parallel like this

// if SHARED_SIZE == blockDim.x, eliminate this loop
for (int i = threadIdx.x; i < SHARED_SIZE; i += blockDim.x) 
    shared_array[i] = INITIAL_VALUE;
__syncthreads();
harrism
  • 26,505
  • 2
  • 57
  • 88
  • That's only the case if you have a 1D block, of course. Saying just so any newbies don't fall into obvious traps. I also wonder how much boost float4, which is another trick still gives on newer devices and how much benefit does it provide combined with this memory coalescing type of init. Sidenote, if you're loading inside a 2D, or 3D kernel, it's importrant to know that they are partitioned into warps like array of [z][y][x] is inside memory. So let [x] threads write closest to each other and those different in their [z] furthest. – Íhor Mé Aug 19 '16 at 17:48
  • 1
    So I've tried and yes, using reinterpret_cast to copy in 16-byte chunks like in https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/ still gets a little bit better results. Also, it's important to set __restrict__ on the data you want to copy --- easy +15% performance boost, about as much as 16-byte aligned chunks give. – Íhor Mé Aug 19 '16 at 19:42
  • Mind that memory coalescing can mean a 16x speed difference easily. – Íhor Mé Aug 19 '16 at 20:27
  • @ÍhorMé Not sure why you say this can't work for a 2D or 3D block. You need to do the addressing appropriately, but the general approach works regardless of block dimension. – harrism Aug 22 '16 at 02:12
  • Yes, I'm talking about the exact code snippet. The basic idea will work, but in order to implement it with 2D/3D blocks one also has to know how blocks are partitioned into warps to make sure that memory will coalesce. – Íhor Mé Aug 23 '16 at 10:42
18

No. Shared memory is uninitialised. You have to somehow initialise it yourself, one way or another...

From CUDA C Programming Guide 3.2, Section B.2.4.2, paragraph 2:

__shared__ variables cannot have an initialization as part of their declaration.

This also discards nontrivial default constructors for shared variables.

CygnusX1
  • 20,968
  • 5
  • 65
  • 109
2

Yes, you can. You can specify that the first thread in the block sets it, while the other's don't eg.:

extern __shared__ unsigned int local_bin[]; // Size specified in kernel call

if (threadIdx.x == 0) // Wipe on first thread - include " && threadIdx.y == 0" and " && threadIdx.z == 0"  if threadblock has 2 or 3 dimensions instead of 1.
{
    // For-loop to set all local_bin array indexes to specified value here - note you cannot use cudaMemset as it translates to a kernel call itself
}

// Do stuff unrelated to local_bin here    

__syncthreads(); // To make sure the memset above has completed before other threads start writing values to local_bin.

// Do stuff to local_bin here

Ideally you should do as much work as possible before the syncthreads call, as this allows for all the other threads to do their work before the memset is complete - obviously this only matters if the work has the potential to have quite different thread completion times, for example if there is conditional branching. Note that for the thread 0 "setting" for-loop, you need to have passed the size of the local_bin array as a parameter to the kernel so you know the size of the array you are iterating.

Original concept source

metamorphosis
  • 1,972
  • 16
  • 25
  • Thank you. I used something similar to this in my final implementation. – rayryeng May 30 '17 at 05:48
  • 4
    this loses the benefits of parallelization, always try to use threadIdx and BlockIdx as much as possible – ejectamenta Sep 25 '17 at 12:28
  • 2
    I think you're right in general, but it depends on whether the initialization value can be calculated from those or whether it has a more complex numerical/etc assignment pattern. – metamorphosis Sep 26 '17 at 00:54