
I am using atomic operations to calculate a sum of values, similar to a histogram.

First, each block accumulates its values into shared memory; then the per-block results in shared memory are added to global memory.

The whole code follows.

__global__ void KERNEL_RIGID_force_sum(part1*P1,part3*P3,int_t*nop_sol,Real*xcm,Real*ycm,Real*zcm,Real*sum_fx,Real*sum_fy,Real*sum_fz)
{
    int_t i=threadIdx.x+blockIdx.x*blockDim.x;
    // initialize accumulation arrays in shared memory; every thread must
    // reach the __syncthreads() calls below, so the bounds and type checks
    // are applied with a condition further down instead of an early return
    __shared__ int_t tmp_nop[128];
    __shared__ Real tmp_xcm[128],tmp_ycm[128],tmp_zcm[128];
    __shared__ Real tmp_fx[128],tmp_fy[128],tmp_fz[128];

    tmp_nop[threadIdx.x]=0;
    tmp_xcm[threadIdx.x]=0;
    tmp_ycm[threadIdx.x]=0;
    tmp_zcm[threadIdx.x]=0;
    tmp_fx[threadIdx.x]=0;
    tmp_fy[threadIdx.x]=0;
    tmp_fz[threadIdx.x]=0;
    __syncthreads();

    if(i<k_num_part2 && P1[i].i_type!=3){
        // if(P1[i].p_type<RIGID) return;
        Real xi,yi,zi;
        Real fxi,fyi,fzi;
        int_t ptypei;

        ptypei=P1[i].p_type;
        xi=P1[i].x;
        yi=P1[i].y;
        zi=P1[i].z;
        fxi=P3[i].ftotalx;
        fyi=P3[i].ftotaly;
        fzi=P3[i].ftotalz;

        // accumulate values into shared memory (ptypei must stay below 128)
        atomicAdd(&tmp_nop[ptypei],1);
        atomicAdd(&tmp_xcm[ptypei],xi);
        atomicAdd(&tmp_ycm[ptypei],yi);
        atomicAdd(&tmp_zcm[ptypei],zi);
        atomicAdd(&tmp_fx[ptypei],fxi);
        atomicAdd(&tmp_fy[ptypei],fyi);
        atomicAdd(&tmp_fz[ptypei],fzi);
    }
    __syncthreads();

    // save shared memory values to global memory
    atomicAdd(&nop_sol[threadIdx.x],tmp_nop[threadIdx.x]);
    atomicAdd(&xcm[threadIdx.x],tmp_xcm[threadIdx.x]);
    atomicAdd(&ycm[threadIdx.x],tmp_ycm[threadIdx.x]);
    atomicAdd(&zcm[threadIdx.x],tmp_zcm[threadIdx.x]);
    atomicAdd(&sum_fx[threadIdx.x],tmp_fx[threadIdx.x]);
    atomicAdd(&sum_fy[threadIdx.x],tmp_fy[threadIdx.x]);
    atomicAdd(&sum_fz[threadIdx.x],tmp_fz[threadIdx.x]);

}

But there are some problems.

Because the number of threads per block in my code is 128, I allocate the shared memory arrays (and the matching global arrays) with size 128.

What can I do if I want shared memory arrays larger than the maximum number of threads per block, 1,024? (i.e., when there are more than 1,024 values of p_type)

If I allocate the shared memory arrays with size 1,024 or higher, the compiler reports

ptxas error   : Entry function '_Z29KERNEL_RIGID_force_sum_sharedP17particles_array_1P17particles_array_3PiPdS4_S4_S4_S4_S4_' uses too much shared data (0xd000 bytes, 0xc000 max)

To put it simply, I don't know what to do when the size of the reduction is larger than 1,024.

Is it possible to index the arrays using anything other than threadIdx.x?

Could you give me some advice?

HYChoi

1 Answer


Shared memory is limited in size. The default limit for most GPUs is 48KB per threadblock. It has no direct connection to the number of threads in the threadblock. Some GPUs can go as high as 96KB per block, but you haven't indicated which GPU you are running on. The error you are getting is not directly related to the number of threads per block, but to the amount of shared memory you are requesting per block.
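As a side note: on devices that do offer more than 48KB per block (cc7.0 and newer), the extra capacity is only available through dynamically allocated shared memory, and the kernel must opt in explicitly. A minimal sketch (the names here are illustrative, not taken from your code):

```cuda
#include <cuda_runtime.h>

// Sketch: opting in to more than 48KB of dynamic shared memory (cc7.0+).
__global__ void sum_kernel(const float *val, const int *type, int n,
                           float *out, int num_types)
{
    extern __shared__ float tmp[];                 // sized at launch time
    for (int t = threadIdx.x; t < num_types; t += blockDim.x)
        tmp[t] = 0.0f;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) atomicAdd(&tmp[type[i]], val[i]);   // per-block accumulation
    __syncthreads();

    for (int t = threadIdx.x; t < num_types; t += blockDim.x)
        atomicAdd(&out[t], tmp[t]);                // flush block totals
}

void launch(const float *val, const int *type, int n, float *out,
            int num_types)
{
    size_t smem_bytes = num_types * sizeof(float);
    // raise this kernel's dynamic shared memory limit above the 48KB default
    cudaFuncSetAttribute(sum_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smem_bytes);
    sum_kernel<<<(n + 127) / 128, 128, smem_bytes>>>(val, type, n, out,
                                                     num_types);
}
```

Whether this helps depends on the device; it does not remove the hardware cap, it only lets a kernel use what the hardware actually has.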

If the amount of shared memory you need exceeds the shared memory available, you'll need to come up with another algorithm. For example, a shared memory reduction using atomics (what you seem to have here) could be converted into an equivalent operation using global atomics.
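Using the identifiers from your kernel, a global-atomics-only version might look like this (a sketch, not tested against your code; the output arrays must be allocated with one slot per p_type value and zeroed before the launch):

```cuda
// Sketch: the same accumulation done entirely with global atomics.
// With no __syncthreads() in the kernel, early returns are safe, and the
// arrays can be as large as global memory allows.
__global__ void KERNEL_RIGID_force_sum_global(part1 *P1, part3 *P3,
        int_t *nop_sol, Real *xcm, Real *ycm, Real *zcm,
        Real *sum_fx, Real *sum_fy, Real *sum_fz)
{
    int_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= k_num_part2) return;
    if (P1[i].i_type == 3) return;

    int_t p = P1[i].p_type;   // p can now exceed the block size
    atomicAdd(&nop_sol[p], 1);
    atomicAdd(&xcm[p], P1[i].x);
    atomicAdd(&ycm[p], P1[i].y);
    atomicAdd(&zcm[p], P1[i].z);
    atomicAdd(&sum_fx[p], P3[i].ftotalx);
    atomicAdd(&sum_fy[p], P3[i].ftotaly);
    atomicAdd(&sum_fz[p], P3[i].ftotalz);
}
```

Global atomics are slower than shared atomics when many threads collide on the same slot, but with many distinct p_type values the contention per slot is often low enough that this is acceptable.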

Another approach would be to determine if it is possible to reduce the size of the array elements you are using. I have no idea what your types (Real, int_t) correspond to, but depending on the types, you may be able to get larger array sizes by converting to 16-bit types. cc7.x or higher devices can do atomic add operations on 16-bit floating point, for example, and with a bit of effort you can even do atomics on 8-bit integers.
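For the 16-bit floating point case, a minimal sketch of shared memory atomics on __half follows (requires compiling for cc7.0 or higher; the kernel and variable names are made up for illustration, and half precision may not be accurate enough for your sums):

```cuda
#include <cuda_fp16.h>

// Sketch: 1,024 histogram bins in shared memory using __half accumulators,
// which take 2KB per array instead of 4KB (float) or 8KB (double).
__global__ void half_hist(const float *val, const int *type, int n,
                          __half *out)
{
    __shared__ __half tmp[1024];
    for (int t = threadIdx.x; t < 1024; t += blockDim.x)
        tmp[t] = __float2half(0.0f);
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        atomicAdd(&tmp[type[i]], __float2half(val[i])); // cc7.0+ only
    __syncthreads();

    for (int t = threadIdx.x; t < 1024; t += blockDim.x)
        atomicAdd(&out[t], tmp[t]);
}
```

Note the trade-off: __half has roughly 3 decimal digits of precision and a small representable range, so sums over many particles can lose accuracy or overflow.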

Robert Crovella