Aggregate many small arrays in fewer large arrays by basic function

Question

I have many small 2D arrays (e.g. M x 32 x 40) and fewer larger 2D arrays (e.g. N x 200 x 300). I would like to 'put' the smaller matrices at indices n,i,j in the larger arrays (upper left index of the array at batch index n). These small arrays could overlap and should be aggregated by functions that are associative and commutative say plus, multiply, etc.

I figure this is a pretty basic scenario that many people should have come across, right? Is there a cuda implementation that supports this in an efficient way?

Typical values M = 10^6, N = 10^4

I have made a small illustration

The minus operator is not associative nor commutative, sot this is probably much more complex to implement this efficiently in CUDA than only with associative+commutative operators like the plus operator. The same thing apply for combined operators. This also means that the order of the matrices is important. — Jérôme Richard, Mar 11 '21 at 21:38
Very good comment! Edited my question. Its actually about plus — Haydnspass, Mar 11 '21 at 22:26
It's not clear how you determine `n`; I think that will be important to find a performant solution. Does each of your `M` arrays have a `n` associated with it, in some other array, perhaps, that identifies which of the `N` arrays it should be added to? — Robert Crovella, Mar 11 '21 at 22:35
small `n` is known and given (as is `i` and `j`). Yes each of the `M` arrays has an `n` associated with it. — Haydnspass, Mar 11 '21 at 22:48
This problem looks to be analogous to the finite element method assembly process. If you search, you should find a range of papers from the early days of CUDA through until the present. I doubt you will find an off-the-shelf library implementation — talonmies, Mar 12 '21 at 02:37

score 1 · Accepted Answer · answered Mar 13 '21 at 19:30

This is a reduction operation.

In addition to what is expressed in the comments, I'll make the assumption that the distribution of the M matrices in terms of which of the N matrices they belong to, is relatively uniform, i.e. evenly distributed. This means for the dimensions given, that there will be approximately 100 of the M matrices that intended to update N matrix 0, 100 for N matrix 1, and so on. Furthermore, if we inspect the n array, we would observe a uniformly random pattern of indices (i.e. no clumping or grouping).

Given that, in what may be a first for me, I'll suggest a lock/critical section algorithm, using the plumbing from here. Each threadblock will take one of the M arrays, and attempt to acquire a lock so that it can update the appropriate N array. When finished, release the lock.

I considered other approaches as well, some of which are evident in the code. In any event, for the stated conditions, the lock based approach had a kernel runtime of about 40ms on my V100 GPU, which was the best I observed.

I would also note that the stated dimensions result in a data working set of ~8GB. Not that that is a problem, just be aware if running this code as-is on your laptop GPU.

Here's an example:

$ cat t34.cu
#include <iostream>
#include <cstdlib>

const int N = 10000;
const int M = 1000000;
const int Mx = 32;
const int My = 40;
const int Nx = 200;
const int Ny = 300;
const int nTPB = 256;

template <typename T>
__host__ __device__
T reduction_op(T &a, const T &b){ return a+b;}

template <typename T>
__global__ void k(const T * __restrict__ M, T * __restrict__ N, const int * __restrict__ n, const int * __restrict__ i, const int * __restrict__ j, const int num_M){
  for (int ii = 0; ii < num_M; ii++){
    if (n[ii] == blockIdx.x) {
      for (int jj = threadIdx.x; jj < Mx*My; jj += blockDim.x){
        int y = jj/Mx;
        int x = jj - (y*Mx);
        N[blockIdx.x*Nx*Ny + i[ii] + (j[ii]+y)*Nx + x] = reduction_op(
          N[blockIdx.x*Nx*Ny + i[ii] + (j[ii]+y)*Nx + x], M[ii*Mx*My + y*Mx + x]);}
      }
    __syncthreads();}
}

// assumes Ny is whole-number divisible by sl
template <typename T>
__global__ void ki(const T * __restrict__ M, T * __restrict__ N, const int * __restrict__ n, const int * __restrict__ i, const int * __restrict__ j, const int num_M, const int sl){
  extern __shared__ T s[];
  for (int c = 0; c < Ny; c+=sl){  // process per chunk of N array
// load shared
    for (int t = threadIdx.x; t < sl*Nx; t += blockDim.x) s[t] = N[blockIdx.x*Nx*Ny + c*Nx + t];
    __syncthreads();
// process chunk stack
    for (int ii = 0; ii < num_M; ii++){  // iterate through "stack"
      if ((n[ii] == blockIdx.x) && (j[ii] < (c+sl)) && ((j[ii]+My) > c)) {
        for (int jj = threadIdx.x; jj < sl*Mx; jj += blockDim.x){
          int y = jj/Mx;
          int x = jj - (y*Mx);
          //y += c;
          if ((y+c >= j[ii]) && (y+c < (j[ii]+My)))
            s[y*Nx+x+i[ii]] = reduction_op(s[y*Nx+x+i[ii]], M[ii*Mx*My + (y+c-j[ii])*Mx + x]);}
      }
    __syncthreads();}
// save shared
    for (int t = threadIdx.x; t < sl*Nx; t += blockDim.x) N[blockIdx.x*Nx*Ny + c*Nx + t] = s[t];
  }
}


template <typename T>
__global__ void ka(const T * __restrict__ M, T * __restrict__ N, const int * __restrict__ n, const int * __restrict__ i, const int * __restrict__ j, const int num_M){
   int x = threadIdx.x;
   for (int y = threadIdx.y; y < My; y += blockDim.y)
     atomicAdd(N+n[blockIdx.x]*Nx*Ny+(j[blockIdx.x]+y)*Nx+i[blockIdx.x]+x, M[blockIdx.x*Mx*My+y*Mx+x]);
}

__device__ void acquire_semaphore(volatile int *lock){
  while (atomicCAS((int *)lock, 0, 1) != 0);
  }

__device__ void release_semaphore(volatile int *lock){
  *lock = 0;
  __threadfence();
  }


template <typename T>
__global__ void kl(const T * __restrict__ M, T * __restrict__ N, const int * __restrict__ n, const int * __restrict__ i, const int * __restrict__ j, const int num_M, int * __restrict__ locks){

  if ((threadIdx.x == 0) && (threadIdx.y == 0))
    acquire_semaphore(locks+n[blockIdx.x]);
  __syncthreads();
  //begin critical section
  int x = threadIdx.x;
  for (int y = threadIdx.y; y < My; y += blockDim.y){
        N[n[blockIdx.x]*Nx*Ny + i[blockIdx.x] + (j[blockIdx.x]+y)*Nx + x] = reduction_op(
          N[n[blockIdx.x]*Nx*Ny + i[blockIdx.x] + (j[blockIdx.x]+y)*Nx + x], M[blockIdx.x*Mx*My + y*Mx + x]);}
  // end critical section
  __threadfence(); // not strictly necessary for the lock, but to make any global updates in the critical section visible to other threads in the grid
  __syncthreads();
  if ((threadIdx.x == 0) && (threadIdx.y == 0))
    release_semaphore(locks+n[blockIdx.x]);
}

typedef float mt;

int main(){

  mt *d_M, *h_M, *d_N, *h_N, *r1, *r2;
  int *d_n, *h_n, *d_i, *h_i, *d_j, *h_j;
  h_M = new mt[M*Mx*My];
  h_N = new mt[N*Nx*Ny];
  r1 = new mt[N*Nx*Ny];
  r2 = new mt[N*Nx*Ny];
  h_n = new int[M];
  h_i = new int[M];
  h_j = new int[M];
  cudaMalloc(&d_M, M*Mx*My*sizeof(mt));
  cudaMalloc(&d_N, N*Nx*Ny*sizeof(mt));
  cudaMalloc(&d_n, M*sizeof(int));
  cudaMalloc(&d_i, M*sizeof(int));
  cudaMalloc(&d_j, M*sizeof(int));
  for (int i = 0; i < M; i++){
    h_n[i] = rand()%N;
    h_i[i] = rand()%(Nx - Mx);
    h_j[i] = rand()%(Ny - My);}
  for (int i = 0; i < N*Nx*Ny; i++) h_N[i] = (mt)(i%3);
  for (int i = 0; i < M*Mx*My; i++) h_M[i] = (mt)((i%3)+1);
  cudaMemcpy(d_M, h_M, M*Mx*My*sizeof(mt), cudaMemcpyHostToDevice);
  cudaMemcpy(d_N, h_N, N*Nx*Ny*sizeof(mt), cudaMemcpyHostToDevice);
  cudaMemcpy(d_n, h_n, M*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_i, h_i, M*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_j, h_j, M*sizeof(int), cudaMemcpyHostToDevice);
#ifdef USE_SINGLE_N
  cudaMemset(d_n, 0, M*sizeof(int));
#endif
#if 0
  const int sl = 40;
  const int sb = sl * Nx * sizeof(mt);
  ki<<<N, nTPB, sb>>>(d_M, d_N, d_n, d_i, d_j, M, sl);
  cudaMemcpy(r2, d_N, N*Nx*Ny*sizeof(mt), cudaMemcpyDeviceToHost);
#endif
  dim3 block(Mx, 8);
#if 0
  ka<<<M, block>>>(d_M, d_N, d_n, d_i, d_j, M);
  cudaMemcpy(r2, d_N, N*Nx*Ny*sizeof(mt), cudaMemcpyDeviceToHost);
#endif
  int *d_locks;
  cudaMalloc(&d_locks, N*sizeof(int));
  cudaMemset(d_locks, 0, N*sizeof(int));
  kl<<<M, block>>>(d_M, d_N, d_n, d_i, d_j, M, d_locks);
  cudaMemcpy(r2, d_N, N*Nx*Ny*sizeof(mt), cudaMemcpyDeviceToHost);
  cudaMemcpy(d_N, h_N, N*Nx*Ny*sizeof(mt), cudaMemcpyHostToDevice);
  k<<<N, nTPB>>>(d_M, d_N, d_n, d_i, d_j, M);
  cudaMemcpy(r1, d_N, N*Nx*Ny*sizeof(mt), cudaMemcpyDeviceToHost);
  for (int i = 0; i < N*Nx*Ny; i++) if (r1[i] != r2[i]) {std::cout << "mismatch at: " << i << " was: " << r2[i] << " should be: " << r1[i] << std::endl; return 0;}
}
$ nvcc -o t34 t34.cu -O3 -lineinfo
$ nvprof ./t34
==17970== NVPROF is profiling process 17970, command: ./t34
==17970== Profiling application: ./t34
==17970== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   34.57%  3.09036s         2  1.54518s  1.54294s  1.54742s  [CUDA memcpy DtoH]
                   33.18%  2.96615s         1  2.96615s  2.96615s  2.96615s  void k<float>(float const *, float*, int const *, int const *, int const *, int)
                   31.81%  2.84401s         6  474.00ms  1.4255ms  1.27035s  [CUDA memcpy HtoD]
                    0.45%  39.949ms         1  39.949ms  39.949ms  39.949ms  void kl<float>(float const *, float*, int const *, int const *, int const *, int, int*)
                    0.00%  2.1120us         1  2.1120us  2.1120us  2.1120us  [CUDA memset]
      API calls:   96.13%  8.94558s         8  1.11820s  1.9203ms  4.51030s  cudaMemcpy
                    3.60%  334.59ms         6  55.765ms  277.58us  330.37ms  cudaMalloc
                    0.15%  13.752ms         8  1.7190ms  1.3268ms  2.2025ms  cuDeviceTotalMem
                    0.11%  10.472ms       808  12.959us     172ns  728.50us  cuDeviceGetAttribute
                    0.01%  997.81us         8  124.73us  100.93us  176.73us  cuDeviceGetName
                    0.00%  69.047us         2  34.523us  32.349us  36.698us  cudaLaunchKernel
                    0.00%  68.013us         1  68.013us  68.013us  68.013us  cudaMemset
                    0.00%  46.172us         8  5.7710us  1.8940us  23.025us  cuDeviceGetPCIBusId
                    0.00%  8.5060us        16     531ns     260ns  1.5030us  cuDeviceGet
                    0.00%  3.7870us         8     473ns     229ns     881ns  cuDeviceGetUuid
                    0.00%  3.3980us         3  1.1320us     610ns  2.0780us  cuDeviceGetCount
$

Extended discussion:

On performance:

This is a memory bound algorithm. Therefore, we can estimate optimal kernel performance by determining the minimum number of memory reads and writes needed to perform the operation, then dividing by the available memory bandwidth, to determine the optimal or lower-bound for kernel duration. Unfortunately the determination of the minimum number of reads and writes depends on the positioning of the M matrices, so cannot be easily generally determined, without inspecting the n, i, and j matrices.

However we can look for another way to estimate. Another approach to estimation would be to observe that each M matrix update will require reading 2 values and writing one value. If we then use that as our estimate, we come up with M*Mx*My*3*sizeof(element_of_M)/GPU_memory_bandwidth. On my V100 (~700GB/s BW) this works out to about 20ms lower bound on kernel duration.

On approaches considered:

"naive" approach, kernel k: Each threadblock will be responsible for one of the N matrices, and will iterate through the M matrices, inspecting n to determine if the M matrices will update the assigned N matrix. This gives a non-optimal run time of ~3s but seems to be mostly invariant performance-wise based on the distribution of n, and can use an "arbitrary" reduction op.
attempt at "optimal" approach, kernel ki: Each threadblock will be responsible for one of the N matrices, but will only load a chunk of that matrix at a time. It will then proceed through the M matrices updating that chunk, similar the the k kernel. This necessitates more loops through the matrices, but should "almost" only load or save each global memory item the minimum number of times necessary. Nevertheless, the run time is really long, ~40s
atomic approach, kernel ka: Each threadblock will be responsible for one of the M matrices, and will atomically update the relevant N matrix. Simplicity. And the runtime is "fast" at ~40ms. (The atomic approach may be even faster than this is non-uniform n distributions. I witnessed kernel runtimes as low as 8ms!) However this is not readily generalizable to operations that don't have an atomic equivalent, such as multiply.
lock based approach, kernel kl: Like the atomic approach, each threadblock will be responsible for one of the M matrices, and will first acquire a lock on the relevant N matrix. The lock means that atomics are not necessary. For the uniformly distributed n case presented, it has about the same performance as the atomic case. It has the benefit that it can handle other reduction ops, such as multiply, readily. A disadvantage is that in the presence of non-uniformly-random distribution in n the performance can suffer, with a worst case in the ballpark of the naive kernel (3-5s).

Overall if the requirement for an arbitrary reduction operator can be dropped (e.g. only use addition, for example) then the atomic method may be best.

Aggregate many small arrays in fewer large arrays by basic function

1 Answers1