
I want to write a kernel to perform computations that depend on all the unique quartets of indices (ij|kl). The code that generates all the unique quartets on the host is as follows:

#include <iostream>
#include <cstdio>   // for printf


using namespace std;

int main(int argc, char** argv)
{
    unsigned int i,j,k,l,ltop;
    unsigned int nao=7;
    for(i=0;i<nao;i++)
    {
        for(j=0;j<=i;j++)
        {
            for(k=0;k<=i;k++)
            {
                ltop=k;
                if(i==k)
                {
                    ltop=j;
                }
                for(l=0;l<=ltop; l++)
                {
                    printf("computing the ERI (%d,%d|%d,%d) \n",i,j,k,l);
                }
            }    
        }
    }

    int m = nao*(nao+1)/2;
    int eris_size = m*(m+1)/2;
    cout<<"The total size of the sack of ERIs is: "<<eris_size<<endl;

    return 0;
}

output:

computing the ERI (0,0|0,0) 
computing the ERI (1,0|0,0) 
computing the ERI (1,0|1,0) 
computing the ERI (1,1|0,0) 
computing the ERI (1,1|1,0) 
computing the ERI (1,1|1,1) 
computing the ERI (2,0|0,0) 
computing the ERI (2,0|1,0) 
computing the ERI (2,0|1,1) 
computing the ERI (2,0|2,0) 
computing the ERI (2,1|0,0) 
computing the ERI (2,1|1,0) 
computing the ERI (2,1|1,1) 
computing the ERI (2,1|2,0) 
computing the ERI (2,1|2,1) 
computing the ERI (2,2|0,0) 
computing the ERI (2,2|1,0) 
computing the ERI (2,2|1,1) 
computing the ERI (2,2|2,0) 
computing the ERI (2,2|2,1) 
computing the ERI (2,2|2,2) 
computing the ERI (3,0|0,0) 
computing the ERI (3,0|1,0) 
computing the ERI (3,0|1,1) 
computing the ERI (3,0|2,0) 
computing the ERI (3,0|2,1) 
computing the ERI (3,0|2,2) 
computing the ERI (3,0|3,0) 
computing the ERI (3,1|0,0) 
computing the ERI (3,1|1,0) 
computing the ERI (3,1|1,1) 
computing the ERI (3,1|2,0) 
computing the ERI (3,1|2,1) 
computing the ERI (3,1|2,2) 
computing the ERI (3,1|3,0) 
computing the ERI (3,1|3,1) 
computing the ERI (3,2|0,0) 
computing the ERI (3,2|1,0) 
computing the ERI (3,2|1,1) 
computing the ERI (3,2|2,0) 
computing the ERI (3,2|2,1) 
computing the ERI (3,2|2,2) 
computing the ERI (3,2|3,0) 
computing the ERI (3,2|3,1) 
computing the ERI (3,2|3,2) 
computing the ERI (3,3|0,0) 
computing the ERI (3,3|1,0) 
computing the ERI (3,3|1,1) 
computing the ERI (3,3|2,0) 
computing the ERI (3,3|2,1) 
computing the ERI (3,3|2,2) 
computing the ERI (3,3|3,0) 
computing the ERI (3,3|3,1) 
computing the ERI (3,3|3,2) 
computing the ERI (3,3|3,3)

I want to recover the same set of quartets using a CUDA kernel, but I cannot get it to work. The CUDA code I have so far is as follows:

#include <iostream>
#include <stdio.h>

using namespace std;

#define ABS(x)  (x<0)?-x:x

__global__
void test_kernel(int basis_size)
{
    unsigned int i_idx = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int j_idx = threadIdx.y + blockIdx.y * blockDim.y;


    // Building the quartets                                                                                                                                                  

    unsigned int i_orbital, j_orbital, k_orbital, l_orbital, i__1,j__1;
    unsigned int i_primitive, j_primitive, k_primitive, l_primitive;

    i_orbital = (i_idx + 1)%basis_size /*+1*/;
    j_orbital = (i__1 = (i_idx) / basis_size, ABS(i__1)) /*+ 1*/;

    k_orbital = (j_idx+1)%basis_size /*+1*/;
    l_orbital = (j__1 = (j_idx) / basis_size, ABS(j__1)) /*+ 1*/;
    unsigned int ltop;
    ltop=k_orbital;
    if(i_orbital==k_orbital)
    {
        ltop=j_orbital;
    }
    if(i_orbital<basis_size && j_orbital<=i_orbital && k_orbital<=i_orbital && l_orbital<=ltop)
        printf("computing the ERI (%d,%d|%d,%d)   \n", i_orbital, j_orbital,k_orbital,l_orbital);
}

int main(int argc, char *argv[])
{
    int nao = 7;
    cudaDeviceReset();
    /* partitioning from blocks to grids */
    int dimx = 8;
    int dimy = 8;
    dim3 block(dimx, dimy);   // we will try blocks of 8x8 threads                                                                                                            
    dim3 grid((nao+block.x-1)/block.x, (nao+block.y-1)/block.y);   // The grids are shaped accordingly                                                                        

    /* Launch the kernel */
    test_kernel<<<grid,block>>>(nao);
    cudaDeviceReset();

    return 0;
}

output:

computing the ERI (2,0|1,1)    
computing the ERI (3,1|3,1)    
computing the ERI (3,0|1,1)    
computing the ERI (1,1|1,1)    
computing the ERI (2,1|1,1)    
computing the ERI (3,1|1,1)    
computing the ERI (3,0|2,1)    
computing the ERI (2,1|2,1)    
computing the ERI (3,1|2,1)    
computing the ERI (1,0|1,0)    
computing the ERI (2,0|1,0)    
computing the ERI (3,0|1,0)    
computing the ERI (1,1|1,0)    
computing the ERI (2,1|1,0)    
computing the ERI (3,1|1,0)    
computing the ERI (2,0|2,0)    
computing the ERI (3,0|3,0)    
computing the ERI (3,0|2,0)    
computing the ERI (2,1|2,0)    
computing the ERI (3,1|3,0)    
computing the ERI (3,1|2,0)    
computing the ERI (1,0|0,0)    
computing the ERI (2,0|0,0)    
computing the ERI (3,0|0,0)    
computing the ERI (0,0|0,0)    
computing the ERI (1,1|0,0)    
computing the ERI (2,1|0,0)    
computing the ERI (3,1|0,0)

The quartets will drive the computation of repulsion integrals whose input parameters are stored in an array of size nao with 3

user3116936

1 Answer

As presented, your code doesn't do "much". It generates indices and prints them out. Furthermore, it's not clear whether nao=7 is actually descriptive of your final problem size, or if it is just for demonstration purposes.

Purely from the standpoint of indices generation, there are a large number of ways to solve your problem. Some of these may naturally lend themselves to efficient usage of the GPU, some may not. But efficient usage of the GPU cannot really be ascertained from the code you have shown so far. For example, one of the desirable goals of a CUDA program is coalesced access from global memory. Since your code shows none of that, it's a bit risky to propose solutions without knowing what the generated access patterns might be.

Therefore it's possible that the "real problem" you are trying to solve may have some considerations that should be contemplated before a final decision is made about index generation (which is effectively tantamount to "assignment of work to a given thread"). I'll try to mention some of these.

Some criticisms of your code:

  1. Your proposed realization suggests use of a 2D threadblock and grid structure to effectively provide the i,j indices that are used by your program. However, from inspection of the original C source code, we see that the j index only covers a space up to the i index:

    for(i=0;i<nao;i++)
    {
        for(j=0;j<=i;j++)
    

    However, the replacement of such a structure with a 2D grid/threadblock structure creates a rectangular space, which is not implied by your nested for-loops in the original C code. Therefore launch of such a space/grid would lead to creation of i,j indices that are out of range; those threads would need to do no work. We could work around this (perhaps that is what your code is trying to do) by "re-purposing" out-of-range threads to cover some of the additional space, but this leads to fairly complex and hard to read arithmetic, and see the next criticism.

  2. Your kernel has no loop in it, so it seems evident that each thread can only generate one line of printout. Therefore each thread can only be responsible for one ERI entry. Based on the C source code you have supplied, for a nao size of 7, you expect 406 ERI entries (your posted printout appears to be incomplete for the code you have shown, BTW). If we have one printout per thread, and we need to cover a space of 406 ERI entries, we had better have at least 406 threads, or our proposed solution simply cannot work. If we inspect your code to determine the size of the grid in terms of threads:

    int dimx = 8;
    int dimy = 8;
    dim3 block(dimx, dimy);   // we will try blocks of 8x8 threads                                                                                                            
    dim3 grid((nao+block.x-1)/block.x, (nao+block.y-1)/block.y);   // The grids are shaped accordingly                                                                        
    

    we will arrive at the conclusion that you are simply not launching enough threads. If you work through the above calculations (you may want to print out the block.x,.y and grid.x,.y values if unsure), you will discover that you are launching one 8x8 threadblock, i.e. 64 threads total, for nao of 7. 64 threads, at one printout per thread, cannot cover an ERI space of 406 entries. In fact, given that your printf is conditioned by an if statement, it may be that you have fewer than one printout per thread.

Possible solution ideas:

  1. A fairly straightforward approach would simply be to map the outer two for-loops in your C source code to x,y indices of a 2D grid. We would then retain, more or less intact, the inner for-loop structure of your C source code, as our kernel code. This is fairly easy to write. It will have the drawback that it will launch some threads that will do nothing (we must check for the condition where j>i and condition such threads to do nothing), but this may be a minor consideration. A bigger issue with this might be how many actual threads we are generating. For your given nao of 7, this would launch 7x7 = 49 threads (some of which do no work). A problem size of 49, or even 406 threads is tiny for a GPU, and it will not be able to achieve peak performance due to a limited number of threads. The "triangular" nature of this problem (j<=i) means that a further improvement of this method would be to launch a linear threadblock and then use a linear-to-triangular index mapping for i,j from the linear threadblock index, which would result in no "wasted" threads. However, a more important consideration than ~50% wasted threads would be the coalescing of global access, as already discussed.

  2. Another possible approach would be to use the one suggested by @AngryLettuce in a comment on one of your previous questions (now deleted, perhaps). Specifically, generate a 1D grid and a 1D globally unique thread index, and then compute the necessary sub-indices (i,j,k,l) using arithmetic to convert the 1D index into the sub-indices. This has the advantage that for small problem sizes, we can directly launch larger kernels. For example, for your problem space of 406 ERI entries, we would generate (at least) 406 threads, with globally unique indices of 0..405, and convert that 1-D index into 4 sub-indices (i,j,k,l). Looking at your kernel code, this may be what you are trying to do. However, since your space is oddly-shaped, the arithmetic to convert from a linear index (or any set of rectangular indices) to an oddly-shaped space will be complicated, in my opinion.

If your real problem space is large (nao much larger than 7), then I personally would choose the first method, since it will be more readable code (IMO) and is easy to write/maintain. For small problem spaces, the second method may possibly be better, for the reasons already discussed. For either of the above methods, we would want to study the generated global memory access patterns that would result. This will depend on your data organization which you haven't shown. The first method might be easier to write to generate coalesced access, but careful use of the index generation arithmetic in the second method should (in theory) allow you to achieve whatever access patterns you wish as well.

Here's a complete code of method 1:

#include <stdio.h>
#include <iostream>

using namespace std;

__global__ void kernel(unsigned int nao)
{
    unsigned i = threadIdx.x+blockDim.x*blockIdx.x;
    unsigned j = threadIdx.y+blockDim.y*blockIdx.y;
    if (i >= nao) return;
    if (j > i) return; // modify if use of __syncthreads() after this

    unsigned int k,l,ltop;
            for(k=0;k<=i;k++)
            {
                ltop=k;
                if(i==k)
                {
                    ltop=j;
                }
                for(l=0;l<=ltop; l++)
                {
                    printf("computing the ERI (%d,%d|%d,%d) \n",i,j,k,l);
                }
            }
}


int main(){
    unsigned int nao = 7;
    dim3 block(4,4);
    dim3 grid((nao+block.x-1)/block.x, (nao+block.y-1)/block.y);
    kernel<<<grid,block>>>(nao);
    cudaDeviceSynchronize();
    int m = nao*(nao+1)/2;
    int eris_size = m*(m+1)/2;
    cout<<"The total size of the sack of ERIs is: "<<eris_size<<endl;

    return 0;
}

Note that I have used conditional return statements for ease of demonstration in this code (for the threads doing no work). However if you need to use __syncthreads() in your kernel, you will want to modify the kernel structure to avoid the conditional returns, and instead allow the "out of bounds" threads to continue through the kernel code while doing no work. This concept is covered in many other SO questions on __syncthreads() usage. Also note that kernel printout can occur in any order (just as threads can execute in any order) so I have only verified that the above method seems to work and produces the desired 406 lines of printout. A better verification approach would be to avoid use of printf and use a tally matrix instead.

For the second method, the conversion from a single linear index to a multidimensional space (with 1:1 mapping and no wasted/unused indices) is considerably complicated, as already mentioned, by the fact that your multi-dimensional space is "oddly shaped". I don't have a "great" method to convert from linear index to sub-indices. I had to map out the various problem space dimension sizes for the l loop based on the i,j,k indices:

i=0, j=0, k=0, l=0  (total 1 iteration for i=0)

l loop limit, for i=1: (total 5 iterations for i=1)
     j
    0 1
k 0 0 0
  1 0 1

l loop limit, for i=2: (total 15 iterations for i=2)
     j
    0 1 2
k 0 0 0 0
  1 1 1 1
  2 0 1 2

l loop limit, for i=3:  (total 34 iterations for i=3)
     j
    0 1 2 3
k 0 0 0 0 0
  1 1 1 1 1
  2 2 2 2 2
  3 0 1 2 3

 etc.

(Note that an alternative mapping method would be to treat the last row of each matrix above as constant equal to k, just like the other rows, and this would result in some wasted threads but considerably simplified index calculation below. We will discuss briefly the no-wasted-threads method, but ultimately present an example using the wasted threads method, because the indexing calculations are considerably simplified in both arithmetic and time complexity).

Studying the above, we see that the number of iterations mapped to a given i loop iteration is:

[((i+2)*(i+1)/2)*(i+1)] - (i+1)*i/2

or:

(((i+2)*(i+1)-i)/2)*(i+1)

The above calculation can be used to determine the indexing for i based on a linear index. The only way I know of to dissect a linear index using the above breakpoints would be via a binary search.

We would then repeat the above approach (determining the number of iterations per "level", and then using a binary search to map the linear index to a level index) to compute the next indices j and k. The remaining quantity would be the l-index.

To simplify the remainder of the discussion, I have switched to using the modified mapping version:

l loop limit, for i=1: (total 6 iterations for i=1)
     j
    0 1
k 0 0 0
  1 1 1

l loop limit, for i=2: (total 18 iterations for i=2)
     j
    0 1 2
k 0 0 0 0
  1 1 1 1
  2 2 2 2

l loop limit, for i=3:  (total 40 iterations for i=3)
     j
    0 1 2 3
k 0 0 0 0 0
  1 1 1 1 1
  2 2 2 2 2
  3 3 3 3 3

 etc.

from which we will have some wasted threads which we will handle with an if condition in our kernel.

The number of iterations per i-level above is just:

(i+2)*(i+1)*(i+1)/2

Once we have computed the i-index from the above formula given a linear index (using binary search on the summation of the above polynomial), then we can compute the next index (j) quite easily, by dividing the remaining space into i equal sized pieces. The next k index can be found using a triangular mapping method and the remaining space then becomes our l loop extent. However we must remember to condition this l index as discussed previously.

Here's a complete code for method 2:

#include <stdio.h>
#include <iostream>

// the closed-form summation of the polynomial (i+2)*(i+1)*(i+1)/2 from 0 to i
__host__ __device__ unsigned my_func(unsigned i){ return 1+(2*i+(((i*(i+1))/2)*((i*(i+1))/2)) + (i*(i+1)*((2*i)+1)*2)/3 + ((5*i*(i+1))/2))/2; }

// binary search
__host__ __device__ unsigned bsearch_functional(const unsigned key, const unsigned len_a, unsigned (*func)(unsigned)){
  unsigned lower = 0;
  unsigned upper = len_a;
  unsigned midpt;
  while (lower < upper){
    midpt = (lower + upper)>>1;
    if (func(midpt) <= key) lower = midpt +1;
    else upper = midpt;
    }
  return lower;
}

// conversion of linear index to triangular matrix (row,col) index
__host__ __device__ void linear_to_triangular(const unsigned i, const unsigned n, unsigned *trow, unsigned *tcol)
{
    int c = i;
    int row = c/(n-1);
    int col = c%(n-1);

    if (col < row) {
      col = (n-2)-col;
      row = (n-1)-row;
    }

    *tcol = col+1;
    *trow = row;
}



__global__ void kernel(unsigned int nao, unsigned int eris_size)
{
    unsigned idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < eris_size){
    // compute i-index via binary search
    unsigned int i = bsearch_functional(idx, nao, my_func);
    // compute j-index via division of the space by i;
    unsigned int j = (idx==0)?0:(idx-my_func(i-1))/((my_func(i)-my_func(i-1))/(i+1));
    unsigned k,l;
    linear_to_triangular((idx-my_func(i-1)-(j *((my_func(i)-my_func(i-1))/(i+1)))), (i+2), &k, &l);
    k = i-k;
    l = (i+1)-l;
    if (idx == 0) {k=0; l=0;}
    if (l <= ((i==k)?j:k))
      printf("computing the ERI (%d,%d|%d,%d) \n",i,j,k,l);
    }
}

int main(){
    unsigned int nao = 7;
    int m = nao*(nao+1)/2;
    int eris_size = m*(m+1)/2;
    const int nTPB = 64;
    std::cout<<"The total size of the sack of ERIs is: "<<eris_size<<std::endl;
    int mod_eris_size = my_func(nao-1);
    kernel<<<(mod_eris_size+nTPB-1)/nTPB, nTPB>>>(nao, mod_eris_size);
    cudaDeviceSynchronize();
    return 0;
}

To be clear, I don't know exactly what your task is, and I am not guaranteeing these examples are bug-free for any given usage. My intent is not to give you a black-box, but to explain some possible programming approaches. I didn't do rigorous verification but merely observed that each method produced 406 lines of ERI output, same as your original C code.

Finally, I have omitted proper CUDA error checking for brevity of presentation, but any time you are having trouble with a CUDA code, I recommend using it and running your code with cuda-memcheck.

Robert Crovella
  • Thanks a lot for your advice. I hope to be able to manage computations with 10^15. Each integer in the quartet (ij|kl) points to the indices in an array that contains the input parameters for the computation of one integral. – user3116936 Dec 31 '15 at 02:21
  • For `nao~100` the first method will probably be easier to work with and should give good results. For `nao~20` or less, you may want to try the second method. – Robert Crovella Dec 31 '15 at 02:33
  • I ran test computations using both methods and they work pretty well, but a new difficulty arises: often, the memory available on the GPU is not enough to store all the results at once. Is it possible to adapt method 1 and method 2 to work by non-overlapping chunks of indices? – user3116936 Jan 06 '16 at 20:06
  • I would think method 1 would be pretty easy to adapt. If you can write your original 4-nested-loop C code to handle chunks, then you should be able to re-write the method 1 kernel to handle chunks. Probably you should ask a new question. – Robert Crovella Jan 06 '16 at 21:04