
I have written simple code to solve the advection equation using OpenCL, writing the results into a NetCDF file. The code compiles without any error messages, and it runs without errors. But it seems that the kernel is not doing anything. The kernel loops over the numerical scheme about 3000 times, and if it worked correctly I should be seeing something very different. Is there a way to find out whether the kernel is working properly, something like printing?

Below is the kernel

void pbndry(int in_x_siz, int in_y_siz, int in_z_siz, global float *in_arr)
{
   int i,j,k;

   // Periodic boundary
   // x-direction
   for(k=1;k<in_z_siz+1;k++)
      for(j=1;j<in_y_siz+1;j++)
         {
         in_arr[k * (in_y_siz+2) * (in_x_siz+2) + j * (in_x_siz+2) + 0] =
         in_arr[k * (in_y_siz+2) * (in_x_siz+2) + j * (in_x_siz+2) + in_x_siz];

         in_arr[k * (in_y_siz+2) * (in_x_siz+2) + j * (in_x_siz+2) + (in_x_siz+1)] =
         in_arr[k * (in_y_siz+2) * (in_x_siz+2) + j * (in_x_siz+2) + 1];
         }

   // y-direction
   for(k=1;k<in_z_siz+1;k++)
      for(i=1;i<in_x_siz+1;i++)
         {
         in_arr[k * (in_y_siz+2) * (in_x_siz+2) + 0 * (in_x_siz+2) + i] =
         in_arr[k * (in_y_siz+2) * (in_x_siz+2) + in_y_siz * (in_x_siz+2) + i];

         in_arr[k * (in_y_siz+2) * (in_x_siz+2) + (in_y_siz+1) * (in_x_siz+2) + i] =
         in_arr[k * (in_y_siz+2) * (in_x_siz+2) + 1 * (in_x_siz+2) + i];
         }

   // z-direction
   for(j=1;j<in_y_siz+1;j++)
      for(i=1;i<in_x_siz+1;i++)
         {
         in_arr[0 * (in_y_siz+2) * (in_x_siz+2) + j * (in_x_siz+2) + i] =
         in_arr[in_z_siz * (in_y_siz+2) * (in_x_siz+2) + j * (in_x_siz+2) + i];

         in_arr[(in_z_siz+1) * (in_y_siz+2) * (in_x_siz+2) + j * (in_x_siz+2) + i] =
         in_arr[1 * (in_y_siz+2) * (in_x_siz+2) + j * (in_x_siz+2) + i];
         }
}


kernel void leapfrog3d(
                        const  int x_siz,
                        const  int y_siz,
                        const  int z_siz,
                        const  int t_siz,
                        global float *in_p_tf,
                        global float *in_p_tn,
                        global float *in_p_tp,
                        const  float u_vel,
                        const  float v_vel,
                        const  float w_vel,
                        const  float c,
                        global float *in_p_rs
                      )
{
   int nx   =  x_siz;
   int ny   =  y_siz;
   int nz   =  z_siz;
   int nt   =  t_siz;
   float u  =  u_vel;
   float v  =  v_vel;
   float w  =  w_vel;
   float C  =  c    ;
   int i    =  get_global_id(0);
   int j    =  get_global_id(1);
   int k    =  get_global_id(2);

   int idx0, idx_i0, idx_i1, idx_j0, idx_j1, idx_k0, idx_k1;


   for(int t=1;t<t_siz;t++)
   {

      idx0     =  i + j * (nx+2) + k * (nx+2) * (ny+2);

      idx_i0   =  (i+1) + j * (nx+2) + k * (nx+2) * (ny+2);
      idx_j0   =  i + (j+1) * (nx+2) + k * (nx+2) * (ny+2);
      idx_k0   =  i + j * (nx+2) + (k+1) * (nx+2) * (ny+2);
      
      idx_i1   =  (i-1) + j * (nx+2) + k * (nx+2) * (ny+2);
      idx_j1   =  i + (j-1) * (nx+2) + k * (nx+2) * (ny+2);
      idx_k1   =  i + j * (nx+2) + (k-1) * (nx+2) * (ny+2);

      in_p_tf[idx0] = in_p_tp[idx0] 
                   - u_vel * C * (in_p_tn[idx_i0] - in_p_tn[idx_i1])
                   - v_vel * C * (in_p_tn[idx_j0] - in_p_tn[idx_j1])
                   - w_vel * C * (in_p_tn[idx_k0] - in_p_tn[idx_k1]);

      pbndry(nx,ny,nz,in_p_tf);

      in_p_tp = in_p_tn;
      in_p_tn = in_p_tf;
   }

   in_p_rs = in_p_tf;

}
Redshoe

1 Answer


Two useful tricks:

  1. Comment out parts of the kernel and check the results on the host side to see how far it runs through.
  2. Inside the kernel, you can also use `if(get_global_id(0)==0) printf("test\n");` to see how far it runs and where it gets stuck; see the sketch after this list. You might need to enable `printf` depending on what GPU you have: `#pragma OPENCL EXTENSION cl_amd_printf : enable` / `#pragma OPENCL EXTENSION cl_intel_printf : enable`.
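
A minimal sketch of that printf trick (the kernel name here is made up; on Nvidia, printf works without a pragma):

kernel void debug_probe(global float *in_arr)
{
   // let only one work-item print, otherwise every thread floods the output
   if(get_global_id(0)==0 && get_global_id(1)==0 && get_global_id(2)==0)
   {
      printf("kernel entered, in_arr[0] = %f\n", in_arr[0]);
   }
   // move the printf further down the kernel to locate where it gets stuck
}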

A few more remarks:

  • The issue with your kernel is probably the nested loops within it. If the kernel runs for too many iterations / too long, it may get aborted (on many systems a driver watchdog kills long-running kernels).
  • If the iterations in your two nested loops are independent of each other, parallelize them.
  • Get rid of the `for(int t=1;t<t_siz;t++)` time step loop in the kernel entirely. Instead, let the kernel do only a single time step and, on the host side, call the kernel `t_siz` times; see the sketch after this list.
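
A sketch of that restructuring on the host side, assuming the OpenCL C++ bindings (`cl2.hpp`); the function and kernel names are made up, and the argument indices follow the question's kernel signature:

#include <CL/cl2.hpp>
#include <utility> // std::swap

// Hypothetical single-time-step kernel "leapfrog3d_step": same as leapfrog3d,
// but without the for(t) loop, the pointer swaps, and the in_p_rs assignment.
// Args 0-3 (sizes) and 7-10 (velocities, C) are assumed set once beforehand.
void run_time_steps(cl::CommandQueue &queue, cl::Kernel &step,
                    cl::Buffer &buf_tp, cl::Buffer &buf_tn, cl::Buffer &buf_tf,
                    int nx, int ny, int nz, int t_siz)
{
   for(int t=1; t<t_siz; t++)
   {
      step.setArg(4, buf_tf); // in_p_tf
      step.setArg(5, buf_tn); // in_p_tn
      step.setArg(6, buf_tp); // in_p_tp
      // one work-item per grid point, no nested loops inside the kernel
      queue.enqueueNDRangeKernel(step, cl::NullRange,
                                 cl::NDRange(nx, ny, nz), cl::NullRange);
      // the end of the kernel is the global synchronization point;
      // rotate the three time levels on the host instead of in the kernel
      std::swap(buf_tp, buf_tn);
      std::swap(buf_tn, buf_tf);
   }
   queue.finish(); // wait for the last step; no device-to-host copy needed here
}
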
ProjectPhysX
  • Thank you for your advice. 1. I am currently using an NVIDIA GPU. I could not find any pragma for NVIDIA GPUs, and just using `printf()` inside the kernel gives me a `CL_OUT_OF_RESOURCES` error. What should I do in this case? 2. When you referred to "two nested loops", did you mean the `void pbndry` function in the kernel? The elements are independent of each other, so I think I can parallelize them. In this case is it possible to use something like `int i = get_local_id(0)`? 3. Wouldn't calling the kernel `t_siz` times be really inefficient (I am pretty new to OpenCL, so I really don't know)? – Redshoe Aug 30 '21 at 15:30
  • 1. For Nvidia, `printf` is enabled by default; you don't need a pragma. The `CL_OUT_OF_RESOURCES` error originates in your kernel running for too long because of the nested loops. Parallelize them. 2. Exactly. One thread should only compute a single grid point, see here: https://stackoverflow.com/a/61652001/9178992 Use `int n = get_global_id(0);` and disassemble this linear index into the 3 spatial coordinates by modulo and integer division (for 2D: `int x=n%size_x, y=n/size_x;`). 3. Not if each kernel call for 1 time step saturates the GPU (because each of the GPU cores computes a single grid point). – ProjectPhysX Aug 30 '21 at 17:01
  • In numerical n-body/grid integration like you have, it is also not possible to do multiple time steps in a single kernel. You need global synchronization after every time step, and this is only achieved when a kernel ends. Think of it this way: The GPU computes each grid point in parallel (get rid of the nested loops for this), some earlier than others and in random order. You can only start the next time step if all grid points in the previous step are done. – ProjectPhysX Aug 30 '21 at 17:05
  • After reading your replies, I am thinking that I should re-write my entire code to accommodate the local size, group size, etc. I am having trouble understanding "disassembling the linear index into the 3 spatial coordinates". Could you explain what this is for? I have not specified a local work size in the main code. Should I do this? If I am just using `get_global_id()` in `void pbndry()`, would that be okay? – Redshoe Aug 31 '21 at 00:24
  • Another question: if I were to execute the kernel `nt` times, should I re-write the data in the `cl_mem` objects and repeat the setup for each loop iteration? – Redshoe Aug 31 '21 at 01:52
  • Rewriting is the best option. In OpenCL you can only have 1D arrays. To store a 3D grid in 1D, you linearize it: `n=x+(y+z*Ly)*Lx`. This way you go from 3D coordinates to the 1D memory address. The other way around is `x=n%(Lx*Ly)%Lx, y=n%(Lx*Ly)/Lx, z=n/(Lx*Ly)` (see the sketch below). If you don't need communication between local threads (as an optimization), just set the workgroup size to 32 or a multiple of 32; it doesn't really matter. When you execute the kernel multiple times, don't erase the GPU memory in between with `enqueueWriteBuffer`, and don't copy it to the host if you don't need to (it's very slow). – ProjectPhysX Aug 31 '21 at 07:00
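
A sketch of those index conversions in OpenCL C (all names hypothetical; `Lx`, `Ly`, `Lz` are the grid dimensions):

// 3D coordinates -> 1D memory address
int linear_index(int x, int y, int z, int Lx, int Ly)
{
   return x + (y + z*Ly)*Lx;
}

kernel void example_step(global float *arr, const int Lx, const int Ly, const int Lz)
{
   const int n = get_global_id(0); // one work-item per grid point
   if(n>=Lx*Ly*Lz) return;        // guard in case the global range is padded
   // 1D index -> 3D coordinates, by modulo and integer division
   const int x = n%(Lx*Ly)%Lx;
   const int y = n%(Lx*Ly)/Lx;
   const int z = n/(Lx*Ly);
   // neighbours are then e.g. arr[linear_index(x+1,y,z,Lx,Ly)]
}
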
  • Thanks. I think I can change 3D to 1D. Another question: does a sub-function in OpenCL work the same as the main kernel function? Can I just use the global ID like in the main function, and specify the local size in the main part of the code? – Redshoe Aug 31 '21 at 12:07
  • Functions basically work the same as kernels, with the difference that kernel parameters are global by default and function parameters are private by default. Functions are usually inlined, so the compiler copies the content of the function into the kernel where it is called. You can use `get_global_id(0)` from anywhere. You specify the local size in `enqueueNDRangeKernel`, where you hand it over as the local range; see the sketch below. – ProjectPhysX Aug 31 '21 at 15:29
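
A sketch of that (hypothetical names): the helper takes an explicitly `global` pointer and is called from the kernel like a normal C function:

// function parameters are private by default, so the address space of the
// buffer pointer has to be spelled out here
void scale_point(global float *arr, int n, float factor)
{
   arr[n] *= factor; // typically inlined into the calling kernel
}

kernel void scale(global float *arr, const float factor)
{
   const int n = get_global_id(0); // get_global_id works inside functions too
   scale_point(arr, n, factor);
}
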
  • Okay. Thanks. I'll try that. – Redshoe Sep 02 '21 at 02:22