
My 3D Laplacian solver works. I measure a throughput of 350 Gflop/s, and I'm trying to improve it by using twice as many blocks. However, the performance stays at 350 Gflop/s:

 #include <iostream>
 #include <sys/time.h>
 #include <cuda.h>
 #include <ctime>
 #include"res3dcb.cuh"
 #include <math.h>
 using namespace std;

 // Constants.
 const int blocksize=32;
 const int N=128;
 const int size=(N+2)*(N+2)*(N+2)*sizeof(float);

 // Let's start the main program.
 int main(void) {

 // Variable declarations.
 float time1,time2,time3;
 float *x_d, *y_d; 
 float *x,*y; 
 float gflops;
 float NumOps;
 int power=4; // You can change power as you prefer (but keep 2^x)

 // Init x and y. Note: "size" is a byte count, so allocate by element count.
 x = new float[(N+2)*(N+2)*(N+2)];
 y = new float[(N+2)*(N+2)*(N+2)];

 for (int k=1;k<N+1;k++)
    for (int i=1;i<N+1;i++) 
        for (int j=1;j<N+1;j++) { 
            x[k*(N+2)*(N+2)+i*(N+2)+j]=cos(i+j+k);
        }

 // Shadow cases.
 for (int k=1;k<N+1;k++) {
    for (int i=1;i<N+1;i++) { 
      x[k*(N+2)*(N+2)+i*(N+2)]=x[k*(N+2)*(N+2)+i*(N+2)+1]; 
      x[k*(N+2)*(N+2)+i*(N+2)+N+1]=x[k*(N+2)*(N+2)+i*(N+2)+N];}

    for (int j=0;j<N+2;j++) { 
      x[k*(N+2)*(N+2)+j]=x[k*(N+2)*(N+2)+(N+2)+j]; 
      x[k*(N+2)*(N+2)+(N+1)*(N+2)+j]=x[k*(N+2)*(N+2)+N*(N+2)+j];}
 }

 for (int i=0;i<N+2;i++) 
    for (int j=0;j<N+2;j++) {
        x[(N+2)*i+j]=x[(N+2)*(N+2)+(N+2)*i+j];
        x[(N+1)*(N+2)*(N+2)+(N+2)*i+j]=x[(N+2)*(N+2)*N+(N+2)*i+j];
    }

 // Display of initial matrix.
 int id_stage=-2;
 while (id_stage!=-1) {
     cout<<"Which stage do you want to display? (-1 if you don't want to display another one)"<<endl;
     cin>>id_stage;
     cout<<endl;

     if (id_stage != -1) {
    cout<<"Stage "<<id_stage<<" of the cube:"<<endl;
    for (int i=0;i<N+2;i++) {
        cout<<"| ";
        for (int j=0;j<N+2;j++) {cout<<x[id_stage*(N+2)*(N+2)+i*(N+2)+j]<<" ";}
        cout<<"|"<<endl;
        }
         cout<<endl;
     }
 }

 // CPU to GPU.
 cudaMalloc( (void**) & x_d, size);
 cudaMalloc( (void**) & y_d, size);

 cudaMemcpy(x_d, x, size, cudaMemcpyHostToDevice) ;
 cudaMemcpy(y_d, y, size, cudaMemcpyHostToDevice) ;

 // Solver parameters.
 dim3 dimGrid(power*N/blocksize, power*N/blocksize);
 dim3 dimBlock(blocksize, blocksize);

 // Solver loop.
 time1=clock();

 res2d<<<dimGrid, dimBlock>>>(x_d, y_d, N, power); 

 time2=clock();
 time3=(time2-time1)/CLOCKS_PER_SEC;

 // Power calculation.
 NumOps=(1.0e-9)*N*N*N*7;
 gflops = ( NumOps / (time3));

 // GPU to CPU.
 cudaMemcpy(y, y_d, size, cudaMemcpyDeviceToHost);
 cudaFree(x_d);
 cudaFree(y_d);

 // Display of final matrix.
 id_stage=-2;
 while (id_stage!=-1) {
    cout<<"Which stage do you want to display? (-1 if you don't want to display another one)"<<endl;
    cin>>id_stage;
    cout<<endl;

     if (id_stage != -1) {
        cout<<"Stage "<<id_stage<<" of the cube:"<<endl;
        for (int i=0;i<N+2;i++) {
            cout<<"| ";
            for (int j=0;j<N+2;j++) {cout<<y[id_stage*(N+2)*(N+2)+i*(N+2)+j]<<" ";}
            cout<<"|"<<endl;
         }
        cout<<endl;
     }
 }

 cout<<"Time : "<<time3<<endl;
 cout<<"Gflops/s : "<<gflops<<endl;

 }

Where:

__global__ void res2d(volatile float* x, float* y, int N, int power) 
{
    int i = threadIdx.x + blockIdx.x*(blockDim.x);
    int j = threadIdx.y + blockIdx.y*(blockDim.y);
    int id,jd;

    #pragma unroll // Now let's reduce the number of operations per block
    for (int incr=1; incr<power+1; incr++) {
        if (i>(incr-1)*N && i<incr*N && j>(incr-1)*N && j<incr*N) {
            #pragma unroll
            for (int k=(incr-1)*(N/power) ; k<incr*N/power ; k++) {
                id=i-(incr-1)*N;
                jd=j-(incr-1)*N;
                y[(N+2)*(N+2)*(k+1)+(N+2)*(id+1)+jd+1] = x[(N+2)*(N+2)*(k+1)+(N+2)*(id+2)+jd+1] 
                                                       + x[(N+2)*(N+2)*(k+1)+(N+2)*id+jd+1] 
                                                       + x[(N+2)*(N+2)*(k+1)+(N+2)*(id+1)+jd+2] 
                                                       + x[(N+2)*(N+2)*(k+1)+(N+2)*(id+1)+jd] 
                                                       + x[(N+2)*(N+2)*(k+2)+(N+2)*(id+1)+jd+1] 
                                                       + x[(N+2)*(N+2)*k+(N+2)*(id+1)+jd+1] 
                                                       - 6*x[(N+2)*(N+2)*(k+1)+(N+2)*(id+1)+jd+1];
            }   
        }
    }
}

With parameters:

dimGrid(power * N/blocksize, power * N/blocksize) & dimBlock(blocksize, blocksize)

Questions:

  1. If power = 2, 4, or 8, the number of operations per block is divided by 2, 4, or 8. But it's not faster. Why?

  2. Is it useless to reduce the number of operations per block?

Thanks in advance for your help.

Ziezi
Loïc Madiès
    FLOPs are probably not the best measure to use in this case. Instead you should probably use memory throughput (GB/s). You should be moving 8 bytes per grid point at 7 FLOPs per grid point. So, 350 GFLOPs works out to be 400 GB/s... which is faster than any single GPU can do. We're going to need a full reproducer for this as I don't believe your numbers. – Jez Sep 23 '15 at 14:17
  • I'm only measuring the FLOPs of the loop. I don't consider the time of Memcpy. – Loïc Madiès Sep 23 '15 at 14:30
  • Is it still impossible? – Loïc Madiès Sep 23 '15 at 14:32
  • Please post a [MCVE](http://stackoverflow.com/help/mcve). – Jez Sep 23 '15 at 21:01
  • In fact I initially asked whether it's useful to increase the number of blocks and decrease the number of operations per block. So I don't really understand why it's not an MCVE. Or maybe I missed something, sorry I'm new here ;-) – Loïc Madiès Sep 24 '15 at 08:33
  • I've put up the full code if you need it – Loïc Madiès Sep 24 '15 at 09:10
  • Is this of any help: http://stackoverflow.com/questions/8389648/how-do-i-achieve-the-theoretical-maximum-of-4-flops-per-cycle – Ziezi Sep 24 '15 at 10:18
  • CUDA kernel launches are asynchronous, so you're not actually timing the kernel, you're timing how long it takes to launch the kernel. For timing in CUDA, check out http://stackoverflow.com/a/7881320/3760791 – Jez Sep 24 '15 at 11:13
  • Thank you a lot Jez! Indeed the Gflops are considerably lower than before. With this new measurement I conclude that the program runs faster with power=4 than with power=1. – Loïc Madiès Sep 24 '15 at 11:57
  • Does somebody want to provide an answer? – Robert Crovella Sep 25 '15 at 21:49
  • Yes, to measure the real gflops we need to put "cudaDeviceSynchronize();" after calling the kernel. – Loïc Madiès Sep 28 '15 at 11:37
  • @LoïcMadiès: I've added a summary of all the comments into a community wiki answer. It would be good if you could accept it so that this question drops off the unanswered list for the CUDA tag. It will make the answer easier to find in search for the next person who comes along with a similar problem. – talonmies Oct 17 '15 at 08:15

1 Answer


CUDA kernel launches are asynchronous. When you do this:

 // Solver loop.
 time1=clock();

 res2d<<<dimGrid, dimBlock>>>(x_d, y_d, N, power); 

 time2=clock();
 time3=(time2-time1)/CLOCKS_PER_SEC;

the timer is only capturing the API launch latency, not the actual execution time of the code. This is why changing the amount of work done in the kernel is apparently having no effect on performance -- your timing method is incorrect.

Do something like this instead:

 // Solver loop.
 time1=clock();

 res2d<<<dimGrid, dimBlock>>>(x_d, y_d, N, power); 
 cudaDeviceSynchronize();

 time2=clock();
 time3=(time2-time1)/CLOCKS_PER_SEC;

This inserts a blocking call which will ensure that the kernel is finished execution before the time is measured.
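Alternatively, CUDA events (as linked in the comments) measure the kernel's execution time on the GPU itself, avoiding host-side `clock()` resolution issues entirely. A minimal sketch, reusing the question's kernel and launch configuration (error checking omitted for brevity):

```cuda
// Time the kernel with CUDA events instead of clock().
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);   // enqueued on the same stream as the kernel
res2d<<<dimGrid, dimBlock>>>(x_d, y_d, N, power);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // block until the kernel and stop event complete
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

float seconds = ms / 1000.0f;
float gflops  = (1.0e-9f * N * N * N * 7) / seconds;

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded in the same stream as the kernel, the elapsed time covers only the kernel's device-side execution, not the launch latency that the original `clock()` pair was measuring.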

[This answer added as a community wiki entry to get the question off the unanswered queue].

talonmies