
I am trying to implement CUDA for the following code. The first version is written serially, and the second version adds CUDA. I am confident of the results of the serial version, and I expected the CUDA version to produce the same result, but the kernel does not seem to do anything: I just get back the initial values of u and v. Given my lack of experience the bug may be obvious, but I cannot figure it out. Also, please do not recommend flattening the arrays, because that makes the indexing harder for me to understand. First version:

#include <fstream>
#include <iostream>
#include <math.h>
#include <vector>
#include <chrono>
#include <omp.h>
using namespace std;
const int M = 1024; 
const int N = 1024; 
const double A = 1;
const double B = 3;
const double Du = 5 * pow(10, -5);
const double Dv = 5 * pow(10, -6);
const int Max_Itr = 1000;
const double h = 1.0 / static_cast<double>(M - 1);
const double delta_t = 0.0025;
const double s1 = (Du * delta_t) / pow(h, 2);
const double s2 = (Dv * delta_t) / pow(h, 2);
int main() {
    double** u = new double*[M];
    double** v = new double*[M];
    for (int i = 0; i < M; i++) {
        u[i] = new double[N];
        v[i] = new double[N];
    }
    for (int j = 0; j < M; j++) {
        for (int i = 0; i < N; i++) {
            u[i][j] = 0.02;
            v[i][j] = 0.02;
        }
    }

    for (int k = 1; k < Max_Itr; k++) {
        for (int i = 1; i < N - 1; i++) {
            for (int j = 1; j < M - 1; j++) {
                u[i][j] = ((1 - (4 * s1)) * u[i][j]) + (s1 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])) +
                    (A * delta_t) + (delta_t * pow(u[i][j], 2) * v[i][j]) - (delta_t * (B + 1) * u[i][j]);

                v[i][j] = ((1 - (4 * s2)) * v[i][j]) + (s2 * (v[i + 1][j] + v[i - 1][j] + v[i][j + 1] + v[i][j - 1])) + (B * delta_t * u[i][j])
                    - (delta_t * pow(u[i][j], 2) * v[i][j]);
            }
        }
    }
    cout << "u: " << u[512][512] << " v: " << v[512][512] << endl;

    return 0;
}

Second version:

#include <fstream>
#include <iostream>
#include <math.h>
#include <vector>
using namespace std;

#define M 1024

#define N 1024


__global__ void my_kernel(double** v, double** u) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    double A = 1;
    double B = 3;
    int Max_Itr = 1000;
    double delta_t = 0.0025;
    double Du = 5 * powf(10, -5);
    double Dv = 5 * powf(10, -6);
    double h = 1.0 / (M - 1);
    double s1 = (Du * delta_t) / powf(h, 2);
    double s2 = (Dv * delta_t) / powf(h, 2);
    for (int k = 1; k < Max_Itr; k++) {
        u[i][j] = ((1 - (4 * s1)) * u[i][j]) + (s1 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])) +
            (A * delta_t) + (delta_t * pow(u[i][j], 2) * v[i][j]) - (delta_t * (B + 1) * u[i][j]);

        v[i][j] = ((1 - (4 * s2)) * v[i][j]) + (s2 * (v[i + 1][j] + v[i - 1][j] + v[i][j + 1] + v[i][j - 1])) + (B * delta_t * u[i][j])
            - (delta_t * pow(u[i][j], 2) * v[i][j]);

        __syncthreads();
    }
}

int main() {
    double** u = new double*[M];
    double** v = new double*[M];
    for (int i = 0; i < M; i++) {
        u[i] = new double[N];
        v[i] = new double[N];
    }

    dim3 blocks(32, 32);
    dim3 grids(M / 32 + 1, N / 32 + 1);
    for (int j = 0; j < M; j++) {
        for (int i = 0; i < N; i++) {
            u[i][j] = 0.02;
            v[i][j] = 0.02;
        }
    }
    double **u_d, **v_d;
    int d_size = N * M * sizeof(double);

    cudaMalloc(&u_d, d_size);
    cudaMalloc(&v_d, d_size);
    cudaMemcpy(u_d, u, d_size, cudaMemcpyHostToDevice);
    cudaMemcpy(v_d, v, d_size, cudaMemcpyHostToDevice);
    my_kernel<<<grids, blocks>>>(v_d, u_d);
    cudaDeviceSynchronize();
    cudaMemcpy(v, v_d, d_size, cudaMemcpyDeviceToHost);
    cudaMemcpy(u, u_d, d_size, cudaMemcpyDeviceToHost);
    cout << "u: " << u[512][512] << " v: " << v[512][512] << endl;

    return 0;
}

What I expect from the second version is :

u: 0.2815 v: 1.7581
  • General advice: Always use proper status checks for all CUDA API calls and all kernel launches. The host code uses an array of pointers to row vectors to store a 2D matrix. However, this is not reflected in the code that moves the data to the GPU: it consists of a single call to `cudaMemcpy()`, as if the data were stored in a contiguous 2D array. CUDA is based on C++, and there is nothing CUDA-specific about this copy. You need to copy the individual row vectors to the GPU one by one (see the sketch after this comment thread). Warning: while this will result in functional code, it will have a negative impact on performance. – njuffa Apr 04 '21 at 08:52
  • @njuffa Do you mean ` for(int i=0; i – MA19 Apr 04 '21 at 12:00
  • @MAero2020: No, njuffa means that whenever you make any CUDA API call, you must save its return value and ensure the API call has not failed. Alternatively, use wrappers which do error checking for you. See [this question](https://stackoverflow.com/q/14038589/1593077) about how to check for errors. I'm partial to [this solution](https://stackoverflow.com/a/20478474/1593077) (since I wrote it... also because it's nice). – einpoklum Apr 04 '21 at 18:08
  • @einpoklum Thanks for the post; I will read it carefully! In fact, besides error checking, as njuffa mentioned, there is something wrong with the way I am passing the arrays to `cudaMemcpy()`. – MA19 Apr 04 '21 at 18:16
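A minimal sketch of the row-by-row copy njuffa describes, assuming the host arrays `u` and `v` from the question; the staging buffer `h_rows` is an illustrative name, not something from the original code:

    // Device-side table of row pointers, plus one device allocation per row.
    double** u_d;
    cudaMalloc(&u_d, M * sizeof(double*));

    std::vector<double*> h_rows(M);  // host-side staging for the device row pointers
    for (int i = 0; i < M; i++) {
        cudaMalloc(&h_rows[i], N * sizeof(double));                               // one row on the device
        cudaMemcpy(h_rows[i], u[i], N * sizeof(double), cudaMemcpyHostToDevice);  // copy that row's data
    }
    // Finally, copy the table of device row pointers itself.
    cudaMemcpy(u_d, h_rows.data(), M * sizeof(double*), cudaMemcpyHostToDevice);

The same has to be repeated for `v`, and in reverse for the copies back to the host. As njuffa warns, the many small allocations and transfers cost performance.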

1 Answer


Your two-dimensional array - in the first version of the program - is implemented using an array of pointers, each of which points to a separately-allocated array of double values.

In your second version, you use the same pointer-to-pointer-to-double type, but you do not allocate any space for the actual data, only for the array of pointers. And you never copy any of the data to the GPU, only the pointers, which are useless to copy anyway, since they point to host-side memory.

What is most likely happening is that your kernel attempts to access memory at an invalid address, and its execution is aborted.

If you were to properly check for errors, as @njuffa noted, you would know that is what happened.
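As a minimal sketch of such a check, with a macro name `CUDA_CHECK` of my own choosing (the answers linked in the comments show more complete versions):

    #include <cstdlib>   // for std::exit

    #define CUDA_CHECK(call)                                            \
        do {                                                            \
            cudaError_t err_ = (call);                                  \
            if (err_ != cudaSuccess) {                                  \
                std::cerr << __FILE__ << ":" << __LINE__ << ": "        \
                          << cudaGetErrorString(err_) << std::endl;     \
                std::exit(EXIT_FAILURE);                                \
            }                                                           \
        } while (0)

    // Usage:
    CUDA_CHECK(cudaMalloc(&u_d, d_size));
    my_kernel<<<grids, blocks>>>(v_d, u_d);
    CUDA_CHECK(cudaGetLastError());        // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during kernel execution

With checks like these in place, the invalid accesses in the kernel would surface at the `cudaDeviceSynchronize()` check instead of failing silently.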

Now, you could avoid having to make multiple memory allocations if you used a single data area instead of separate allocations for each second-dimension 1D array; and that is true both for the first and the second version of your program. That would not quite be array flattening: you keep an array of row pointers, but they all point into one contiguous block. See an explanation of how to do this (C-language-style) on this page; a sketch follows.
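A sketch of that layout on the device side, under the question's M and N; the names `u_data`, `u_rows`, and `h_rows` are illustrative:

    double*  u_data;  // one contiguous M*N block holding the values
    double** u_rows;  // M row pointers into that block
    cudaMalloc(&u_data, M * N * sizeof(double));
    cudaMalloc(&u_rows, M * sizeof(double*));

    // Build the row-pointer table on the host (pointer arithmetic on a device
    // pointer is fine as long as it is never dereferenced on the host), then
    // copy the table over once.
    std::vector<double*> h_rows(M);
    for (int i = 0; i < M; i++) { h_rows[i] = u_data + i * N; }
    cudaMemcpy(u_rows, h_rows.data(), M * sizeof(double*), cudaMemcpyHostToDevice);

The kernel can keep its `double**` parameter and its `u[i][j]` indexing, while the data itself moves in a single `cudaMemcpy` of `M * N * sizeof(double)` bytes, provided the host side also stores its values in one contiguous block.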

Note, however, that the double-dereferencing you insist on performing in your kernel is likely to slow it down significantly.

– einpoklum