0

I'm just getting into CUDA and I've run into a problem that I can't seem to figure out. Basically I'm writing a program to use Simpson's method to numerically integrate the function f(x) = x^2. My method to do so is to create an array of the bounds between each box whose area is calculated with Simpson's method, pass that array of bounds into the GPU, have each processor find the area of each bounded box and return an array of resultant areas. The areas are then added up to get the full integral. My issue comes when I try to access the array of bounds on the GPU. The array is fine and has proper values on the CPU, but after I copy it over and access it on the GPU, the values are all nonsense and I cannot find the reason. My code is below, any help would be very appreciated.

The first class is the main class which gets user input and defines the CPU arrays.

#include <iostream> //Necessary for std::cout
#include <iomanip>  //Necessary for std::setprecision
#include <ctime>    //Necessary for clock_t

using namespace std;

double* gpu_run(double * h_bound, double * h_resultArr, int SIZE);

// Host driver: reads the upper integration bound, builds the array of
// subinterval bounds (step 0.5), hands the work to the GPU via gpu_run,
// and sums the per-interval areas into the final integral of x^2.
int main() {

    double step = 0.5, upper = 0, result = 0;
    double * h_bound = NULL, * h_resultArr = NULL;
    int SIZE = 0;

    cout << "Enter the upper bound: ";
    cin >> upper;

    SIZE = upper/step + 1; //The number of bounds, which is one more than the number of integration times

    h_bound = new double[SIZE];          // SIZE interval bounds: 0, step, 2*step, ...
    h_resultArr = new double[SIZE-1];    // one area per interval -> only SIZE-1 slots

    for (int i = 0; i < SIZE; i++){
        h_bound[i] = i*step;
    }

    clock_t t = clock();

    h_resultArr = gpu_run(h_bound, h_resultArr, SIZE);

    // NOTE(review): h_resultArr was allocated with SIZE-1 elements, but this
    // loop runs i up to SIZE-1 inclusive, reading one double past the end of
    // the allocation (undefined behavior). The bound should be SIZE-1.
    for (int i = 0; i < SIZE; i++){
        result += h_resultArr[i];
    }

    t = clock() - t;

    cout << "Calculation is done and took " << ((double)t)/CLOCKS_PER_SEC << " seconds." << endl;

    cout << "The integral of x^2 from 0 to " << upper << ", using Simpson's Method is: " << setprecision(10) << result << endl;

    // NOTE(review): h_bound and h_resultArr are never delete[]d -- leaked at exit.
    return 0;
}

At this point the main method has called gpu_run which is the cuda code containing the method to do the actual calculations. When I use an upper bound of 3 for the integral (because the answer should be exactly 9), and using step sizes of 0.5, I get the bounds of 0, 0.5, 1, 1.5, 2, 2.5, and 3, as I expect. The gpu_run code is

#include <iostream>
#include <cuda_runtime.h>
#include <cstdio>
#include "gpu_run.h"

using namespace std;

// Copies the bounds to the device, launches the simpsons kernel
// (one thread per subinterval), and copies the SIZE-1 areas back
// into h_resultArr, which is also returned.
double* gpu_run(double * h_bound, double * h_resultArr, int SIZE) {

    double * d_bound = NULL;
    cudaMalloc((void **)&d_bound, sizeof(double)*SIZE);

    // BUG: the third argument of cudaMemcpy is a BYTE count, but only SIZE
    // bytes are copied here instead of SIZE*sizeof(double). Most of d_bound
    // is left uninitialized, which is why the kernel sees garbage values.
    cudaMemcpy(d_bound, h_bound, SIZE, cudaMemcpyHostToDevice);

    double * d_resultArr = NULL;
    cudaMalloc((void **)&d_resultArr, sizeof(double)*(SIZE-1));

    int threadsPerBlock = 256;
    int blocksPerGrid =(SIZE + threadsPerBlock - 1) / threadsPerBlock;  // ceil-div so the tail is covered
    simpsons<<<blocksPerGrid, threadsPerBlock>>>(d_bound, d_resultArr, SIZE);
    // NOTE(review): no cudaGetLastError() after the launch and no error
    // checking on any API call above -- failures here are silent.

    // BUG: same byte-count mistake on the way back -- should be
    // (SIZE-1)*sizeof(double).
    cudaMemcpy(h_resultArr, d_resultArr, SIZE-1, cudaMemcpyDeviceToHost);

    // NOTE(review): d_bound and d_resultArr are never cudaFree'd (device leak).
    return h_resultArr;

}

In this program the d_ prescript is used to designate the arrays which exist on the GPU, the values stored in h_bound are still correct here. Lastly, I have the header for the CUDA method simpsons which is called

// Kernel: thread i applies Simpson's rule to f(x)=x^2 on the interval
// [bound[i], bound[i+1]] and writes the area to resultArr[i].
// Expects SIZE bounds, i.e. SIZE-1 intervals; surplus threads exit via the guard.
__global__ void simpsons(double * bound, double * resultArr, int SIZE){

    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < SIZE-1){
        double a = bound[i];
        double b = bound[i+1];

        // BUG: %d with double arguments is undefined behavior in printf --
        // the bounds print as nonsense even when the data is correct.
        // Use %f (or %lf) for bound[i] and bound[i+1].
        printf("i: %d lower bound: %d  upper bound: %d \n", i, bound[i], bound[i+1]);

        // Simpson's rule (h/6)*(f(a) + 4*f((a+b)/2) + f(b)) with f=x^2;
        // the middle term 4*((a+b)/2)^2 simplifies to (a+b)^2.
        resultArr[i] = ((b-a)/6)*(a*a + (a+b)*(a+b) + b*b); 
    }

}

For each processor I need it to access the two bounds of its respective "box" under the function and use simpson's method at those two bounds to calculate the area, however, the values in the bound array in this method are nonsense values. What am I doing wrong? I feel like it is a really stupid mistake, but I just can't seem to find it.

zephyr
  • 2,182
  • 3
  • 29
  • 51
  • You should add some error checking. It could be that the CUDA API failed to copy over the data but you would never know. Check here : http://stackoverflow.com/questions/14038589/ – deathly809 Jul 21 '14 at 16:30
  • Both cudaMemcpy arguments are copying what I presume is an incorrect size to and from the device... – talonmies Jul 21 '14 at 16:39
  • I do see that I should have used sizeof(double)*SIZE in the cudaMemcpy method, but even with this, the problem persists. And why does my question have -1. Did I not write my question properly? – zephyr Jul 21 '14 at 17:26
  • @deathly809 I have added an error check similar to those described in your link, but my program runs with no errors. It executes just fine, it just doesn't do what it is supposed to do. – zephyr Jul 21 '14 at 17:30

1 Answer

1

When I change your code to fix the issue pointed out by @talonmies, and modify the printf format specifier to the correct one for printing out float/double quantities, I get what appear to be correct results, based on your description:

  1. modify the cudaMemcpy operations to include sizeof(double)
  2. replace %d with %f when printing out floating point quantities
  3. I also added a cast to eliminate a warning about converting double to int

Fixed code and results:

$ cat t495.cu
#include <iostream> //Necessary for std::cout
#include <iomanip>  //Necessary for std::setprecision
#include <ctime>    //Necessary for clock_t

using namespace std;

// Kernel: one thread per subinterval. Thread idx integrates f(x)=x^2 over
// [bound[idx], bound[idx+1]] with Simpson's rule and stores the area in
// resultArr[idx]. Takes SIZE bounds (so SIZE-1 intervals); threads in the
// grid tail beyond the last interval return immediately.
__global__ void simpsons(double * bound, double * resultArr, int SIZE){

    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    // Bounds guard: only SIZE-1 intervals exist.
    if (idx >= SIZE-1) return;

    double lo = bound[idx];
    double hi = bound[idx+1];

    // Debug trace of the interval this thread handles (device printf).
    printf("i: %d lower bound: %lf  upper bound: %lf \n", idx, bound[idx], bound[idx+1]);

    // Simpson's rule (h/6)*(f(lo) + 4*f(mid) + f(hi)) for f=x^2;
    // the middle term 4*((lo+hi)/2)^2 reduces to (lo+hi)^2.
    resultArr[idx] = ((hi-lo)/6)*(lo*lo + (lo+hi)*(lo+hi) + hi*hi);

}

// Launches the simpsons kernel over the SIZE-1 subintervals described by
// h_bound and fills h_resultArr with the per-interval areas.
//
// h_bound:     host array of SIZE interval bounds (input).
// h_resultArr: host array of SIZE-1 areas (output, written here).
// SIZE:        number of bounds; expected >= 2.
// Returns h_resultArr for caller convenience.
double* gpu_run(double * h_bound, double * h_resultArr, int SIZE) {

    double * d_bound = NULL;
    double * d_resultArr = NULL;
    cudaError_t err = cudaSuccess;

    cudaMalloc((void **)&d_bound, sizeof(double)*SIZE);
    cudaMalloc((void **)&d_resultArr, sizeof(double)*(SIZE-1));

    // Byte count must include sizeof(double) -- omitting it was the
    // original bug that produced garbage bounds on the device.
    cudaMemcpy(d_bound, h_bound, SIZE*sizeof(double), cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (SIZE + threadsPerBlock - 1) / threadsPerBlock;  // ceil-div
    simpsons<<<blocksPerGrid, threadsPerBlock>>>(d_bound, d_resultArr, SIZE);

    // Kernel launches don't return a status directly: configuration errors
    // surface through cudaGetLastError(), execution errors at the next
    // synchronizing call (the blocking cudaMemcpy below).
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        cerr << "simpsons launch failed: " << cudaGetErrorString(err) << endl;
    }

    err = cudaMemcpy(h_resultArr, d_resultArr, (SIZE-1)*sizeof(double), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        cerr << "device-to-host copy failed: " << cudaGetErrorString(err) << endl;
    }

    // Release device buffers -- the previous version leaked both allocations.
    cudaFree(d_bound);
    cudaFree(d_resultArr);

    return h_resultArr;

}

// Host driver: reads the upper integration bound, builds the array of
// subinterval bounds (fixed step 0.5), runs Simpson's rule on the GPU via
// gpu_run, and sums the SIZE-1 per-interval areas into the final integral.
int main() {

    double step = 0.5, upper = 0, result = 0;
    double * h_bound = NULL, * h_resultArr = NULL;
    int SIZE = 0;

    cout << "Enter the upper bound: ";
    cin >> upper;

    SIZE = (int)(upper/step + 1); //The number of bounds, which is one more than the number of integration times

    h_bound = new double[SIZE];          // SIZE bounds: 0, step, 2*step, ...
    h_resultArr = new double[SIZE-1];    // one area per interval

    for (int i = 0; i < SIZE; i++){
        h_bound[i] = i*step;
    }

    clock_t t = clock();

    h_resultArr = gpu_run(h_bound, h_resultArr, SIZE);

    // h_resultArr holds exactly SIZE-1 areas; the previous loop ran to SIZE
    // and read one double past the end of the allocation (undefined behavior).
    for (int i = 0; i < SIZE-1; i++){
        result += h_resultArr[i];
    }

    t = clock() - t;

    cout << "Calculation is done and took " << ((double)t)/CLOCKS_PER_SEC << " seconds." << endl;

    cout << "The integral of x^2 from 0 to " << upper << ", using Simpson's Method is: " << setprecision(10) << result << endl;

    // Release host buffers (previously leaked).
    delete [] h_bound;
    delete [] h_resultArr;

    return 0;
}
$ nvcc -arch=sm_20 -o t495 t495.cu
$ cuda-memcheck ./t495
========= CUDA-MEMCHECK
Enter the upper bound: 3
i: 0 lower bound: 0.000000  upper bound: 0.500000
i: 1 lower bound: 0.500000  upper bound: 1.000000
i: 2 lower bound: 1.000000  upper bound: 1.500000
i: 3 lower bound: 1.500000  upper bound: 2.000000
i: 4 lower bound: 2.000000  upper bound: 2.500000
i: 5 lower bound: 2.500000  upper bound: 3.000000
Calculation is done and took 1.08 seconds.
The integral of x^2 from 0 to 3, using Simpson's Method is: 9
========= ERROR SUMMARY: 0 errors
$
Robert Crovella
  • 143,785
  • 11
  • 213
  • 257