I'm just getting into CUDA and I've run into a problem that I can't seem to figure out. Basically I'm writing a program to use Simpson's method to numerically integrate the function f(x) = x^2. My method to do so is to create an array of the bounds between each box whose area is calculated with Simpson's method, pass that array of bounds into the GPU, have each processor find the area of each bounded box and return an array of resultant areas. The areas are then added up to get the full integral. My issue comes when I try to access the array of bounds on the GPU. The array is fine and has proper values on the CPU, but after I copy it over and access it on the GPU, the values are all nonsense and I cannot find the reason. My code is below, any help would be very appreciated.
The first class is the main class which gets user input and defines the CPU arrays.
#include <iostream> //Necessary for std::cout
#include <iomanip> //Necessary for std::setprecision
#include <ctime> //Necessary for clock_t
using namespace std;
double* gpu_run(double * h_bound, double * h_resultArr, int SIZE);
int main() {
double step = 0.5, upper = 0, result = 0;
double * h_bound = NULL, * h_resultArr = NULL;
int SIZE = 0;
cout << "Enter the upper bound: ";
cin >> upper;
SIZE = upper/step + 1; //The number of bounds, which is one more than the number of integration times
h_bound = new double[SIZE];
h_resultArr = new double[SIZE-1];
for (int i = 0; i < SIZE; i++){
h_bound[i] = i*step;
}
clock_t t = clock();
h_resultArr = gpu_run(h_bound, h_resultArr, SIZE);
for (int i = 0; i < SIZE; i++){
result += h_resultArr[i];
}
t = clock() - t;
cout << "Calculation is done and took " << ((double)t)/CLOCKS_PER_SEC << " seconds." << endl;
cout << "The integral of x^2 from 0 to " << upper << ", using Simpson's Method is: " << setprecision(10) << result << endl;
return 0;
}
At this point the main method has called gpu_run which is the cuda code containing the method to do the actual calculations. When I use an upper bound of 3 for the integral (because the answer should be exactly 9), and using step sizes of 0.5, I get the bounds of 0, 0.5, 1, 1.5, 2, 2.5, and 3, as I expect. The gpu_run code is
#include <iostream>
#include <cuda_runtime.h>
#include <cstdio>
#include "gpu_run.h"
using namespace std;
double* gpu_run(double * h_bound, double * h_resultArr, int SIZE) {
double * d_bound = NULL;
cudaMalloc((void **)&d_bound, sizeof(double)*SIZE);
cudaMemcpy(d_bound, h_bound, SIZE, cudaMemcpyHostToDevice);
double * d_resultArr = NULL;
cudaMalloc((void **)&d_resultArr, sizeof(double)*(SIZE-1));
int threadsPerBlock = 256;
int blocksPerGrid =(SIZE + threadsPerBlock - 1) / threadsPerBlock;
simpsons<<<blocksPerGrid, threadsPerBlock>>>(d_bound, d_resultArr, SIZE);
cudaMemcpy(h_resultArr, d_resultArr, SIZE-1, cudaMemcpyDeviceToHost);
return h_resultArr;
}
In this program the d_ prescript is used to designate the arrays which exist on the GPU, the values stored in h_bound are still correct here. Lastly, I have the header for the CUDA method simpsons which is called
__global__ void simpsons(double * bound, double * resultArr, int SIZE){
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < SIZE-1){
double a = bound[i];
double b = bound[i+1];
printf("i: %d lower bound: %d upper bound: %d \n", i, bound[i], bound[i+1]);
resultArr[i] = ((b-a)/6)*(a*a + (a+b)*(a+b) + b*b);
}
}
For each processor I need it to access the two bounds of its respective "box" under the function and use simpson's method at those two bounds to calculate the area, however, the values in the bound array in this method are nonsense values. What am I doing wrong? I feel like it is a really stupid mistake, but I just can't seem to find it.