I'm pretty new to C++ coding and I'm currently trying to use CUDA to do some GPU computations.
Basically I have an N-by-N matrix A and two vectors, b and x0, each with N elements.
This is the piece of code I'm trying to implement:
for (unsigned i = 0; i < N; i++) {
    T sum = 0;
    for (unsigned j = 0; j < N; j++) {
        sum += A[j][i] * x0[j];  // note the transposed access: v = b - A^T * x0
    }
    v[i] = b[i] - sum;
}
where T is a template type parameter (it can be double, as far as I'm aware).
Would it be possible to parallelize the whole thing, and if so, how would I do that? I could also use some pointers on how to break up the threads into blocks for a problem like this in general, and on how to move the 2D array from the host to the device and back...
Please let me know if any additional information is required.
EDIT 1: After looking into CUBLAS and not getting very far, I've decided to flatten my matrices and write the code myself. My first discovery has been that my CUDA kernel doesn't like working with double-type variables/arrays [can someone confirm this?].
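For context, and as something worth double-checking: my reading so far is that double precision requires a device of compute capability 1.3 or higher, and that older nvcc versions silently demote double to float unless a suitable target architecture is given at compile time, e.g.:

```
nvcc -arch=sm_13 kernel.cu -o kernel
```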
After converting everything to a float the cuda kernel I've written looks something like this:
__global__ void cudaMatTimesVect(float *p, float *x0, float *v, float *sum, float *toSum, float *b, int N){
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (idx < N*N){
        toSum[idx] = p[idx] * x0[blockIdx.x];
    }
    __syncthreads();
    // the first thread of each block sums up that block's products
    if (idx - (blockIdx.x * blockDim.x) == 0){
        for (int i = 0; i < blockDim.x; i++){
            sum[blockIdx.x] += toSum[idx + i];
        }
        v[blockIdx.x] = b[blockIdx.x] - sum[blockIdx.x];
    }
}
I'm not sure whether the __syncthreads() call will wait for all the threads to finish their multiplications before the sum loop starts.
Here is the snippet of CPU code for the sum and toSum arrays, which exist only on the GPU:
float *d_sum;
float *d_toSum;
cudaError_t cudaStatus;
...
// allocate toSum memory on device
cudaStatus = cudaMalloc(&d_toSum, N*N*sizeof(float));
if (cudaStatus != cudaSuccess){
    std::cout << "couldn't allocate device memory for d_toSum!" << std::endl;
    cudaFree(d_toSum);
}

// allocate sum memory on device
cudaStatus = cudaMalloc(&d_sum, N*sizeof(float));
if (cudaStatus != cudaSuccess){
    std::cout << "couldn't allocate device memory for d_sum!" << std::endl;
    cudaFree(d_sum);
}
...
...
// call the kernel
cudaMatTimesVect<<<N,N>>>(d_p, d_x0, d_v, d_sum, d_toSum, d_b, N);
...
cudaFree(d_toSum);
cudaFree(d_sum);
Is this the most efficient way to do the summation?
EDIT 2: I have now changed the code to use different block indices to run the row computations. The kernel above compiles and runs, but the elements of v seem to keep getting smaller and smaller with each run rather than resetting...
I am still interested in understanding why I can't use doubles, and how my code would need to change if I wanted to define my host arrays using std::vector.
Thanks,
Armen