How to implement a basic C++ 2D array loop in CUDA

Question

I'm pretty new to C++ coding and I'm currently trying to use CUDA to do some GPU computations.

Basically I have a matrix A (N by N), and a couple of vectors b and x0. b and x0 have N elements as well.

This is the piece of code I'm trying to implement:

for (unsigned i=1;i<=N;i++){
    T sum = 0;
    for (unsigned j=1;j<=N;j++){
        sum += A[j][i]*x0[j];
    }
    v[i] = b[i] - sum;
}

where T is a template variable (can be assigned a double as far as I'm aware).

Would it be possible to parallelize the whole thing, and if so how would I to that? I could also use some pointers regarding how to break up the threads in general for such a problem into blocks and how to move the 2D from the host to device and back...

Please let me know if any additional information is required.

EDIT 1: After looking into CUBLAS and not getting very far, Ive decided to flatten my matrices and write the code myself. My first discovery has been that my cuda kernel doesn't like working with double type variables/arrays [can someone confirm this?].

After converting everything to a float the cuda kernel I've written looks something like this:

__global__ void cudaMatTimesVect(float *p, float  *x0, float *v, float *sum, float *toSum, float *b, int N){

int idx = blockIdx.x * blockDim.x + threadIdx.x; // thread index

if (idx < N*N){
    toSum[idx] = p[idx] * x0[blockIdx.x];
}

__syncthreads();
if( idx-(blockIdx.x * blockDim.x) == 0){
    for(int i=0; i<blockDim.x; i++){
        sum[blockIdx.x] += toSum[idx+i];
    }

v[blockIdx.x] = b[blockIdx.x] - sum[blockIdx.x];
}

I'm not sure whether the syncthreads() command will wait for all the threads to multiply before attempting to carry out the sum loop.

Here is the snippet of CPU code regarding he sum and toSum arrays initialized only on the GPU:

float *d_sum;
float *d_toSum;
cudaError_t  cudaStatus;
...
// allocate toSum memory
cudaStatus = cudaMalloc(&d_toSum, N*N*sizeof(float));
if (cudaStatus != cudaSuccess){
    std::cout << "couldnt allocate device memory for d_toSum!" << std::endl;
    cudaFree(d_toSum);
}
// allocate sum mem on device
cudaStatus = cudaMalloc(&d_sum, N*sizeof(float));
if (cudaStatus != cudaSuccess){
    std::cout << "couldnt allocate device memory for d_sum" << std::endl;
    cudaFree(d_sum);
}

... 
...
// call the kernel
cudaMatTimesVect<<<N,N>>>(d_p, d_x0, d_v, d_sum, d_toSum, d_b, N);
...


cudaFree(d_toSum);
cudaFree(d_sum);

Is this the most efficient way to do the summation?

EDIT 2: I have now changed the code to use different block indcies to run row computations. The above kernel compiles and runs, but array elements in v seem to keep getting smaller and smaller rather than restarting...

I am still interested to understand why I can't use doubles and how my code needs to change if I want to define my host arrays using < vector >.

Thanks,

Armen

It would be very straightforward to parallelize this in CUDA. You have a matrix-vector multiplication problem followed by a vector add (subtract) problem. This question appears to be a "please write my code for me" or at a minimum "please give me a tutorial on CUDA" niether of which are appropriate. You should demonstrate some basic understanding of CUDA if you intend to ask questions about it, and should show a basic attempt to solve the problem. With a little bit of study you should be able to implement this very easily using [cublas](http://docs.nvidia.com/cuda/cublas/index.html). — Robert Crovella, Aug 20 '13 at 16:33
Specifically, the vote to close is exactly the reason given by SO: "Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results." — Robert Crovella, Aug 20 '13 at 16:35
Sorry for not inputting any of my work on this which I thought would only clutter the question.. My concern is the following: I cannot decide whether it would be better to move the 2D array to the device and do computations there using some sort of indexing, or whether to pass parts of the 2D array to the kernel (which i loop) and run simple 1D array mulitplication on the device and back to the host. — rex, Aug 20 '13 at 16:51

Robert Crovella · Accepted Answer · 2013-08-22T11:48:48.180

1

You can solve this problem in cublas:

Data is copied to the GPU using cublasSetVector or cublasSetMatrix

Results are copied back using the corresponding Get functions.

The matrix-vector multiply is handled with gemv. The vector-vector subtract is handled with axpy.

A worked cublas example is available in the cuda samples.

Based on the additional comments: There's no reason to segment the data into 1D blocks for this problem. I'm recommending cublas. But if you want to see other code examples, take a look at the vector add example and the matrix multiply example.

For the doubly-subscripted matrix on the host, you should flatten that so that you can reference the data with a single (*) pointer and indexing. This is true whether you are using cublas or writing your own code.

EDIT: responding to updates in the question. The multiplication code you've posted doesn't look like matrix-vector multiplication to me, unless you've duplicated your vector N times in length so that it matches the length of the matrix (NxN). Then it would appear to be correct.

The summing code doesn't look correct, and furthermore, since it doesn't depend on idx in any way, all your threads are doing exactly the same thing. So no parallel benefit there, and we don't normally write GPU code that way.

Your vector subtract code appears to be approximately correct, except again that you seem to be doing a vector subtract across the entire length of the matrix (NxN) when the results of the matrix-vector multiply should only have produced a vector of length N.

I'd be surprised if this code could produce results that matched your serial code for the same data sets. Did you check that it produced correct results for non-trivial data sets? (Don't use data sets where every number is the same.)

edited Aug 22 '13 at 11:48

answered Aug 20 '13 at 16:41

Robert Crovella

143,785
11
213
257

Thanks Robert. I will look into cublas. – rex Aug 20 '13 at 17:12
Hi Robert, I've editted my original post and was hoping you could give me your opinion. Thanks. – rex Aug 22 '13 at 10:56
Thanks Robert, I have updated the code now to something that appears to work (see OP). The problem is that the elements in the result (v) don't seem to reinitialize. Using N=3, and running the code multiple times, the first element keeps getting smaller and smaller. The second element reduces at a slower rate and the last element stays at -1. – rex Aug 22 '13 at 12:00
Are you initializing the `sum` variable to zero before calling this kernel? You probably need to post a complete code for a better review. Do you really think this is easier than using cublas? This can be done in about half a dozen lines of code in cublas, without writing a single line of GPU device code. – Robert Crovella Aug 22 '13 at 12:05
sum is initialized in the CPU code and then cudaFree-ed at the end of the execution. I tried implementing cublas but unfortunately my knowledge of C++ and pointers etc did not permit me to do so. – rex Aug 22 '13 at 12:18
You're not initializing your `sum` variable to zero. Yes, you are allocating storage for `sum`, but your very first operation of `sum[blockIdx.x] += toSum[idx+i];` adds to a garbage value. cublas is not a C++ API, it is a C API, and requires no knowledge of C++ to use. I'm not looking for little snippets of code to review. Post a complete, compilable code. That is the expectation of SO for questions like these. Refer to SSCCE.org – Robert Crovella Aug 22 '13 at 12:37
The reason you can't use double may be a function of which GPU you are using, which you haven't indicated. Since you've posted almost none of your host code, I don't see how anyone could tell you how it should change if your data is stored in vector containers. – Robert Crovella Aug 22 '13 at 12:39
Thanks, I have posted a new response below with the full program. I've also tried to set toSum = 0.0 in each thread but this doesnt seem to have doesnt seem to have donet he trick. – rex Aug 22 '13 at 12:51

How to implement a basic C++ 2D array loop in CUDA

1 Answers1