I am new to CUDA and need help understanding some things. I need help parallelizing these two for loops. Specifically, how to set up `dimBlock` and `dimGrid` to make this run faster. I know this looks like the vector add example in the SDK, but that example is only for square matrices, and when I try to modify that code for my 128 x 1024 matrix it doesn't work properly.

__global__ void mAdd(float* A, float* B, float* C)
{
    for(int i = 0; i < 128; i++)
    {
        for(int j = 0; j < 1024; j++)
        {
            C[i * 1024 + j] = A[i * 1024 + j] + B[i * 1024 + j];
        }
    }
}

This code is part of a larger loop and is the simplest portion of the code, so I decided to try to parallelize this and learn CUDA at the same time. I have read the guides but still do not understand how to get the proper number of grids/blocks/threads going and use them effectively.

Kolay.Ne
user656210
    In [pycuda](http://mathema.tician.de/software/pycuda) it is just `C[i] = A[i] + B[i]` [demo.py](https://gist.github.com/916835) – jfs Apr 13 '11 at 02:06

1 Answer

As you have written it, that kernel is completely serial. Every thread launched to execute it is going to perform the same work.

The main idea behind CUDA (and OpenCL and other similar "single program, multiple data" type programming models) is that you take a "data parallel" operation - so one where the same, largely independent, operation must be performed many times - and write a kernel which performs that operation. A large number of (semi)autonomous threads are then launched to perform that operation across the input data set.

In your array addition example, the data parallel operation is

C[k] = A[k] + B[k];

for all k between 0 and 128 * 1024. Each addition operation is completely independent and has no ordering requirements, and therefore can be performed by a different thread. To express this in CUDA, one might write the kernel like this:

__global__ void mAdd(float* A, float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;

    if (k < n)
        C[k] = A[k] + B[k];
}

[disclaimer: code written in browser, not tested, use at own risk]

Here, the inner and outer loop from the serial code are replaced by one CUDA thread per operation, and I have added a limit check in the code so that, in cases where more threads are launched than there are operations, no buffer overflow can occur. If the kernel is then launched like this:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work

mAdd<<<nblocks, blocksize>>>(A, B, C, n);

Then 256 blocks, each containing 512 threads, will be launched onto the GPU hardware to perform the array addition operation in parallel. Note that if the input data size was not expressible as a nice round multiple of the block size, the number of blocks would need to be rounded up to cover the full input data set.

All of the above is a hugely simplified overview of the CUDA paradigm for a very trivial operation, but perhaps it gives enough insight for you to continue yourself. CUDA is rather mature these days and there is a lot of good, free educational material floating around the web you can probably use to further illuminate many of the aspects of the programming model I have glossed over in this answer.

talonmies
    int k = threadIdx.x + gridDim.x * blockDim.x; This is surely incorrect? gridDim.x * blockDim.x in your example will always be 256*512. Should be int k = threadIdx.x + blockIdx.x * blockDim.x; I tried to edit it but was rejected. – Ozone May 01 '13 at 03:02
  • 1
    Warning to the skim-reader: nblocks = ceil(n / nthreads); // if your data doesn't divide perfectly. – ofer.sheffer Apr 04 '17 at 11:01
  • @ofer.sheffer: I did write "Note that if the input data size was not expressible as a nice round multiple of the block size, the number of blocks would need to be rounded up to cover the full input data set.". Is that not clear enough? – talonmies Apr 04 '17 at 11:02
  • 1
    @talonmies, Your answer is very nice and I upvoted it. On the other hand, as I was reading it I was thinking "he's missed the +1" in case the data does not divide evenly... then I went on to read a few other things and came back here to finish up the reading and I noticed that you wrote it in. As a skim reader who usually just looks at the code first and considers reading every word later -- I figure my warning would help my future self. – ofer.sheffer Apr 04 '17 at 12:50
  • How do I know `nthreads`? Is not `blocksize` the number of threads? – smcs Nov 07 '19 at 16:45