
I want to implement a simple matrix multiplication in CUDA. The dimensions of the matrices are determined at runtime, and I also want to use shared memory in order to gain a performance boost. I have implemented such a function, but every time I run it, I get this error:

mulKernel launch failed: an illegal memory access was encountered

I am also not sure if I can use malloc to allocate shared memory. However, if I want to use something like this

__shared__ float matrM_sm[tile_width][tile_width];

the compiler complains that tile_width has to be known at compile time...
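To illustrate the difference between the two declaration styles, here is a sketch (`TILE_WIDTH` is a placeholder name, not from my code):

```cuda
#define TILE_WIDTH 16  // must be a compile-time constant

// Static shared memory: the array size is fixed at compile time.
__global__ void staticKernel()
{
    __shared__ float matrM_sm[TILE_WIDTH][TILE_WIDTH];
    // ...
}

// Dynamic shared memory: declared extern and unsized; the actual size
// (in bytes) is supplied as the third kernel launch parameter.
__global__ void dynamicKernel(const int tile_width)
{
    extern __shared__ float shared[];
    float *matrM_sm = shared;                           // first tile
    float *matrN_sm = &shared[tile_width * tile_width]; // second tile
    // ...
}
```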

I have tried everything I could think of, as well as various suggestions, but none of them worked. This is the function (the full working file can be found HERE):

__global__ void mulKernelSM(float *matrR, const float *matrM, const float *matrN,
   const int m_x, const int m_y, const int n_x, const int n_y, const int tile_width)
{
    int i, j;
    extern __shared__ float shared[];
    float *matrM_sm = shared;
    float *matrN_sm = &shared[tile_width * tile_width];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int row = by * tile_width + ty;
    int col = bx * tile_width + tx;

    float tmp;
    int limit = ceil(m_y / (float) tile_width);
    for (i = 0; i < limit; i++)
    {
        tmp = 0.0;

        if (i * tile_width + tx < m_y && row < m_x)
            matrM_sm[ty * tile_width + tx] = matrM[row * m_y + (i * tile_width + tx)];
        else
            matrM_sm[ty * tile_width + tx] = 0.0;

        if (i * tile_width + ty < n_x && col < n_y)
            matrN_sm[ty * tile_width + tx] = matrN[col + (i * tile_width + ty) * n_y];
        else
            matrN_sm[ty * tile_width + tx] = 0.0;

        __syncthreads();

        for (j = 0; j < tile_width; j++)
            tmp += matrM_sm[ty * tile_width + j] * matrN_sm[j * tile_width + tx];

        __syncthreads();
    }

    if (row < m_x && col < n_y)
        matrR[row * n_y + col] = tmp;
}

The basic layout should work, as I have also implemented a version without shared memory that works just fine. The function without shared memory is listed below:

__global__ void mulKernel(float *matrR, const float *matrM, const float *matrN,
    const int m_x, const int m_y, const int n_x, const int n_y)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int i;

    if ((row < m_x) && (col < n_y))
    {
        float tmp = 0.0;
        for (i = 0; i < m_y; i++)
        {
            tmp += matrM[row * m_y + i] * matrN[col + n_y * i];
        }

        matrR[row * n_y + col] = tmp;
    }
}
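A host-side launch for this kernel might look like the following sketch (the `d_` pointers are assumed device allocations, and the 16×16 block size is an assumption):

```cuda
dim3 block(16, 16);
// one thread per element of the m_x-by-n_y result matrix, rounded up
dim3 grid((n_y + block.x - 1) / block.x,
          (m_x + block.y - 1) / block.y);
mulKernel<<<grid, block>>>(d_matrR, d_matrM, d_matrN, m_x, m_y, n_x, n_y);
```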

If there is any information missing I will provide it immediately.

edited by talonmies
asked by JRsz
  • You cannot use malloc to allocate shared memory. The programming guide discusses shared memory allocation, and there are numerous CUDA sample codes as well as questions here on SO that cover proper shared memory usage. If you want a runtime-allocatable shared memory size, you use the dynamic shared memory allocation method with `extern`, providing the shared memory size as a kernel launch parameter. If you want help debugging a code, you are supposed to provide a [mcve]. A CUDA kernel, by itself, is not a MCVE. – Robert Crovella Nov 15 '16 at 20:03
  • Thank you for your suggestion. I integrated the shared memory allocation with extern and updated the question so the new code is up to date. I also added a link to the entire code which represents a full and working example. It is at the beginning, before the function starts. – JRsz Nov 15 '16 at 20:36
  • The parameter to specify shared memory size in the kernel call must be specified in **bytes**. That is at least one reason for the error "an illegal memory access was encountered". Fix that and run your code with `cuda-memcheck`. If errors are still reported, use the method described [here](http://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218) to help localize the error to a specific kernel line. Then use `printf` or another debugging method if you need more info. – Robert Crovella Nov 15 '16 at 20:46
  • I see, I reserved some memory but forgot to multiply it by the size of a float. Now it works without an error, though the calculation is not correct yet. I have read about memcheck several times but could not figure out how to use it in Visual Studio on Windows. I found no option for it nor an executable on my PC. According to the manual both should be possible, but I was not able to follow either of the instructions... – JRsz Nov 15 '16 at 20:52
  • I further updated the code in pastebin to the current step. If any of you could now help me find the error in the calculation I would be immensely happy. Btw. if you make your comment an answer, Robert Crovella, I will accept it and upvote, since that solved my initial problem (though not all of them, unfortunately) – JRsz Nov 15 '16 at 20:57
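The fix described in the comments, as a sketch: the third launch configuration parameter is the dynamic shared memory size in bytes, and the kernel uses two tiles of `tile_width * tile_width` floats (the `grid`, `block`, and `d_` names are assumed to be set up as elsewhere in the code):

```cuda
size_t shared_bytes = 2 * tile_width * tile_width * sizeof(float);
mulKernelSM<<<grid, block, shared_bytes>>>(d_matrR, d_matrM, d_matrN,
                                           m_x, m_y, n_x, n_y, tile_width);
```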

1 Answer


You swapped row and col. Furthermore, I believe that to get the global thread index you should rather do this: int x_global = threadIdx.x + blockDim.x * blockIdx.x
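For comparison, the usual pattern for global row/column indices in a 2D launch is:

```cuda
int col = blockIdx.x * blockDim.x + threadIdx.x; // global column index
int row = blockIdx.y * blockDim.y + threadIdx.y; // global row index
```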

answered by Armen Avetisyan