I want to implement a simple matrix multiplication in CUDA. The dimensions of the matrices are determined at runtime, and I also want to use shared memory to gain a performance boost. I have implemented such a function, but every time I run it I get this error:
mulKernel launch failed: an illegal memory access was encountered
I am also not sure if I can use malloc to allocate shared memory. However, if I want to use something like this
__shared__ float matrM_sm[tile_width][tile_width];
the compiler complains that tile_width has to be known at compile time...
I have tried everything I can think of, as well as various suggestions, but none of them worked. This is the function (the full working file can be found HERE):
__global__ void mulKernelSM(float *matrR, const float *matrM, const float *matrN,
                            const int m_x, const int m_y, const int n_x, const int n_y, const int tile_width)
{
    int i, j;
    // Dynamically sized shared memory: holds two tile_width x tile_width tiles
    extern __shared__ float shared[];
    float *matrM_sm = shared;
    float *matrN_sm = &shared[tile_width * tile_width];
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Assumes blockDim.x == blockDim.y == tile_width
    int row = by * tile_width + ty;
    int col = bx * tile_width + tx;
    float tmp = 0.0f;  // accumulator; must persist across all tiles
    int limit = ceil(m_y / (float) tile_width);
    for (i = 0; i < limit; i++)
    {
        // Load one tile of each matrix, padding with zeros outside the bounds
        if (i * tile_width + tx < m_y && row < m_x)
            matrM_sm[ty * tile_width + tx] = matrM[row * m_y + (i * tile_width + tx)];
        else
            matrM_sm[ty * tile_width + tx] = 0.0f;
        if (i * tile_width + ty < n_x && col < n_y)
            matrN_sm[ty * tile_width + tx] = matrN[col + (i * tile_width + ty) * n_y];
        else
            matrN_sm[ty * tile_width + tx] = 0.0f;
        __syncthreads();  // both tiles fully loaded before use
        for (j = 0; j < tile_width; j++)
            tmp += matrM_sm[ty * tile_width + j] * matrN_sm[j * tile_width + tx];
        __syncthreads();  // all threads done before the tiles are overwritten
    }
    if (row < m_x && col < n_y)
        matrR[row * n_y + col] = tmp;
}
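Since the kernel uses extern __shared__, the size of that array comes from the launch and not from the kernel itself. The following is a minimal host-side sketch of how the launch values could be computed, assuming square blocks of tile_width x tile_width threads. LaunchConfig and makeLaunchConfig are hypothetical names used only for illustration; they are not taken from the linked file.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the original file): plain host-side values
// that would be passed to a <<<grid, block, sharedBytes>>> launch.
struct LaunchConfig {
    unsigned grid_x, grid_y;    // number of blocks in x (columns) and y (rows)
    unsigned block_x, block_y;  // threads per block; both equal tile_width
    std::size_t shared_bytes;   // dynamic shared memory per block, in bytes
};

// Assumes the result matrR has m_x rows and n_y columns.
LaunchConfig makeLaunchConfig(int m_x, int n_y, int tile_width)
{
    LaunchConfig cfg;
    // Integer ceiling division so partial tiles at the edges are covered.
    cfg.grid_x = static_cast<unsigned>((n_y + tile_width - 1) / tile_width);
    cfg.grid_y = static_cast<unsigned>((m_x + tile_width - 1) / tile_width);
    cfg.block_x = static_cast<unsigned>(tile_width);
    cfg.block_y = static_cast<unsigned>(tile_width);
    // Two tiles (matrM_sm and matrN_sm), each tile_width * tile_width floats.
    cfg.shared_bytes = 2u * static_cast<std::size_t>(tile_width) * tile_width * sizeof(float);
    return cfg;
}
```

With these values the launch would presumably look like mulKernelSM<<<dim3(cfg.grid_x, cfg.grid_y), dim3(cfg.block_x, cfg.block_y), cfg.shared_bytes>>>(matrR, matrM, matrN, m_x, m_y, n_x, n_y, tile_width). If the third launch parameter is left out, no dynamic shared memory is allocated, so every access to shared[] is out of bounds; that is one possible cause of an illegal memory access, though not necessarily the one here.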
The basic layout should work as I have also implemented a version without shared memory which works just fine. The function without shared memory is listed below:
__global__ void mulKernel(float *matrR, const float *matrM, const float *matrN,
                          const int m_x, const int m_y, const int n_x, const int n_y)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int i;
    if ((row < m_x) && (col < n_y))
    {
        float tmp = 0.0;
        for (i = 0; i < m_y; i++)
        {
            tmp += matrM[row * m_y + i] * matrN[col + n_y * i];
        }
        matrR[row * n_y + col] = tmp;
    }
}
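To compare the output of both kernels against a known-good result, a host-side reference with the same row-major indexing can help. This is only a sketch: mulReference and its signature are assumptions made for illustration, and it expects m_y to equal n_x.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (name and signature are assumptions):
// computes R = M * N using the same row-major indexing as mulKernel.
// M has m_x rows and m_y columns; N has m_y rows and n_y columns.
std::vector<float> mulReference(const std::vector<float> &matrM,
                                const std::vector<float> &matrN,
                                int m_x, int m_y, int n_y)
{
    std::vector<float> matrR(static_cast<std::size_t>(m_x) * n_y, 0.0f);
    for (int row = 0; row < m_x; row++)
    {
        for (int col = 0; col < n_y; col++)
        {
            float tmp = 0.0f;  // one accumulator per output element
            for (int i = 0; i < m_y; i++)
                tmp += matrM[row * m_y + i] * matrN[col + n_y * i];
            matrR[row * n_y + col] = tmp;
        }
    }
    return matrR;
}
```

After copying the device result back to the host, each element can be compared against this reference with a small tolerance, since the GPU may sum the products in a different order.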
If any information is missing, I will provide it immediately.