
I've written a CUDA C program that parallelizes matrix multiplication by having each thread calculate one row of the result matrix. The matrices are stored as 1D arrays in row-major order. I can't find any reason why my program shouldn't be working, whether it's an issue with pointers or with the kernel code. Help would be appreciated, thanks!

Code:

#include <cuda.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

__global__ void multiplyMatricesKernel(float* d_x, float* d_y, float* d_z, int m, int n, int p)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if(i < m)
    {
        for(int j = 0; j < p; ++j)
        {
            d_z[i * p + j] = 0;
            for(int k = 0; k < m; ++k)
            {
                d_z[i * p + j] += d_x[i * n + k] * d_y[k * p + j];
            }
        }
    }
}

void multiplyMatrices(float* x, float* y, float* z, int m, int n, int p)
{

    int elements_x = m * n * sizeof(float);
    int elements_y = n * p * sizeof(float);
    int elements_z = m * p * sizeof(float);

    float* d_x;
    float* d_y;
    float* d_z;

    cudaMalloc((void**) &d_x, elements_x);
    cudaMalloc((void**) &d_y, elements_y);
    cudaMalloc((void**) &d_z, elements_z);

    cudaMemcpy(d_x, x, elements_x, cudaMemcpyHostToDevice); 
    cudaMemcpy(d_y, y, elements_y, cudaMemcpyHostToDevice);

    multiplyMatricesKernel<<<ceil(m / 64.0), 64>>>(d_x, d_y, d_z, m, n, p);

    cudaMemcpy(z, d_z, elements_z, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_z);
}

int main()
{
    srand(time(NULL));
    int m = rand() % 8 + 1;
    int n = rand() % 8 + 1;
    int p = rand() % 8 + 1;

    float x[m * n] = {0};
    float y[n * p] = {0};
    float z[m * p] = {0};

    printf("X =\n[");
    for(int i = 0; i < sizeof(x) / sizeof(float); ++i)
    {
        x[i] = rand() % 129 - 64;
        printf("%.1f ", x[i]);
        if((i + 1) % n == 0 && i != (sizeof(x) / sizeof(float) - 1))
        {
            printf("]\n[");
        }
        if(i == (sizeof(x) / sizeof(float) - 1))
        {
            printf("]\n\n");
        }
    }
    
    printf("Y = \n[");
    for(int i = 0; i < sizeof(y) / sizeof(float); ++i)
    {
        y[i] = rand() % 129 - 64;
        printf("%.1f ", y[i]);
        if((i + 1) % p == 0 && i != (sizeof(y) / sizeof(float) - 1))
        {
            printf("]\n[");
        }
        if(i == (sizeof(y) / sizeof(float) - 1))
        {
            printf("]\n\n");
        }
    }

    multiplyMatrices(x, y, z, m, n, p);

    printf("Z = \n[");
    for(int i = 0; i < sizeof(z) / sizeof(float); ++i)
    {   
        printf("%.1f ", z[i]);
        if((i + 1) % p == 0 && i != (sizeof(z) / sizeof(float) - 1))
        {
            printf("]\n[");
        }
        if(i == (sizeof(z) / sizeof(float) - 1))
        {
            printf("]\n\n");
        }
    }
    return 0;
}
catfood

3 Answers


Regarding your kernel, there is a mistake in dimensioning y and z

    int elements_x = m * n * sizeof(float);
    int elements_y = n * p * sizeof(float);
    int elements_z = m * p * sizeof(float);
  1. x is m x n
  2. y is m x p, not n x p
  3. z is n x p, not m x p

This kind of error is very difficult to find...
YLS
  • No, that's incorrect. My dimensioning is fine - matrix multiplication requires that the second matrix have as many rows as the first matrix has columns. – catfood Oct 02 '20 at 13:35
  • In the inner loop of your kernel you write `d_z[i * p + j] += d_x[i * n + k] * d_y[k * p + j];`, so d_y is m x p, since k ranges over 0 to m-1 – YLS Oct 02 '20 at 13:41
  • You have a dimension problem. If you run your code enough times with `cuda-memcheck` you can eventually prove it to yourself. For a matrix multiplication of `mxn` times `nxp` matrices, the result would be of size `mxp`, so your dimensions appear correct, but with that sort of problem, *the innermost for-loop in the matrix multiply should iterate over length `n`, not `m`*. So that is definitely a problem. You may also have other problems with your machine setup, which could be discovered with [proper CUDA error checking](https://stackoverflow.com/questions/14038589/). – Robert Crovella Oct 02 '20 at 13:46
  • Yes, thank you @RobertCrovella for pointing that out - I was mulling over how to actually make the kernel code for a couple hours - that error inevitably slipped in there. I'll do the proper CUDA error checking thing - see if that helps me. Z still outputs as an array of 0s unfortunately. – catfood Oct 02 '20 at 13:56
  • I suspect your output of zeros is probably due to another issue - possibly a machine setup issue. From my perspective, this is rather odd kernel code, although I'm sure it could be made to work. I didn't try to identify all the issues (there may be others), but that one seemed obvious to me. A canonical example of (naive) matrix multiplication on the GPU is given in the [CUDA programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory). – Robert Crovella Oct 02 '20 at 13:58
  • I am a CUDA beginner - I've based my kernel code on the kernel code for vector addition from a 2013 CUDA textbook (Programming Massively Parallel Processors by Wen-mei Hwu and David B. Kirk, 2nd Edition). The code they used there was exactly the same as the kernel code I've used here - just that the matrix-multiplying part is changed to a single addition of elements for each thread. – catfood Oct 02 '20 at 14:05
  • Googling turned up one of your answers to something else - how do I compile as mentioned in this (from the Linux command line): https://forums.developer.nvidia.com/t/the-kernel-always-returns-values-equal-to-zero/35565/4 @RobertCrovella – catfood Oct 02 '20 at 14:11
  • This sounds like a different question/issue, and actually I don't understand what question you are asking. Are you asking how to compile from the Linux command line targeting a specific GPU architecture? – Robert Crovella Oct 02 '20 at 14:34
  • @RobertCrovella Nevermind, I managed to fix it - I'll put my solution in my answer by editing it. – catfood Oct 02 '20 at 19:45
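The error checking suggested in the comments is usually done with a small macro that wraps every runtime call. A minimal sketch of that common pattern (the macro name is illustrative, not from the question; untested here since it needs a CUDA toolchain):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information if a CUDA runtime call returns an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**)&d_x, elements_x));
//   multiplyMatricesKernel<<<grid, block>>>(d_x, d_y, d_z, m, n, p);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches asynchronous execution errors
```

With a broken driver/module setup, the very first `cudaMalloc` would report an error instead of silently leaving `z` full of zeros.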

The problem is in this line inside your kernel.

for(int k = 0; k < m; ++k)

It should be n, not m, as each element is the sum of n multiplications, like this:

for(int k = 0; k < n; ++k)

I'd also add that you can make the block size 1024, not only 64.

multiplyMatricesKernel<<<ceil(m / 1024.0), 1024>>>(d_x, d_y, d_z, m, n, p);

Finally, you can make it even faster by calculating all elements in parallel, not only the rows.
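A minimal sketch of such an element-per-thread version, using a 2D grid (this is an illustration based on the common pattern, untested here):

```cuda
// One thread per output element: thread (row, col) computes d_z[row * p + col].
__global__ void multiplyElementsKernel(const float* d_x, const float* d_y,
                                       float* d_z, int m, int n, int p)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < p) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += d_x[row * n + k] * d_y[k * p + col];
        d_z[row * p + col] = sum;
    }
}

// Launched with a 2D grid sized by integer ceiling division, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((p + block.x - 1) / block.x, (m + block.y - 1) / block.y);
//   multiplyElementsKernel<<<grid, block>>>(d_x, d_y, d_z, m, n, p);
```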

AbdelAziz AbdelLatef

I managed to solve my issue. It turned out that because I hadn't rebooted my Arch Linux system after a kernel package update, the nvidia module needed for GPU memory allocation and GPU-to-CPU memory transfers (nvidia_uvm) wasn't loaded. A restart of my system did the trick. Thanks for all your help, especially Robert Crovella and AbdelAziz AbdelLatef for pointing out the iteration error in my kernel. Thanks!
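For anyone hitting the same symptom, the module state can be inspected from the command line roughly like this (commands assume a typical Linux NVIDIA driver install; a reboot after a kernel update achieves the same thing):

```shell
# Is the unified-memory module loaded?
lsmod | grep nvidia_uvm

# If it is missing, try loading it manually
sudo modprobe nvidia_uvm
```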

catfood