
I've written a CUDA C program that parallelizes matrix multiplication by having each thread calculate one row of the result matrix. The matrices are stored as 1D arrays in row-major order. I can't find any reason why my program shouldn't be working, whether it's an issue with pointers or with the kernel code. Help would be appreciated, thanks!

Code:

#include <cuda.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

__global__ void multiplyMatricesKernel(float* d_x, float* d_y, float* d_z, int m, int n, int p)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if(i < m)
    {
        for(int j = 0; j < p; ++j)
        {
            d_z[i * p + j] = 0;
            for(int k = 0; k < m; ++k)
            {
                d_z[i * p + j] += d_x[i * n + k] * d_y[k * p + j];
            }
        }
    }
}

void multiplyMatrices(float* x, float* y, float* z, int m, int n, int p)
{

    int elements_x = m * n * sizeof(float);
    int elements_y = n * p * sizeof(float);
    int elements_z = m * p * sizeof(float);

    float* d_x;
    float* d_y;
    float* d_z;

    cudaMalloc((void**) &d_x, elements_x);
    cudaMalloc((void**) &d_y, elements_y);
    cudaMalloc((void**) &d_z, elements_z);

    cudaMemcpy(d_x, x, elements_x, cudaMemcpyHostToDevice); 
    cudaMemcpy(d_y, y, elements_y, cudaMemcpyHostToDevice);

    multiplyMatricesKernel<<<ceil(m / 64.0), 64>>>(d_x, d_y, d_z, m, n, p);

    cudaMemcpy(z, d_z, elements_z, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_z);
}

int main()
{
    srand(time(NULL));
    int m = rand() % 8 + 1;
    int n = rand() % 8 + 1;
    int p = rand() % 8 + 1;

    float x[m * n] = {0};
    float y[n * p] = {0};
    float z[m * p] = {0};

    printf("X =\n[");
    for(int i = 0; i < sizeof(x) / sizeof(float); ++i)
    {
        x[i] = rand() % 129 - 64;
        printf("%.1f ", x[i]);
        if((i + 1) % n == 0 && i != (sizeof(x) / sizeof(float) - 1))
        {
            printf("]\n[");
        }
        if(i == (sizeof(x) / sizeof(float) - 1))
        {
            printf("]\n\n");
        }
    }
    
    printf("Y = \n[");
    for(int i = 0; i < sizeof(y) / sizeof(float); ++i)
    {
        y[i] = rand() % 129 - 64;
        printf("%.1f ", y[i]);
        if((i + 1) % p == 0 && i != (sizeof(y) / sizeof(float) - 1))
        {
            printf("]\n[");
        }
        if(i == (sizeof(y) / sizeof(float) - 1))
        {
            printf("]\n\n");
        }
    }

    multiplyMatrices(x, y, z, m, n, p);

    printf("Z = \n[");
    for(int i = 0; i < sizeof(z) / sizeof(float); ++i)
    {   
        printf("%.1f ", z[i]);
        if((i + 1) % p == 0 && i != (sizeof(z) / sizeof(float) - 1))
        {
            printf("]\n[");
        }
        if(i == (sizeof(z) / sizeof(float) - 1))
        {
            printf("]\n\n");
        }
    }
    return 0;
}
catfood

3 Answers


Regarding your kernel, there is a mistake in dimensioning y and z

    int elements_x = m * n * sizeof(float);
    int elements_y = n * p * sizeof(float);
    int elements_z = m * p * sizeof(float);
  1. x is m x n
  2. y is m x p, not n x p
  3. z is n x p, not m x p

This kind of error is very difficult to find...
YLS
  • No, that's incorrect. My dimensioning is fine - matrix multiplication requires that the second matrix have as many rows as the first matrix has columns. – catfood Oct 02 '20 at 13:35
  • In the inner loop of your kernel you write `d_z[i * p + j] += d_x[i * n + k] * d_y[k * p + j];`, so d_y is m x p, since k ranges over 0 to m-1 – YLS Oct 02 '20 at 13:41
  • You have a dimension problem. If you run your code enough times with `cuda-memcheck` you can eventually prove it to yourself. For a matrix multiplication of `mxn` times `nxp` matrices, the result would be of size `mxp`, so your dimensions appear correct, but with that sort of problem, *the innermost for-loop in the matrix multiply should iterate over length `n`, not `m`*. So that is definitely a problem. You may also have other problems with your machine setup, which could be discovered with [proper CUDA error checking](https://stackoverflow.com/questions/14038589/). – Robert Crovella Oct 02 '20 at 13:46
  • Yes, thank you @RobertCrovella for pointing that out - I was mulling over how to actually make the kernel code for a couple hours - that error inevitably slipped in there. I'll do the proper CUDA error checking thing - see if that helps me. Z still outputs as an array of 0s unfortunately. – catfood Oct 02 '20 at 13:56
  • I suspect your output of zeros is probably due to another issue - possibly a machine setup issue. From my perspective, this is rather odd kernel code, although I'm sure it could be made to work. I didn't try to identify all the issues (there may be others), but that one seemed obvious to me. A canonical example of (naive) matrix multiplication on the GPU is given in the [CUDA programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory). – Robert Crovella Oct 02 '20 at 13:58
  • I am a CUDA beginner - I've based my kernel code on the kernel code for vector addition from a 2013 CUDA textbook (Programming Massively Parallel Processors by Wen-mei Hwu and David B. Kirk, 2nd Edition). The code they used there was exactly the same as the kernel code I've used here - just that the matrix-multiplying part is changed to a single addition of elements for each thread. – catfood Oct 02 '20 at 14:05
  • Googling turned up one of your answers to something else - how do I compile as mentioned in this (from the Linux command line): https://forums.developer.nvidia.com/t/the-kernel-always-returns-values-equal-to-zero/35565/4 @RobertCrovella – catfood Oct 02 '20 at 14:11
  • This sounds like a different question/issue, and actually I don't understand what question you are asking. Are you asking how to compile from the Linux command line targeting a specific GPU architecture? – Robert Crovella Oct 02 '20 at 14:34
  • @RobertCrovella Nevermind, I managed to fix it - I'll put my solution in my answer by editing it. – catfood Oct 02 '20 at 19:45
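The error checking suggested in the comments is usually done with a small macro that wraps every runtime call. A minimal sketch of that common pattern (the macro name is illustrative, not from the question; untested here since it needs a CUDA toolchain):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information if a CUDA runtime call returns an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**)&d_x, elements_x));
//   multiplyMatricesKernel<<<grid, block>>>(d_x, d_y, d_z, m, n, p);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches asynchronous execution errors
```

With a broken driver/module setup, the very first `cudaMalloc` would report an error instead of silently leaving `z` full of zeros.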

The problem is in this line inside your kernel.

for(int k = 0; k < m; ++k)

It should be n, not m, as each element is the sum of n multiplications, like this:

for(int k = 0; k < n; ++k)

I'd also add that you can make the block size 1024, not only 64.

multiplyMatricesKernel<<<ceil(m / 1024.0), 1024>>>(d_x, d_y, d_z, m, n, p);

Finally, you can make it even faster by calculating all elements in parallel, not only the rows.
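A minimal sketch of such an element-per-thread version, using a 2D grid (this is an illustration based on the common pattern, untested here):

```cuda
// One thread per output element: thread (row, col) computes d_z[row * p + col].
__global__ void multiplyElementsKernel(const float* d_x, const float* d_y,
                                       float* d_z, int m, int n, int p)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < p) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += d_x[row * n + k] * d_y[k * p + col];
        d_z[row * p + col] = sum;
    }
}

// Launched with a 2D grid sized by integer ceiling division, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((p + block.x - 1) / block.x, (m + block.y - 1) / block.y);
//   multiplyElementsKernel<<<grid, block>>>(d_x, d_y, d_z, m, n, p);
```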

AbdelAziz AbdelLatef

I managed to solve my issue. It turned out that because I hadn't rebooted my Arch Linux system after a kernel package update, the nvidia module needed for GPU memory allocation and GPU-to-CPU memory transfers (nvidia_uvm) wasn't loaded. A restart of my system did the trick. Thanks for all your help, especially Robert Crovella and AbdelAziz AbdelLatef for pointing out the iteration error in my kernel. Thanks!
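For anyone hitting the same symptom, the module state can be inspected from the command line roughly like this (commands assume a typical Linux NVIDIA driver install; a reboot after a kernel update achieves the same thing):

```shell
# Is the unified-memory module loaded?
lsmod | grep nvidia_uvm

# If it is missing, try loading it manually
sudo modprobe nvidia_uvm
```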

catfood