I've found some code for a CUDA matrix-vector product in a previous topic: Matrix-vector multiplication in CUDA: benchmarking & performance. First, I was wondering why the author didn't use shared memory for dA (the matrix)?

And then, why is column-major ordering faster than row-major ordering?

Here is the code:

template<typename T>
__global__ void matvec_kernel(const T * __restrict__ dA, const T * __restrict__ dx, T * __restrict__ dy, const unsigned int nRows, const unsigned int nCols)
{
    // --- One thread per matrix row, i.e. per element of the output vector dy
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // --- Shared tile holding one BLOCK_SIZE-long chunk of the input vector dx
    __shared__ T x_shared[BLOCK_SIZE];

    T y_val = 0.0;

    // --- Sweep the columns in tiles of BLOCK_SIZE; each iteration first caches one tile of dx
    #pragma unroll
    for (unsigned int m = 0; m < ((nCols + BLOCK_SIZE - 1)/ BLOCK_SIZE); ++m)
    {
        if ((m * BLOCK_SIZE + threadIdx.x) <  nCols) x_shared[threadIdx.x] = dx[threadIdx.x + m * BLOCK_SIZE];
        else                                         x_shared[threadIdx.x] = 0.f;
        __syncthreads();

        #pragma unroll
        for (unsigned int e = 0; e < BLOCK_SIZE; ++e) {
            // --- Column-major ordering - faster
            y_val += dA[tid + (e + BLOCK_SIZE * m) * nRows] * x_shared[e];
            // --- Row-major ordering - slower
            //y_val += dA[tid * nCols + (e + BLOCK_SIZE * m)] * x_shared[e];
        }

        __syncthreads();
    }

    if (tid < nRows) dy[tid] = y_val;

}
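For context, here is a minimal host-side launch sketch for this kernel; BLOCK_SIZE, the matrix dimensions and the float instantiation are my assumptions, not part of the linked topic:

#include <cuda_runtime.h>
#define BLOCK_SIZE 256

// matvec_kernel as defined above

int main()
{
    const unsigned int nRows = 4096, nCols = 4096;

    // --- Device buffers; dA is stored column-major, i.e. A(i, j) = dA[i + j * nRows]
    float *dA, *dx, *dy;
    cudaMalloc(&dA, nRows * nCols * sizeof(float));
    cudaMalloc(&dx, nCols * sizeof(float));
    cudaMalloc(&dy, nRows * sizeof(float));
    // (filling dA and dx with data is omitted here)

    // --- One thread per output row, rounded up to a whole number of blocks
    const unsigned int nBlocks = (nRows + BLOCK_SIZE - 1) / BLOCK_SIZE;
    matvec_kernel<float><<<nBlocks, BLOCK_SIZE>>>(dA, dx, dy, nRows, nCols);
    cudaDeviceSynchronize();

    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}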

I've been thinking about these two questions for a day now, and that's why I'm here.

Thanks a lot!

1 Answer

Shared memory here works as a cache. Each component of the vector is read many times (once per matrix row), whereas each component of the matrix is read only once during the calculation. That's why the code caches only the vector and not the matrix: caching dA would bring no reuse and therefore no benefit.

A column-major matrix is faster because, when reading the matrix, the threads are organized along the matrix columns: consecutive threads read consecutive elements of a column, which sit next to each other in memory. Column-major storage thus ensures coalesced global memory access. If the matrix is row-major, the CUDA kernel has to be implemented in a different way to achieve maximum performance.
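To see the access pattern concretely, consider what consecutive threads of a warp read for a fixed tile m and inner index e. The little host-side sketch below (the sizes are just illustrative assumptions) prints the element offsets produced by the two indexing schemes:

#include <cstdio>

int main()
{
    const unsigned int nRows = 4096, nCols = 4096, BLOCK_SIZE = 256;
    const unsigned int m = 0, e = 0;   // first tile, first inner iteration

    for (unsigned int tid = 0; tid < 8; ++tid) {
        // Column-major: neighbouring threads touch neighbouring elements -> coalesced
        const unsigned int colMajor = tid + (e + BLOCK_SIZE * m) * nRows;
        // Row-major: neighbouring threads are nCols elements apart -> scattered
        const unsigned int rowMajor = tid * nCols + (e + BLOCK_SIZE * m);
        printf("tid %u: col-major offset %u, row-major offset %u\n", tid, colMajor, rowMajor);
    }
    return 0;
}

With the column-major indexing the offsets are consecutive (0, 1, 2, ...), so a warp's loads fall into a single memory transaction; with the row-major indexing they are nCols elements apart, so each thread triggers its own transaction.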

– kangshiyin
  • So in order to achieve maximum performance with row-major I need to use threadIdx.y and nRows instead of threadIdx.x / nCols (during the matrix reading phase)? – Titouan Parcollet Jul 28 '16 at 19:05
  • @TitouanParcollet No. It will be quite different from the above kernel. The above one uses one *thread* per matrix row, which actually is not optimal in terms of performance, unless the matrix is extremely large. For a row-major matrix, you could use one *thread block* per matrix row, and use a parallel reduction to calculate the row sum (see the sketch after these comments). – kangshiyin Jul 28 '16 at 19:28
  • Well, I'm quite new to GPGPU and CUDA... I'll do my best, thanks for these answers. But it's quite strange that I can't find a "standard" matrix-vector product kernel. Isn't it possible to make a standard, simple kernel doing this? – Titouan Parcollet Jul 28 '16 at 19:43
  • Yes, I've heard about cuBLAS, but in fact I can't use it. I need to build my own neural network based on real numbers to compare it to another one which will use hyper-complex numbers. In order to achieve this goal, I need to know every single optimisation in the code to compare both neural networks. Using cuBLAS I won't be able to do this, because I can't use cuBLAS with hyper-complex numbers. :/ – Titouan Parcollet Jul 28 '16 at 19:48
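For reference, here is a minimal sketch of the approach suggested in the second comment above: one thread block per row of a row-major matrix, with a shared-memory parallel reduction for the row sum. The kernel name, BLOCK_SIZE and the launch configuration are assumptions for illustration, not code from the answer; BLOCK_SIZE must be a power of two for this simple reduction:

#define BLOCK_SIZE 256

template<typename T>
__global__ void matvec_rowmajor_kernel(const T * __restrict__ dA, const T * __restrict__ dx, T * __restrict__ dy, const unsigned int nRows, const unsigned int nCols)
{
    const unsigned int row = blockIdx.x;   // --- One block per matrix row / output element
    __shared__ T partial[BLOCK_SIZE];

    // --- Each thread strides over the row; consecutive threads read consecutive
    //     columns, so the row-major accesses are coalesced
    T sum = 0.0;
    for (unsigned int col = threadIdx.x; col < nCols; col += blockDim.x)
        sum += dA[row * nCols + col] * dx[col];

    partial[threadIdx.x] = sum;
    __syncthreads();

    // --- Shared-memory tree reduction over the block
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0 && row < nRows) dy[row] = partial[0];
}

// Launched with one block per row, e.g.:
// matvec_rowmajor_kernel<float><<<nRows, BLOCK_SIZE>>>(dA, dx, dy, nRows, nCols);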