
Given an n-by-m matrix, I would like to build an n-sized vector containing the minimum of each matrix row, in CUDA.

So far I've come up with this:

__global__ void OnMin(float * Mins, const float * Matrix, const int n, const int m) {
    // One thread per row: thread i scans row i sequentially.
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        Mins[i] = Matrix[m * i];  // start with the first element of row i
        for (int j = 1; j < m; ++j){
            if (Matrix[m * i + j] < Mins[i])  // keep the running minimum
                Mins[i] = Matrix[m * i + j];
        }
    }
}

called as:

OnMin<<<(n + TPB - 1) / TPB, TPB>>>(Mins, Matrix, n, m);

However, I think something more optimized must exist.

I tried invoking cublasIsamin in a loop, but it is slower.

I also tried launching a kernel (__global__) from the OnMin kernel, without success... (building with sm_35/compute_35 raises compile errors... I have a GTX 670)

Any ideas?

Thanks!

  • 1
    Maybe you could improve it (I suppose it's memory bound, since there is basically no math to do) by transposing your matrix first; see the sketch after these comments. The issue is that the memory access pattern is not coalesced at all (threads in a given warp read consecutive entries of a column, which are not at all consecutive in memory). Also, have you tried looking at nvvp's results? – Shawn Aug 06 '16 at 21:38
  • 1
    Possible duplicate of [Reduce matrix rows with CUDA](http://stackoverflow.com/questions/17862078/reduce-matrix-rows-with-cuda) – Pavan Yalamanchili Aug 07 '16 at 07:35
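Following the first comment's suggestion: if the matrix is stored transposed (m-by-n), the one-thread-per-row loop becomes fully coalesced, because consecutive threads then read consecutive addresses. A minimal sketch of that variant (the name OnMinT and the transposed layout are assumptions for illustration, not code from the question):

__global__ void OnMinT(float * Mins, const float * MatrixT, const int n, const int m) {
    // MatrixT is the m-by-n transpose: element (i, j) of the original
    // matrix lives at MatrixT[j * n + i].
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        float v = MatrixT[i];  // element (i, 0)
        for (int j = 1; j < m; ++j)
            // Adjacent threads read adjacent floats here, so each warp
            // issues one coalesced load per iteration.
            v = fminf(v, MatrixT[j * n + i]);
        Mins[i] = v;
    }
}

The transpose itself costs a full pass over the matrix, so this only pays off if the data is already available (or reused) in column-major order.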

1 Answer


Finding the min of the rows of a row-major matrix is a parallel reduction question that has been discussed many times on Stack Overflow, for example in this one:

[Reduce matrix rows with CUDA](http://stackoverflow.com/questions/17862078/reduce-matrix-rows-with-cuda)

The basic idea is to use n blocks in the grid. Each block contains a fixed number of threads, typically 256, and each block of threads collaboratively performs a parallel reduction over the m elements of its row to find the minimum.
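A minimal sketch of that scheme, assuming a power-of-two block size (the kernel name RowMin and the 256-thread choice are illustrative, not taken verbatim from the linked answer):

#include <cfloat>  // FLT_MAX

__global__ void RowMin(float * Mins, const float * Matrix, const int n, const int m) {
    __shared__ float smin[256];  // one slot per thread in the block
    const int row = blockIdx.x;  // one block per matrix row
    const int tid = threadIdx.x;

    // Phase 1: each thread strides across the row, keeping a running minimum.
    // Consecutive threads read consecutive elements, so the loads are coalesced.
    float v = FLT_MAX;
    for (int j = tid; j < m; j += blockDim.x)
        v = fminf(v, Matrix[row * m + j]);
    smin[tid] = v;
    __syncthreads();

    // Phase 2: tree reduction in shared memory (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smin[tid] = fminf(smin[tid], smin[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
        Mins[row] = smin[0];  // thread 0 writes the row minimum
}

launched with one block per row:

RowMin<<<n, 256>>>(Mins, Matrix, n, m);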

For a matrix large enough to fully utilize the GPU, such a kernel is memory-bandwidth bound: it reads every element exactly once, so its run time is bounded below by roughly half the time of copying the matrix once on the device (a copy both reads and writes each element).
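One way to see how close a kernel gets to that bound is to time it against a plain device-to-device copy using CUDA events. A sketch (d_Matrix, d_Mins, and the scratch buffer d_Copy are assumed to be already allocated on the device):

#include <cstdio>

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Time the reduction kernel.
cudaEventRecord(start);
RowMin<<<n, 256>>>(d_Mins, d_Matrix, n, m);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float kernelMs;
cudaEventElapsedTime(&kernelMs, start, stop);

// Time a device-to-device copy of the same data.
cudaEventRecord(start);
cudaMemcpy(d_Copy, d_Matrix, sizeof(float) * n * m, cudaMemcpyDeviceToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float copyMs;
cudaEventElapsedTime(&copyMs, start, stop);

// The copy reads and writes every element; the reduction only reads them,
// so roughly copyMs / 2 is the bandwidth-bound floor for the kernel.
printf("kernel: %.3f ms, copy: %.3f ms, floor: ~%.3f ms\n",
       kernelMs, copyMs, copyMs / 2.0f);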

– kangshiyin