Given an n-by-m matrix, I would like to build a vector of size n containing the minimum of each matrix row, in CUDA.
So far I have come up with this:
__global__ void OnMin(float * Mins, const float * Matrix, const int n, const int m) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        Mins[i] = Matrix[m * i];
        for (int j = 1; j < m; ++j) {
            if (Matrix[m * i + j] < Mins[i])
                Mins[i] = Matrix[m * i + j];
        }
    }
}
It is launched with:
OnMin<<<(n + TPB - 1) / TPB, TPB>>>(Mins, Matrix, n, m);
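To make the launch reproducible, here is a minimal host-side sketch around the kernel above. The helper name rowMins, the value TPB = 256, and the use of std::vector are assumptions for illustration only; the matrix is taken to be row-major, as the indexing Matrix[m * i + j] implies, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel from the question, unchanged in logic: one thread per row.
__global__ void OnMin(float *Mins, const float *Matrix, const int n, const int m) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        Mins[i] = Matrix[m * i];
        for (int j = 1; j < m; ++j) {
            if (Matrix[m * i + j] < Mins[i])
                Mins[i] = Matrix[m * i + j];
        }
    }
}

// Hypothetical host helper (not part of the original post).
// Assumes a row-major n-by-m matrix with n > 0 and m > 0.
std::vector<float> rowMins(const std::vector<float> &hostMatrix, int n, int m) {
    const int TPB = 256;  // assumed threads per block
    const size_t matrixBytes = sizeof(float) * n * m;
    const size_t minsBytes = sizeof(float) * n;

    float *dMatrix = nullptr;
    float *dMins = nullptr;
    cudaMalloc(&dMatrix, matrixBytes);
    cudaMalloc(&dMins, minsBytes);
    cudaMemcpy(dMatrix, hostMatrix.data(), matrixBytes, cudaMemcpyHostToDevice);

    // Same launch configuration as in the question: enough blocks to cover n rows.
    OnMin<<<(n + TPB - 1) / TPB, TPB>>>(dMins, dMatrix, n, m);

    // Blocking copy on the default stream, so it runs after the kernel has finished.
    std::vector<float> mins(n);
    cudaMemcpy(mins.data(), dMins, minsBytes, cudaMemcpyDeviceToHost);

    cudaFree(dMatrix);
    cudaFree(dMins);
    return mins;
}
```

Calling rowMins on a 2-by-3 matrix holding the rows (3, 1, 2) and (-4, 5, 0) should return the two row minimums, 1 and -4.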
However, I suspect that a more efficient approach exists.
I tried calling cublasIsamin once per row in a loop, but that is slower.
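A per-row cuBLAS loop of this kind could look roughly like the sketch below. It is an illustration only, not the code that was actually timed: it assumes the cuBLAS v2 API (cublas_v2.h), a row-major matrix already on the device, and a hypothetical helper name rowArgMinAbs. Two caveats apply. cublasIsamin returns the 1-based index of the element with the smallest absolute value, so it matches the true row minimum only when the row has no negative entries. And with the default host pointer mode each call waits for its result, so the loop pays for n separate launches and transfers, which probably accounts for much of the slowdown.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not part of the original post).
// For each row of a row-major n-by-m device matrix, returns the 0-based column
// index picked by cublasIsamin, i.e. the element of smallest absolute value.
std::vector<int> rowArgMinAbs(const float *dMatrix, int n, int m) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    std::vector<int> idx(n);
    for (int i = 0; i < n; ++i) {
        int oneBased = 0;  // cuBLAS reports 1-based indices
        // Row i starts at offset i * m; elements within a row are contiguous (stride 1).
        cublasIsamin(handle, m, dMatrix + (size_t)i * m, 1, &oneBased);
        idx[i] = oneBased - 1;
    }

    cublasDestroy(handle);
    return idx;
}
```

The index still has to be turned into a value afterwards (for example by reading Matrix[m * i + idx[i]]), which adds further work compared with the single OnMin kernel.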
I also tried launching a __global__ kernel from inside the OnMin kernel, without success: compiling with sm_35 / compute_35 raises compile errors (I have a GTX 670).
Any ideas?
Thanks!