
What is an efficient way to check a large matrix for inf/nan elements in CUDA (C++)? The matrix is stored as float* in GPU memory. I don't need the locations of those elements, just a boolean yes/no answer indicating whether at least one bad entry is present.

The options are:

  • have one kernel check the whole array (easy to implement but probably slow)
  • have multiple kernels check e.g. the rows and combine the output with OR (are there any CUDA builtins for doing this efficiently?)
  • ..other ideas?

Thanks!

Vitality
Pavel
  • Letting one kernel check one row seems a reasonable compromise between efficiency and ease of implementation to me. But I do OpenCL, which isn't quite the same as CUDA. – Mats Petersson Mar 29 '14 at 17:23
  • You're likely to get better results if you check the values as you generate them, which I imagine is already split between cores anyway. – Dave Mar 29 '14 at 17:30

2 Answers


There are intrinsics for this, but the functions available for C99 should be fine:

isnan()

To test for inf, you can use:

isinf()

It's rarely faster to have multiple kernels do the same work as a single well-written kernel, so I'm not sure why you think a single kernel would be slow. This algorithm is likely to be memory-bound, so you want to focus on read data access efficiency, i.e. coalescing. In CUDA, the easy way to go through a matrix is to have each thread handle a column. This can be implemented efficiently with a for-loop and results in perfectly coalesced reads.

Since you only care about a single result with no indices, we can have multiple threads writing to a (boolean) result without atomics, for further efficiency, since any threads that might be writing to the result would all be writing the same value.

Another optimization one might consider is an early-exit strategy, but this does not improve the worst-case timing (in fact it makes it longer), so I would dispense with it unless average throughput is a big issue.
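If average throughput does matter, such an early-exit variant might look something like the sketch below (hypothetical and untested; note it uses the opposite flag convention, i.e. `found` starts as false and is set on the first bad value):

```cuda
// Hypothetical early-exit sketch (not from the original answer): each thread
// re-checks a shared flag before each read and stops as soon as any thread
// has reported a bad value. "found" must be initialized to false on the host.
__global__ void has_bad_value(const float *data, int width, int height,
                              volatile bool *found)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    while (idx < width) {                       // grid-stride loop over columns
        for (int i = 0; i < height; i++) {
            if (*found) return;                 // another thread already hit one
            float v = data[(i * width) + idx];
            if (isnan(v) || isinf(v)) {
                *found = true;                  // benign race: all writers store true
                return;
            }
        }
        idx += gridDim.x * blockDim.x;
    }
}
```

Polling the flag before every element adds an extra read per element; checking it only once per row or per chunk would be a reasonable compromise.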

Here's a complete worked example (using test for nan as an example):

$ cat t383.cu
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define DSIZEW 10000
#define DSIZEH 2000
#define nTPB 256
#define BLKS 16

__global__ void isnan_test(float *data, int width, int height, bool *result){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;

  while (idx < width){                  // grid-stride loop over columns
    for (int i = 0; i < height; i++)    // walk down one column: coalesced reads
      if (isnan(data[(i*width) + idx])) *result = false;  // clear the "all clean" flag
    idx += gridDim.x*blockDim.x;        // advance by the total thread count in the grid
    }
}

int main(){

  float *d_data, *h_data;
  bool  *d_result, h_result=true;
  const char type = '0';

  cudaMalloc((void **)&d_data, sizeof(float)*DSIZEW*DSIZEH);
  cudaMalloc((void **)&d_result, sizeof (bool));
  h_data=(float *)malloc(sizeof(float)*DSIZEW*DSIZEH);
  for (int i=0; i<DSIZEH*DSIZEW; i++)
    h_data[i] = rand()/(float)RAND_MAX;
  cudaMemcpy(d_data, h_data, sizeof(float)*DSIZEW*DSIZEH, cudaMemcpyHostToDevice);
  cudaMemcpy(d_result, &h_result, sizeof(bool), cudaMemcpyHostToDevice);
  isnan_test<<<BLKS,nTPB>>>(d_data, DSIZEW, DSIZEH, d_result);
  cudaMemcpy(&h_result, d_result, sizeof(bool), cudaMemcpyDeviceToHost);
  if (!h_result) {printf("error in no-NAN check\n"); return 1;}
  float my_nan = nanf(&type); // create a NAN value
  cudaMemcpy(d_data, &my_nan, sizeof(float), cudaMemcpyHostToDevice);
  isnan_test<<<BLKS,nTPB>>>(d_data, DSIZEW, DSIZEH, d_result);
  cudaMemcpy(&h_result, d_result, sizeof(bool), cudaMemcpyDeviceToHost);
  if (h_result) {printf("error in NAN check\n"); return 1;}
  printf("Success\n");
  return 0;
}


$ nvcc -arch=sm_20 -o t383 t383.cu
$ ./t383
Success
$

Note that I have dispensed with proper CUDA error checking for clarity/brevity, but it is always recommended.
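For reference, a common checking idiom looks like this (a sketch; the macro name is my own invention):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; report and abort on the first failure.
#define cudaCheck(call)                                                   \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// usage:
//   cudaCheck(cudaMalloc((void **)&d_data, nbytes));
//   isnan_test<<<BLKS,nTPB>>>(d_data, DSIZEW, DSIZEH, d_result);
//   cudaCheck(cudaGetLastError());        // catches launch-configuration errors
//   cudaCheck(cudaDeviceSynchronize());   // catches errors from the kernel itself
```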

For further optimization, you can play with the blocks-per-grid parameter (BLKS) and the threads-per-block parameter (nTPB); to some degree, the optimal values will depend on which GPU you are running on.

Robert Crovella
  • The type-generic functions `isinf()` and `isnan()`, as specified by the C99 and C++ standards, should work fine in device code; I don't think there is a need to drop down to the underlying type-specific device intrinsics. – njuffa Mar 30 '14 at 20:07

Your problem can be recast as a reduction operation, which can be implemented effectively using CUDA Thrust. Conceptually, you transform the original array into a boolean array using CUDA's isnan or isinf and then reduce the transformed array. All of that can be performed in a single pass by exploiting thrust::transform_reduce.

Below is an example, constructed around the one that Robert Crovella has already presented to you. The code below implements in CUDA the equivalent of Matlab's sum(isnan(array)).

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>   // thrust::plus
#include <cmath>                 // nanf
#include <cstdio>                // printf
#include <cstdlib>               // rand

#define DSIZEW 10000
#define DSIZEH 2000

// --- Operator for testing nan values
struct isnan_test { 
    __host__ __device__ bool operator()(const float a) const {
        return isnan(a);
    }
};

int main(){

    thrust::host_vector<float> h_data(DSIZEW*DSIZEH);
    for (int i=0; i<DSIZEH*DSIZEW; i++)
        h_data[i] = rand()/(float)RAND_MAX;

    const char type = '0';
    float my_nan = nanf(&type); // create a NAN value
    h_data[0] = my_nan;

    thrust::device_vector<float> d_data(h_data);

    const int h_result = thrust::transform_reduce(d_data.begin(), d_data.end(), isnan_test(), 0, thrust::plus<int>());
    printf("Result = %d\n",h_result);

    getchar();

    return 0;
}
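Since the question only needs a yes/no answer rather than a count, a shorter variant along the same lines (a sketch, untested) could use thrust::any_of, which maps directly onto the boolean question:

```cuda
#include <thrust/device_vector.h>
#include <thrust/logical.h>
#include <cmath>
#include <cstdio>

#define DSIZEW 10000
#define DSIZEH 2000

// Predicate covering both bad-value cases from the question.
struct isbad_test {
    __host__ __device__ bool operator()(const float a) const {
        return isnan(a) || isinf(a);
    }
};

int main(){
    thrust::device_vector<float> d_data(DSIZEW*DSIZEH, 1.0f);
    d_data[42] = nanf("");   // plant a single NaN

    bool h_result = thrust::any_of(d_data.begin(), d_data.end(), isbad_test());
    printf("Result = %d\n", h_result);   // 1: at least one bad entry present

    return 0;
}
```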
Vitality