There are intrinsics for this, but the C99 functions available in CUDA device code should be fine:
isnan()
To test for inf, you can use:
isinf()
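For instance, a minimal device-side predicate combining both checks might look like this (the helper name is mine, purely illustrative):
__device__ bool is_non_finite(float x){
  // both isnan() and isinf() are overloaded for device code in CUDA
  return isnan(x) || isinf(x);
}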
It's rarely faster to have multiple kernels do the same work as a single well-written kernel, so I'm not sure why you think a single kernel would be slow. This algorithm is likely to be memory-bound, so you want to focus on efficient read access, i.e. coalescing. In CUDA, an easy way to traverse a matrix is to have each thread handle a column. This can be implemented efficiently with a for-loop and results in perfectly coalesced reads.
Since you only care about a single result with no indices, we can have multiple threads write to a (boolean) result without atomics, for further efficiency: any threads that write to the result all write the same value, so the race is benign.
Another optimization strategy one might consider is an early exit. However, this improves average-case timing at the expense of the worst case, which it actually makes longer, so I would dispense with it unless average throughput is a big issue.
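For completeness, here is a sketch of what an early-exit variant of the kernel below could look like; it assumes the result flag was initialized to true, as in the worked example, and marks it volatile so threads can observe each other's write:
__global__ void isnan_test_early_exit(const float *data, int width, int height, volatile bool *result){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < width){
    for (int i = 0; i < height; i++){
      if (!*result) return;   // some thread already found a NaN; stop
      if (isnan(data[(i*width) + idx])) *result = false;
    }
    idx += gridDim.x*blockDim.x;
  }
}
The per-element poll of *result is exactly the extra work that makes the worst case slower.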
Here's a complete worked example (using the test for NaN as an example):
$ cat t383.cu
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define DSIZEW 10000
#define DSIZEH 2000
#define nTPB 256
#define BLKS 16

// grid-stride loop over columns: each thread walks down one column at a time,
// so adjacent threads read adjacent elements and loads are perfectly coalesced
__global__ void isnan_test(float *data, int width, int height, bool *result){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < width){
    for (int i = 0; i < height; i++)
      if (isnan(data[(i*width) + idx])) *result = false;
    idx += gridDim.x*blockDim.x;   // stride is the total thread count, not a sum
  }
}

int main(){

  float *d_data, *h_data;
  bool *d_result, h_result = true;
  const char type[] = "0";   // nanf() expects a NUL-terminated string
  cudaMalloc((void **)&d_data, sizeof(float)*DSIZEW*DSIZEH);
  cudaMalloc((void **)&d_result, sizeof(bool));
  h_data = (float *)malloc(sizeof(float)*DSIZEW*DSIZEH);
  for (int i = 0; i < DSIZEH*DSIZEW; i++)
    h_data[i] = rand()/(float)RAND_MAX;   // floats in [0,1], guaranteed NaN-free
  cudaMemcpy(d_data, h_data, sizeof(float)*DSIZEW*DSIZEH, cudaMemcpyHostToDevice);
  cudaMemcpy(d_result, &h_result, sizeof(bool), cudaMemcpyHostToDevice);
  isnan_test<<<BLKS,nTPB>>>(d_data, DSIZEW, DSIZEH, d_result);
  cudaMemcpy(&h_result, d_result, sizeof(bool), cudaMemcpyDeviceToHost);
  if (!h_result) {printf("error in no-NAN check\n"); return 1;}
  float my_nan = nanf(type); // create a NaN value
  cudaMemcpy(d_data, &my_nan, sizeof(float), cudaMemcpyHostToDevice);
  isnan_test<<<BLKS,nTPB>>>(d_data, DSIZEW, DSIZEH, d_result);
  cudaMemcpy(&h_result, d_result, sizeof(bool), cudaMemcpyDeviceToHost);
  if (h_result) {printf("error in NAN check\n"); return 1;}
  printf("Success\n");
  return 0;
}
$ nvcc -arch=sm_20 -o t383 t383.cu
$ ./t383
Success
$
Note that I have dispensed with proper CUDA error checking for clarity/brevity, but it is always recommended.
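One common pattern for that checking (a sketch, not part of the listing above) is a macro wrapped around cudaGetLastError():
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
      exit(1); \
    } \
  } while (0)
You would then follow each API call or kernel launch with e.g. cudaCheckErrors("kernel launch fail");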
For further optimization, you can play with the blocks-per-grid parameter (BLKS) and the threads-per-block parameter (nTPB); to some degree, however, the optimal values will depend on which GPU you are running on.
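If you want a programmatic starting point, the runtime's occupancy API (cudaOccupancyMaxPotentialBlockSize, available in CUDA 6.5 and later) can suggest launch parameters for this kernel; a sketch:
int minGridSize = 0, blockSize = 0;
// ask the runtime for a block size that maximizes occupancy for isnan_test
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, isnan_test, 0, 0);
isnan_test<<<minGridSize, blockSize>>>(d_data, DSIZEW, DSIZEH, d_result);
Treat the returned values as a heuristic; benchmarking on the target GPU is still the way to settle on final values.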