What I want is a run-time "do I have a GPU?" switch so I can take one code path or the other.
I believe this should be pretty straightforward.
The typical approach would be to:

1. Link your code statically against the CUDA Runtime library (cudart). If you compile with nvcc, this is the default behavior.

2. Near the beginning of your code (presumably), make a CUDA Runtime API call such as cudaGetDevice(). Use some form of proper CUDA error checking (always a good idea, anyway). In this case we will use the error return from this first runtime API call to make our path decision (as opposed to simply terminating the application). A minimal sketch of such a check appears right after this list.

3. If the runtime API call in step 2 returns cudaSuccess (as the functional return value, not the device index), then it is safe to assume that there is at least one functional CUDA GPU. In that case, further inspection of the environment could be done if desired/needed, perhaps following a sequence similar to the CUDA deviceQuery sample code. This status could be stored in your program for future decisions about which code paths to follow.

4. If the runtime API call in step 2 returns anything other than cudaSuccess, it almost certainly means that CUDA is non-functional, perhaps because there is no CUDA GPU. In that case, I'd advise against any further use of any CUDA API or library; from there on, your code should use host-only code paths.
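For reference, here is a minimal sketch of the check described in steps 2 through 4. The helper name has_cuda_gpu() and the choice of cudaGetDeviceCount() are just illustrative choices of mine; any early runtime API call's error return can be used the same way:

#include <cuda_runtime.h>
#include <iostream>

// Returns true only if the CUDA runtime reports at least one usable device.
static bool has_cuda_gpu() {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    // Any error here (no driver, no device, ...) means: take the host-only path.
    std::cerr << "CUDA unavailable: " << cudaGetErrorString(err) << std::endl;
    return false;
  }
  return count > 0;
}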
Here is a fully worked example. It uses the CUFFT library to perform a simple FFT operation if a functional CUDA environment is found; otherwise it uses FFTW to do the same thing in host code. Note that in addition to statically linking against the cudart library (the default with nvcc, so not obvious), I am also statically linking against the CUFFT library. At least on Linux, as in the example here, this prevents failures at application launch time due to an inability to find dynamic libraries to link against (which would prevent the application from running at all; whereas our intent is that the application still runs, but chooses host code paths).
$ cat t467.cu
#include <cufft.h>
#include <fftw.h>
#include <iostream>
int main(){

  double data[] = {0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0};
  int N = sizeof(data)/sizeof(data[0]);
  int dev = 0;
  if (cudaGetDevice(&dev) == cudaSuccess) {
    // GPU code path: CUFFT forward transform, then squared magnitude
    cufftDoubleComplex *din, *dout, *in, *out;
    in  = new cufftDoubleComplex[N];
    out = new cufftDoubleComplex[N];
    for (int i = 0; i < N; i++) {in[i].x = data[i]; in[i].y = 0.0;}
    // error checking of err/cstat elided for brevity in this demo
    cudaError_t err = cudaMalloc(&din,  sizeof(din[0]) * N);
    err = cudaMalloc(&dout, sizeof(din[0]) * N);
    cufftHandle plan;
    cufftResult cstat = cufftPlan1d(&plan, N, CUFFT_Z2Z, 1);
    cudaMemcpy(din, in, N*sizeof(din[0]), cudaMemcpyHostToDevice);
    cstat = cufftExecZ2Z(plan, din, dout, CUFFT_FORWARD);
    cudaMemcpy(out, dout, N*sizeof(din[0]), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) data[i] = out[i].x * out[i].x + out[i].y * out[i].y;
    cudaFree(din); cudaFree(dout);
    delete[] in; delete[] out;
    cufftDestroy(plan);
    std::cout << "GPU calculation: " << std::endl;
  }
  else {
    // CPU code path: FFTW (2.x API) forward transform, then squared magnitude
    fftw_complex *in, *out;
    fftw_plan p;
    in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    for (int i = 0; i < N; i++) {in[i].re = data[i]; in[i].im = 0;}
    p = fftw_create_plan(N, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_one(p, in, out);
    fftw_destroy_plan(p);
    for (int i = 0; i < N; i++) data[i] = out[i].re * out[i].re + out[i].im * out[i].im;
    fftw_free(in); fftw_free(out);
    std::cout << "CPU calculation: " << std::endl;
  }
  for (int i = 0; i < N; i++)
    std::cout << data[i] << ", ";
  std::cout << std::endl;
  return 0;
}
$ nvcc t467.cu -o t467 -lcufft_static -lculibos -lfftw -lm
$ ./t467
GPU calculation:
0, 0, 16, 0, 0, 0, 16, 0,
$ CUDA_VISIBLE_DEVICES="" ./t467
CPU calculation:
0, 0, 16, 0, 0, 0, 16, 0,
$
Note that the above example still links dynamically against FFTW, so your execution environment (both CPU and GPU) needs to have an appropriate fftwX.so library available. The general process of making a Linux executable work in a variety of settings (outside of CUDA dependencies) is beyond the scope of this example or what I intend to answer, but on Linux, ldd is your friend.
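For example, you can inspect the remaining dynamic dependencies of the binary built above with (the exact output depends on your system):
$ ldd ./t467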