
I feel like there has to be a way to write code such that it can run either in CPU or GPU. That is, I want to write something that has (for example) a CPU FFT implementation that can be executed if there is no GPU, but defaults to a GPU FFT when the GPU is present. I haven't been able to craft the right question to get the interwebs to offer up a solution.

My application target has GPUs available. We want to write certain functions to use the GPUs. However, our development VMs are a different story. It seems very desirable to be able to run a code/unit-test cycle without having to jump to GPU hardware.

If I need to do some clever run-time checking/library loading, I'm OK with that; I just need a cookbook.

How do people do continuous integration of GPU-enabled code?

Target environment is NVIDIA/CUDA. I'm new to GPU code, so maybe this is an FAQ (but I haven't found it yet).

jwm
  • "such that it can run either in CPU or GPU". I think you need to elaborate by what you mean by that – talonmies May 15 '19 at 19:47
  • updated. I hope that's clearer – jwm May 15 '19 at 19:50
  • Surely with enough abstraction you can get to an agnostic interface for functionality that would utilize a different "back end" depending on the platform? That of course means two back ends, not code "such that it can run either in CPU or GPU", and it means your unit tests can't exercise the whole code base – talonmies May 15 '19 at 19:58
  • I'm looking for information on how to do the abstraction. A brief look at OpenCL implies that it provides this abstraction, but we have a vested interest in staying with CUDA – jwm May 15 '19 at 20:01
  • You keep writing things which seem to imply something like "How can I run CUDA code without a GPU". You can't do that and abstraction doesn't mean that either – talonmies May 15 '19 at 20:18
  • I apologize for not having the educated vocabulary. Obviously, since I'm asking the question, I don't know what I'm talking about and I'm looking for help in learning. What I want is a run-time "do I have a GPU?" switch so I can take one code path or the other – jwm May 15 '19 at 20:35
  • It's indeed not entirely clear what you imagine. Of course, you can add some `if (cuda) runCuda(); else runCpu();` in your `main` and go down completely different code paths. If you want to have a set of library functions, you could define an interface and implement it in two different ways. And as you noticed: OpenCL allows *the same kernel code* to be "executed" on any device (GPU or CPU). But let's face it: NVIDIA is definitely not interested in offering a "CUDA for CPUs". They want to sell their GPUs. – Marco13 May 15 '19 at 23:25
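
A minimal sketch of the "interface with two back ends" idea mentioned in the comments above (all type and function names here are illustrative, not from the thread):

#include <cuda_runtime.h>
#include <iostream>
#include <memory>

// Abstract front end; each back end supplies its own implementation.
struct FftBackend {
  virtual ~FftBackend() = default;
  virtual const char* name() const = 0;
  // a real interface would expose forward()/inverse(), etc.
};

struct CudaFftBackend : FftBackend {
  const char* name() const override { return "CUDA (CUFFT)"; }
};

struct CpuFftBackend : FftBackend {
  const char* name() const override { return "CPU (FFTW)"; }
};

// Probe for a usable CUDA device once, at startup, and pick a back end.
std::unique_ptr<FftBackend> makeFftBackend() {
  int dev = 0;
  if (cudaGetDevice(&dev) == cudaSuccess)
    return std::make_unique<CudaFftBackend>();
  return std::make_unique<CpuFftBackend>();
}

int main() {
  auto fft = makeFftBackend();
  std::cout << "Using back end: " << fft->name() << std::endl;
  return 0;
}

Compiled with nvcc (e.g. nvcc -std=c++14 backend.cu), the same binary then selects a code path at run time.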

1 Answer


What I want is a run-time "do I have a GPU?" switch so I can take one code path or the other

I believe this should be pretty straightforward.

The typical approach would be to:

  1. Link your code statically against the CUDA runtime library (cudart). If you compile with nvcc, this is the default behavior.

  2. (Presumably) near the beginning of your code, issue a CUDA runtime API call such as cudaGetDevice(). Use some form of proper CUDA error checking (always a good idea, anyway; a minimal helper is sketched after this list). In this case we will use the error return from this first runtime API call to make our path decision (as opposed to simply terminating the application).

  3. If the runtime API call in step 2 above returns cudaSuccess (as the functional return value, not the device index), then it is safe to assume that there is at least one functional CUDA GPU. In that case, further inspection of the environment could be done if desired/needed, perhaps following a sequence similar to the CUDA deviceQuery sample code. This status could be stored in your program for future decision making about code paths to follow.

  4. If the runtime API call in step 2 returns anything other than cudaSuccess, it almost certainly means that CUDA is non-functional, perhaps because there is no CUDA GPU. In that case, I'd advise against any further use of any CUDA API or library, and from there on your code should use host-only code paths.
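
Step 2 recommends proper CUDA error checking; the worked example below abbreviates it for readability, but a small helper along these lines (a sketch, not part of the original answer) keeps the fallback decision explicit:

#include <cstdio>
#include <cuda_runtime.h>

// Report CUDA failures instead of aborting, so the caller can choose
// to fall back to a host-only code path.
inline bool cudaCheck(cudaError_t err, const char* what) {
  if (err != cudaSuccess) {
    fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return false;
  }
  return true;
}

It would be used as, e.g., if (!cudaCheck(cudaMalloc(&din, bytes), "cudaMalloc")) { /* take the host path */ }.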

Here is a fully worked example. It uses the CUFFT library to perform a simple FFT operation if a functional CUDA environment is found; otherwise it uses FFTW to do the same thing in host code. Note that in addition to statically linking against the cudart library (the default with nvcc, so not obvious), I am also statically linking against the CUFFT library. At least on Linux, as in the example here, static linking prevents failures at application launch time due to an inability to find the dynamic libraries; such a failure would prevent the application from running at all, whereas the intent here is that the application runs but chooses the host code path.

$ cat t467.cu
#include <cufft.h>
#include <fftw.h>
#include <iostream>

int main(){

  double data[] = {0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0};
  int N = sizeof(data)/sizeof(data[0]);
  int dev = 0;
  if (cudaGetDevice(&dev) == cudaSuccess) {
    // GPU code path
    cufftDoubleComplex *din, *dout, *in, *out;
    in  = new cufftDoubleComplex[N];
    out = new cufftDoubleComplex[N];
    for (int i = 0; i < N; i++) { in[i].x = data[i]; in[i].y = 0; } // zero the imaginary parts too
    // (error checking abbreviated for brevity; see step 2 above)
    cudaError_t err = cudaMalloc(&din,  sizeof(din[0]) * N);
                err = cudaMalloc(&dout, sizeof(din[0]) * N);
    cufftHandle plan;
    cufftResult cstat = cufftPlan1d(&plan, N, CUFFT_Z2Z, 1);
    cudaMemcpy(din, in, N*sizeof(din[0]), cudaMemcpyHostToDevice);
    cstat = cufftExecZ2Z(plan, din, dout, CUFFT_FORWARD);
    cudaMemcpy(out, dout, N*sizeof(din[0]), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) data[i] = out[i].x * out[i].x + out[i].y * out[i].y;
    cudaFree(din); cudaFree(dout);
    delete[] in;  delete[] out;
    cufftDestroy(plan);
    std::cout << "GPU calculation: " << std::endl;
    }
  else {
    // CPU code path
    fftw_complex *in, *out;
    fftw_plan p;
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    for (int i = 0; i < N; i++) { in[i].re = data[i]; in[i].im = 0; }
    p = fftw_create_plan(N, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_one(p, in, out);
    fftw_destroy_plan(p);
    for (int i = 0; i < N; i++) data[i] = out[i].re * out[i].re + out[i].im * out[i].im;
    fftw_free(in); fftw_free(out);
    std::cout << "CPU calculation: " << std::endl;
    }
  for (int i = 0; i < N; i++)
    std::cout << data[i] << ", ";
  std::cout << std::endl;
  return 0;
}
$ nvcc t467.cu -o t467 -lcufft_static -lculibos -lfftw -lm
$ ./t467
GPU calculation:
0, 0, 16, 0, 0, 0, 16, 0,
$ CUDA_VISIBLE_DEVICES="" ./t467
CPU calculation:
0, 0, 16, 0, 0, 0, 16, 0,
$

Note that the above example still links dynamically against fftw, so your execution environment (both CPU and GPU) needs to have an appropriate fftwX.so library available. The general process of making a Linux executable work in a variety of settings (outside of CUDA dependencies) is beyond the scope of this example or what I intend to answer. On Linux, ldd is your friend.
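
For example, a hypothetical check on the binary built above (the output, which lists each shared library the loader must resolve at launch, varies by system):

$ ldd ./t467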

Robert Crovella