In the lab we have a heterogeneous cluster: many Intel CPUs, a few AMD CPUs, and a couple of Nvidia GPUs.

For HPC development, the one thing I know I could write once and run everywhere on this setup is OpenCL (not even Java ;) ). But here in the lab we are very used to developing in C or Fortran plus MPI, running entirely on CPU, and maybe, rarely, someone needs the Nvidia node to run something in CUDA.

Now, at the start of a new project, I thought it would be very nice to code things in MPI + OpenCL and to include both GPUs and CPUs in the distributed processing, running the same OpenCL code. So, is this advisable? Are OpenCL implementations ready for such a task? With OpenCL code running on a CPU with Intel's SDK, can I count on performance as good as that of a multithreaded C program compiled with Intel's compiler? Can you point to comparisons and benchmarks?
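For concreteness, here is a rough sketch of the setup I have in mind (the device-selection logic is hypothetical and error checking is omitted): each MPI rank would pick whatever OpenCL device its node offers and run the same kernel source.

    /* Sketch only: each rank prefers a GPU and falls back to the CPU.
     * Real code would iterate over all platforms, since one platform
     * (e.g. Nvidia's) may not expose both device types. */
    #include <mpi.h>
    #include <CL/cl.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);

        cl_device_id device;
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL)
                != CL_SUCCESS)
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

        /* ... build the shared kernel source, enqueue work, exchange
         *     partial results via MPI collectives ... */

        clReleaseContext(ctx);
        MPI_Finalize();
        return 0;
    }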

lvella

3 Answers

OpenCL is portable, but it is not performance portable. You should not expect OpenCL to be write-once-run-fast-everywhere. OpenCL code written for GPUs may run poorly on CPUs and I would not expect uniform performance across GPUs, particularly ones from different vendors.

To answer your specific question, based upon numerous third-party evaluations, no, I would not expect OpenCL to beat well-written C+OpenMP on Intel CPUs. There are a number of reasons for this.

Please recognize that the error bars on my answer are extremely large due to the broad nature of the question. It would be helpful to have more detail on what you intend to compute.

Jeff Hammond

I've had good luck porting my OpenCL code across CPU and GPU. My project was a Levenberg-Marquardt solver, which I wrote entirely in C first to debug it, then ported to OpenCL on an Intel CPU to check the results and do a little more debugging, and then to OpenCL on an AMD GPU.

The best trick I've found for writing good OpenCL code across devices is to buffer global memory into local memory even when running on a CPU, since global memory access is usually the bottleneck on the GPU. The second bottleneck I found on GPU vs. CPU was kernel size: the CPU can handle larger kernels than the GPU, so mind the type of memory you use for constants, how much local memory is allocated, and so on.
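As a hedged illustration of that staging trick (the kernel and the work split are made up for the example), the pattern looks roughly like this:

    /* Each work-group copies its tile of the input from global to local
     * memory once, then works out of the local buffer. On a CPU the
     * "local" buffer is just ordinary memory, so this costs little there
     * while helping a lot on GPUs. The tile is sized to the work-group
     * via clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL). */
    __kernel void scale(__global const float *in,
                        __global float *out,
                        __local float *tile,
                        const float factor)
    {
        const size_t gid = get_global_id(0);
        const size_t lid = get_local_id(0);

        tile[lid] = in[gid];              /* one coalesced global read  */
        barrier(CLK_LOCAL_MEM_FENCE);     /* wait for the whole tile    */

        /* ... arbitrary work reading tile[] instead of in[] ... */
        out[gid] = tile[lid] * factor;
    }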

It has been about 6 months, so maybe it has been fixed since, but the AMD FFT worked great on an Intel CPU and GPU and on an AMD GPU, yet didn't work on an NVIDIA GPU. The AMD forums had a thread attributing this to NVIDIA not supporting some of the vector features.

Austin

In addition to the other answers, and to emphasize one important point again: the question is very broad, and the performance will depend on many factors that you did not mention. You might already be aware of these factors, but if in doubt, a summary may be found in this answer (the question might seem unrelated at first glance, and refers to CUDA, but many of the concepts apply to OpenCL as well).

One of the main driving ideas behind OpenCL is heterogeneous computing (remarkably, the page does not even mention OpenCL...). That is, OpenCL aims to let the developer exploit all available processing resources, ranging from a single ARM core to multiple high-end GPUs with thousands of cores.

This versatility comes at a cost. Certain concepts are implicitly tailored to many-core architectures (or at least, that seems to have been the main application area so far). And in any case, "optimizing" an OpenCL program often just means "tweaking it to run particularly fast on one particular architecture". Things like vectorization or shared memory may be advantageous on one platform and simply unavailable on another.

There are some ways to circumvent this, or at least to make an OpenCL program more "agnostic" of the hardware it will run on. One obvious option is to query the target platform's properties (for example, preferred vector sizes, or whether shared memory — local memory, in OpenCL terms — is available) and launch different kernels depending on the result. Thanks to the built-in compiler in OpenCL, it is even possible to inject platform-specific optimizations, for example via #defines, into the kernel source code. However, it is hard to make general statements about the effort-to-performance-gain ratio of this kind of optimization. And it is even harder to predict whether the possibly reduced performance of a "generic" OpenCL implementation (compared to a perfectly tweaked C implementation) will not sooner or later be compensated, as core counts increase and the OpenCL compilers improve.
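As a rough sketch of that query-and-#define approach (the helper function, the chosen properties, and the symbol names are only examples, not a recommended set), the host side could look like this:

    /* Query a few device properties and hand them to the OpenCL built-in
     * compiler as #defines, so a single kernel source can specialize
     * itself per device (e.g. with #if LOCAL_MEM_BYTES >= 32768). */
    #include <stdio.h>
    #include <CL/cl.h>

    static void build_for_device(cl_program program, cl_device_id device)
    {
        cl_uint vec_width = 0;
        cl_ulong local_mem = 0;

        clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                        sizeof vec_width, &vec_width, NULL);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof local_mem, &local_mem, NULL);

        char options[128];
        snprintf(options, sizeof options,
                 "-DVEC_WIDTH=%u -DLOCAL_MEM_BYTES=%llu",
                 vec_width, (unsigned long long)local_mem);

        clBuildProgram(program, 1, &device, options, NULL, NULL);
    }

The kernel source can then branch on VEC_WIDTH (say, float4 vs. scalar code paths) while keeping one source file for all devices.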

So my recommendation would be to run some benchmarks of "representative" tasks and see whether the performance is competitive across the different devices, keeping in mind that the average number of cores per device (and, most likely, the general heterogeneity of the devices) will increase, and that OpenCL might make it easier to adapt to these changes.

Marco13