
I've been looking into libraries/extensions for C++ that will allow GPU-based processing at a high level. I'm not an expert in GPU programming and I don't want to dig too deep. I have a neural network consisting of classes with virtual functions. I need a library that basically does the GPU allocation for me - at a high level. There is a guy who wrote a thesis on a system called GPU++, which does most of the GPU stuff for you. I can't find the code anywhere, just his thesis.

Does anyone know of a similar library, or does anyone have the code for GPU++? Libraries like CUDA are too low-level and can't handle most of my operations (at least not without rewriting all my processes and algorithms, which I don't want to do).

goocreations
  • Something like this maybe? http://viennacl.sourceforge.net/viennacl-examples-vector.html – stardust May 08 '13 at 10:23
  • OpenACC http://www.openacc-standard.org/ or Thrust https://developer.nvidia.com/thrust ? – ShPavel May 08 '13 at 10:25
  • You can try [arrayfire](http://www.accelereyes.com/products/arrayfire), or [OpenCV GPU Module](http://opencv.org/) – sgarizvi May 08 '13 at 11:15
  • Voting to close as tool rec. – Ciro Santilli OurBigBook.com Apr 06 '16 at 19:35
  • @CiroSantilli新疆改造中心法轮功六四事件 It's still a good question, though it's on the wrong site. Can this question be migrated to [softwarerecs](https://softwarerecs.stackexchange.com/), or will it stay closed forever? – Anderson Green Oct 19 '19 at 17:05
  • @AndersonGreen I actually changed my philosophy since then; I now believe we should never ever close anything. It is not possible to migrate after 6 months I believe; the only option is to open a new question. softwarerecs will likely accept it. – Ciro Santilli OurBigBook.com Oct 19 '19 at 22:04

8 Answers


There are many high-level libraries dedicated to GPGPU programming. Since they rely on CUDA and/or OpenCL, they have to be chosen wisely (a CUDA-based program will not run on AMD's GPUs, unless it goes through a pre-processing step with projects such as gpuocelot).

CUDA

You can find some examples of CUDA libraries on the NVIDIA website.

  • Thrust: the official description speaks for itself

Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs. Interoperability with established technologies (such as CUDA, TBB, and OpenMP) facilitates integration with existing software.

As @Ashwin pointed out, the STL-like syntax of Thrust makes it a widely chosen library when developing CUDA programs. A quick look at the examples shows the kind of code you will be writing if you decide to use this library. NVIDIA's website presents the key features of this library. A video presentation (from GTC 2012) is also available.
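
To give a flavor of that style, here is a minimal saxpy sketch in the spirit of those examples (error checking omitted); Thrust generates and launches the kernel behind the scenes:

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// functor applied element-wise on the device: y <- a*x + y
struct saxpy_functor
{
    const float a;
    saxpy_functor(float a) : a(a) {}
    __host__ __device__ float operator()(const float& x, const float& y) const
    {
        return a * x + y;
    }
};

int main()
{
    // device_vector allocates GPU memory and handles the transfers
    thrust::device_vector<float> x(1 << 20, 1.0f);
    thrust::device_vector<float> y(1 << 20, 2.0f);

    // one call replaces the usual allocate/copy/launch boilerplate
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(2.0f));

    return 0;
}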

  • CUB: the official description tells us:

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model. It is a flexible library of cooperative threadblock primitives and other utilities for CUDA kernel programming.

It provides device-wide, block-wide and warp-wide parallel primitives such as parallel sort, prefix scan, reduction, histogram, etc.

It is open-source and available on GitHub. It is not high-level from an implementation point of view (you develop in CUDA kernels), but provides high-level algorithms and routines.
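
For instance, a device-wide sum follows CUB's characteristic two-phase pattern, in which the first call only computes the amount of temporary storage required (a minimal sketch, error checking omitted):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// device-wide sum of d_in[0..num_items) written to d_out[0]
void sum_with_cub(int* d_in, int* d_out, int num_items)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // first call: query the required temporary storage size
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // second call: perform the reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaFree(d_temp_storage);
}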

  • mshadow: lightweight CPU/GPU matrix/tensor template library in C++/CUDA.

This library is mostly used for machine learning, and relies on expression templates.

Starting from Eigen 3.3, it is possible to use Eigen's objects and algorithms within CUDA kernels. However, only a subset of features is supported, to make sure that no dynamic allocation is triggered within a CUDA kernel.
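
As a rough sketch of what this enables (assuming the file is compiled with nvcc, so that Eigen's device annotations are active), each thread can manipulate small fixed-size Eigen types, which involve no dynamic allocation:

#include <Eigen/Dense>

// each thread normalizes one 3-vector; fixed-size Eigen types
// live in registers/local memory, so nothing is heap-allocated
__global__ void normalize3(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Eigen::Vector3f v(in[3 * i], in[3 * i + 1], in[3 * i + 2]);
    v.normalize();

    out[3 * i]     = v.x();
    out[3 * i + 1] = v.y();
    out[3 * i + 2] = v.z();
}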

OpenCL

Note that OpenCL does more than GPGPU computing, since it supports heterogeneous platforms (multi-core CPUs, GPUs etc.).

  • OpenACC: this project provides OpenMP-like support for GPGPU. A large part of the programming is done implicitly by the compiler and the run-time API. You can find a sample code on their website.

The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators.
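
A minimal sketch of the directive-based style: the compiler derives both the device code and the data transfers from the pragma alone (error checking omitted):

// offload a vector addition to the accelerator with one directive
void vec_add(int n, const float* a, const float* b, float* c)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}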

  • Bolt: open-source library with STL-like interface.

Bolt is a C++ template library optimized for heterogeneous computing. Bolt is designed to provide high-performance library implementations for common algorithms such as scan, reduce, transform, and sort. The Bolt interface was modeled on the C++ Standard Template Library (STL). Developers familiar with the STL will recognize many of the Bolt APIs and customization techniques.
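
A minimal sketch of the STL-like usage; note that bolt::cl::sort accepts plain std::vector iterators, so the device transfers are handled by the library:

#include <bolt/cl/sort.h>

#include <algorithm>
#include <cstdlib>
#include <vector>

int main()
{
    // random data on the host
    std::vector<int> a(8192);
    std::generate(a.begin(), a.end(), rand);

    // sorted on the GPU; Bolt takes care of the transfers
    bolt::cl::sort(a.begin(), a.end());
    return 0;
}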

  • Boost.Compute: as @Kyle Lutz said, Boost.Compute provides an STL-like interface for OpenCL. Note that this is not an official Boost library (yet).

  • SkelCL "is a library providing high-level abstractions for alleviated programming of modern parallel heterogeneous systems". This library relies on skeleton programming, and you can find more information in their research papers.

CUDA + OpenCL

  • ArrayFire is an open-source (used to be proprietary) GPGPU programming library. They first targeted CUDA, but now support OpenCL as well. You can check the examples available online. NVIDIA's website provides a good summary of its key features.
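
A minimal sketch of the array-based style (error checking omitted): arrays live on the device, and element-wise operations and reductions need no kernel code:

#include <arrayfire.h>
#include <cstdio>

int main()
{
    // 1000x1000 matrix of uniform random values, allocated on the device
    af::array a = af::randu(1000, 1000);

    // element-wise math and a full reduction, all without writing a kernel
    af::array b = af::sin(a) + 0.5f;
    float total = af::sum<float>(b);

    printf("sum = %f\n", total);
    return 0;
}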

Complementary information

Although this is not really in the scope of this question, the same kind of support also exists for other programming languages.

If you need to do linear algebra (for instance) or other specific operations, dedicated math libraries are also available for CUDA and OpenCL (e.g. ViennaCL, CUBLAS, MAGMA etc.).

Also note that using these libraries does not prevent you from doing some low-level operations if you need to do some very specific computation.

Finally, we can mention the future of the C++ standard library. There has been extensive work to add parallelism support. This is still a technical specification, and GPUs are not explicitly mentioned AFAIK (although NVIDIA's Jared Hoberock, developer of Thrust, is directly involved), but the will to make this a reality is definitely there.

BenC
  • @goocreations Consider marking this the correct answer? – lmat - Reinstate Monica Jan 14 '14 at 15:17
  • I used Eigen to invert large matrices and it only runs 2x faster than single-threaded code – mathengineer Sep 13 '18 at 15:38
  • At the time of this writing, it looks like ArrayFire is one of the few libraries still being maintained, and it seems to be the most versatile. – Paschover Jul 05 '21 at 13:26
  • This needs to be updated with NVIDIA [MatX](https://github.com/NVIDIA/MatX) as well as all of the new AMD ROCm alternatives like [rocThrust](https://github.com/ROCmSoftwarePlatform/rocThrust). With Intel GPGPUs on the horizon, one might even include oneAPI stuff. Also, there are SYCL and Kokkos trying to bring more portability in general. – paleonix Feb 18 '22 at 12:37

The Thrust library provides containers, parallel primitives and algorithms. All of this functionality is nicely wrapped up in an STL-like syntax. So, if you are familiar with the STL, you can actually write entire CUDA programs using just Thrust, without having to write a single CUDA kernel. Have a look at the simple examples in the Quick Start Guide to see the kind of high-level programs you can write using Thrust.
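
For instance, a complete fill/transfer/sort/reduce pipeline needs nothing but Thrust calls; a minimal sketch of what "no kernels" means in practice:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

#include <cstdlib>
#include <iostream>

int main()
{
    // fill a host vector with random data
    thrust::host_vector<int> h(1 << 20);
    thrust::generate(h.begin(), h.end(), rand);

    // copying to a device_vector is the host-to-device transfer
    thrust::device_vector<int> d = h;

    // sort and reduce run on the GPU; no kernel is written anywhere
    thrust::sort(d.begin(), d.end());
    long long sum = thrust::reduce(d.begin(), d.end(), 0LL, thrust::plus<long long>());

    std::cout << "sum = " << sum << std::endl;
    return 0;
}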

Ashwin Nanjappa
  • Now this is what I actually was looking for. Thank you very much. – goocreations May 11 '13 at 08:09
  • @Ashwin, there is now a successor to Thrust called Bulk. Jared Hoberock gave a presentation on it. Do you have any opinion about it? From the presentation, it looked highly advanced. – The Vivandiere Nov 04 '14 at 18:45

Take a look at Boost.Compute. It provides a high-level, STL-like interface including containers like vector<T> and algorithms like transform() and sort().

It's built on OpenCL, allowing it to run on most modern GPUs and CPUs, including those by NVIDIA, AMD, and Intel.
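
A minimal sketch of a device-side sort, close to the example in the project's documentation (error checking omitted):

#include <boost/compute.hpp>

#include <algorithm>
#include <cstdlib>
#include <vector>

namespace compute = boost::compute;

int main()
{
    // pick the default compute device and set up a context/queue on it
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    // random data on the host
    std::vector<float> host_vector(10000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    // transfer to the device, sort there, copy back
    compute::vector<float> device_vector(host_vector.size(), context);
    compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);
    compute::sort(device_vector.begin(), device_vector.end(), queue);
    compute::copy(device_vector.begin(), device_vector.end(), host_vector.begin(), queue);

    return 0;
}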

Kyle Lutz

Another high-level library is VexCL -- a vector expression template library for OpenCL. It provides intuitive notation for vector operations and is available under the MIT license.
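
A minimal sketch of the notation; the vector expression on the last line is translated by the library into a single OpenCL kernel:

#include <vexcl/vexcl.hpp>

int main()
{
    const size_t n = 1024 * 1024;

    // use every device that supports double precision
    vex::Context ctx(vex::Filter::DoublePrecision);

    vex::vector<double> a(ctx, n), b(ctx, n), c(ctx, n);
    a = 1.0;
    b = 2.0;

    // one expression, one generated OpenCL kernel
    c = a + 2.0 * sin(b);

    return 0;
}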

ddemidov

If you're looking for higher-dimensional containers and the ability to pass and manipulate these containers in kernel code, I've spent the last few years developing the ecuda API to assist in my own scientific research projects (so it's been put through its paces). Hopefully it can fill a needed niche. A brief example of how it can be used (C++11 features are used here, but ecuda will work fine with pre-C++11 compilers):

#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <vector>

#include <ecuda/ecuda.hpp>

// kernel function
__global__
void calcColumnSums(
  typename ecuda::matrix<double>::const_kernel_argument mat,
  typename ecuda::vector<double>::kernel_argument vec
)
{
    const std::size_t t = threadIdx.x;
    auto col = mat.get_column(t);
    vec[t] = ecuda::accumulate( col.begin(), col.end(), static_cast<double>(0) );
}

int main( int argc, char* argv[] )
{

    // allocate 1000x1000 hardware-aligned device memory matrix
    ecuda::matrix<double> deviceMatrix( 1000, 1000 );

    // generate random values row-by-row and copy to matrix
    std::vector<double> hostRow( 1000 );
    for( std::size_t i = 0; i < 1000; ++i ) {
        for( double& x : hostRow ) x = static_cast<double>(rand())/static_cast<double>(RAND_MAX);
        ecuda::copy( hostRow.begin(), hostRow.end(), deviceMatrix[i].begin() );
    }

    // allocate device memory for column sums
    ecuda::vector<double> deviceSums( 1000 );

    CUDA_CALL_KERNEL_AND_WAIT(
        calcColumnSums<<<1,1000>>>( deviceMatrix, deviceSums )
    );

    // copy column sums to host and print
    std::vector<double> hostSums( 1000 );
    ecuda::copy( deviceSums.begin(), deviceSums.end(), hostSums.begin() );

    std::cout << "SUMS =";
    for( const double& x : hostSums ) std::cout << " " << std::fixed << x;
    std::cout << std::endl;

    return 0;

}

I wrote it to be as intuitive as possible (usually as simple as replacing std:: with ecuda::). If you know STL, then ecuda should do what you'd logically expect a CUDA-based C++ extension to do.

scottzed
  • Does ecuda avoid memory transfers in subsequent calls to functions? What is the performance? Does it hide the complex memory allocation and transfers? – mathengineer Sep 13 '18 at 15:31

The cpp-opencl project provides a way to make programming GPUs easy for the developer. It allows you to implement data parallelism on a GPU directly in C++ instead of using OpenCL.

Please see http://dimitri-christodoulou.blogspot.com/2014/02/implement-data-parallelism-on-gpu.html

And the source code: https://github.com/dimitrs/cpp-opencl

See the example below. The code in the parallel_for_each lambda function is executed on the GPU, and all the rest is executed on the CPU. More specifically, the “square” function is executed both on the CPU (via a call to std::transform) and the GPU (via a call to compute::parallel_for_each).

#include <algorithm>
#include <vector>

#include "ParallelForEach.h"

template<class T> 
T square(T x)  
{
    return x * x;
}

void func() {
  std::vector<int> In {1,2,3,4,5,6};
  std::vector<int> OutGpu(6);
  std::vector<int> OutCpu(6);

  compute::parallel_for_each(In.begin(), In.end(), OutGpu.begin(), [](int x){
      return square(x);
  });


  std::transform(In.begin(), In.end(), OutCpu.begin(), [](int x) {
    return square(x);
  });

  // do something with OutCpu and OutGpu ...

}

int main() {
  func();
  return 0;
}
Dimitri

The new OpenMP version 4 includes accelerator offload support, and AFAIK GPUs are considered accelerators.
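
A minimal sketch of a target region, assuming a compiler built with offload support; the map clauses describe the host/device data movement:

// offload a loop to the accelerator with OpenMP 4.0 target directives
void vec_add(int n, const float* a, const float* b, float* c)
{
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}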

Pietro

C++ AMP is the answer you are looking for.
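
For completeness, a minimal sketch of the C++ AMP style (it ships with Visual C++ on Windows; other platforms need a compiler implementing the open specification):

#include <amp.h>
#include <vector>

// element-wise vector addition offloaded through C++ AMP
void add(const std::vector<int>& a, const std::vector<int>& b, std::vector<int>& c)
{
    using namespace concurrency;

    array_view<const int, 1> av(static_cast<int>(a.size()), a);
    array_view<const int, 1> bv(static_cast<int>(b.size()), b);
    array_view<int, 1> cv(static_cast<int>(c.size()), c);
    cv.discard_data();  // the old contents of c need not be copied to the device

    parallel_for_each(cv.extent, [=](index<1> i) restrict(amp) {
        cv[i] = av[i] + bv[i];
    });
    cv.synchronize();   // copy the result back into c
}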

isti_spl