I wrote some toy code to test a few ideas:

#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <iterator>  // std::ostream_iterator
#include <cstdio>    // printf
#include <array>
#include <vector>

#define N 20

struct func {
        __host__ __device__
        float operator()(float x) { return x*2; }
};

template <typename S>
struct O {
        const std::array<float,2> a;
        O(std::array<float,2> a): a(a) {}

        S f;
        __host__ __device__
        float operator()(float &v) {
                std::array<int,3> b = {2,3,4};
                int tmp;
                for (int i=0; i<3; i++) {
                        tmp = thrust::reduce(thrust::device,b.begin(),b.end(),0);
                        printf("%d",tmp);
                }
                return a[0]*v + a[1] + f(a[0]);
        }
};

int main(void) {

        thrust::host_vector<float> _v1(N);
        thrust::device_vector<float> v1 = _v1, v2;
        thrust::fill(v1.begin(),v1.end(),12);
        v2.resize(N);

        std::array<float,2> a{1,2};
        auto c_itor = thrust::make_counting_iterator(0);
        thrust::transform(v1.begin(),v1.end(),v2.begin(),O<func>(a));

        thrust::copy(v2.begin(),v2.end(),std::ostream_iterator<float>(std::cout," "));

}

This code runs perfectly when compiled with nvcc --expt-relaxed-constexpr -std=c++17. As you can see, std containers such as std::array appear inside a __host__ __device__ functor. What I want to know is:

  1. Is this style legitimate (in terms of efficiency, not syntactic validity)?
  2. Since the code runs correctly, where are the std objects stored (device or host)?
paleonix
batman216
  • "This code runs perfectly." On godbolt, this gives warnings during compilation, and the output when running seems wrong as well (all zeros). No idea how you arrived at that conclusion. The member functions of C++ STL containers are not marked `__device__`. Or are you using `nvc++`? In that case I don't see a problem for `std::array`, as it is just a wrapper around a static C array, which lands in registers or local memory. Other STL containers will not be as straightforward, though. – paleonix May 17 '23 at 11:45
  • As you are not using the result of `thrust::reduce`, it might be optimized away by the compiler, especially if you use a recent version of Thrust, where `thrust::device` in device code will be handled like `thrust::seq` (i.e. avoiding CUDA Dynamic Parallelism). – paleonix May 17 '23 at 11:48
  • @paleonix I am using nvcc, not nvc++, and the `--expt-relaxed-constexpr` option is needed to avoid warnings. – batman216 May 17 '23 at 12:00
  • @paleonix Yes, if I forget the `--expt-relaxed-constexpr` option and ignore all the warnings, the results are zeros. In this case, the code inside the for loop does not work (the `printf` line does not print anything). – batman216 May 17 '23 at 12:05
  • @paleonix I modified the code, the result of `thrust::reduce` is used now. – batman216 May 17 '23 at 12:12
  • Also the compiler still seems to optimize away the loop, i.e. the compiler ignores that `printf` has side effects (that might be a bug). When you add `tmp` to the returned value you will get the expected `printf`s. – paleonix May 17 '23 at 12:59
  • That's strange, maybe it is because I'm using the newest version of hpc-sdk?
  • The warnings were there because I didn't specify `-std=c++17`, I think. Now they are gone. Funny that it still worked. – paleonix May 17 '23 at 13:08

1 Answer


The special case of using std::array with C++17 or higher and --expt-relaxed-constexpr works because std::array is a very thin wrapper around a C-style array and with C++17 all member functions that you used are constexpr. I think all member functions but std::array::fill and std::array::swap are constexpr by C++17. These two got the constexpr treatment with C++20.
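As a rough illustration of the constexpr point, the sketch below checks at compile time that the std::array members used in the question (operator[], size(), begin(), end()) are usable in constant expressions under C++17. It is plain host C++ and does not involve nvcc; the names sum_indexed, sum_iter and kB are made up for this example.

```cpp
#include <array>
#include <cstddef>

// Sums via operator[] and size(); both are constexpr for const arrays in C++17.
constexpr int sum_indexed(const std::array<int, 3>& b) {
    int s = 0;
    for (std::size_t i = 0; i < b.size(); ++i) s += b[i];
    return s;
}

// Sums via begin()/end(), which only became constexpr with C++17.
constexpr int sum_iter(const std::array<int, 3>& b) {
    int s = 0;
    for (auto it = b.begin(); it != b.end(); ++it) s += *it;
    return s;
}

// Same values as the array `b` in the question (hypothetical name kB).
constexpr std::array<int, 3> kB = {2, 3, 4};

// If any member used above were not constexpr, these would fail to compile.
static_assert(sum_indexed(kB) == 9, "operator[] and size() are constexpr");
static_assert(sum_iter(kB) == 9, "begin() and end() are constexpr in C++17");
```

Because these members are constexpr, nvcc with --expt-relaxed-constexpr accepts calls to them from device code even though they carry no __device__ annotation.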

So for performance considerations your code should perform the same as when using float a[2] and int b[3]. This means that the values are stored in registers if possible (this depends on loop-unrolling for b and generally register pressure). This is fine as long as you don't go overboard with the size of the arrays. See e.g. this answer for a deeper discussion of arrays, registers and local memory.
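To make the comparison with plain C arrays concrete, here is a minimal sketch of two functors that should behave identically: one holds a std::array<float,2>, the other a float[2]. The names AffineStd, AffineRaw and the HD macro are hypothetical and not part of the question. The macro only exists so that the same file compiles with a host compiler and with nvcc; with nvcc, the std::array variant is assumed to need -std=c++17 and --expt-relaxed-constexpr, as described above.

```cpp
#include <array>
#include <type_traits>

// Hypothetical helper macro: expands to the CUDA annotations only under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Functor holding its coefficients in a std::array, as in the question.
struct AffineStd {
    std::array<float, 2> a;
    HD float operator()(float v) const { return a[0] * v + a[1]; }
};

// Same functor with a C-style array member.
struct AffineRaw {
    float a[2];
    HD float operator()(float v) const { return a[0] * v + a[1]; }
};

// True on mainstream standard library implementations, though the standard
// does not strictly guarantee it: the wrapper adds no storage.
static_assert(sizeof(AffineStd) == sizeof(AffineRaw),
              "std::array adds no storage overhead here");

// Both functors can be copied bytewise, e.g. when passed by value to a kernel.
static_assert(std::is_trivially_copyable<AffineStd>::value,
              "functor with std::array member is trivially copyable");
```

Either functor could be passed to thrust::transform. The functor object is copied to the device by value, so the small array travels with it.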

Other Containers / Alternatives:

For other STL containers that use dynamic memory, you probably won't be as lucky in terms of member functions being constexpr. The HPC nvc++ compiler (formerly the PGI C++ compiler) does not need __device__ markers, so in theory one can use a lot more STL functionality in device code, but in most cases that is a bad idea in terms of performance. STL functions must also still conform to CUDA's C++ Language Restrictions.

Nvidia is developing its own C++ standard library implementation with its own device extensions in libcu++. There are no containers yet, but they might come in the future. For hash tables there is the cuCollections library (WIP).

paleonix