Using thrust to handle vectors in CUDA classes?

Question

I have a question about the applicability of the thrust into c++ classes. I am trying to implement a class object that receives (x,y,z) coordinates of vertices as ver1, ver2, and ver3. Then, assigns to a triangle and calculates area and normal vector.

However, I didn’t quite understand how to create a class of thrust vectors.

Here are the coordinates of the vertices I read it from a file and I would like to send them to a class that will assign them into triangles. Here we are in main.

        thrust::host_vector<double> dum(start, end); //dum has coordinates and I create
vertices from it.
        thrust::host_vector<double> ver1(dum.begin(),dum.begin()+3); //initilizing elements in CPU.
        thrust::host_vector<double> ver2(dum.begin()+3,dum.begin()+6);
        thrust::host_vector<double> ver3(dum.begin()+6,dum.end());

        thrust::device_vector<double> ver1_gpu = ver1; //copying CPU vectors to GPU vectors.
        thrust::device_vector<double> ver2_gpu = ver2;
        thrust::device_vector<double> ver3_gpu = ver3;
        triangle(ver1_gpu, ver2_gpu, ver3_gpu);

In the triangle class, I tried to initialize 3 vertices that have all zeros for their first 3 elements. Since each vertices have 3 coordinates.(x, y, and z). and I also initialize area and normal variables.

class triangle
{
    thrust::device_vector<double>v1(3,0);
    thrust::device_vector<double>v2(3,0);
    thrust::device_vector<double>v3(3,0);
    thrust::device_vector<double>E1(3,0);
    thrust::device_vector<double>E2(3,0);
    thrust::device_vector<double>E3(3,0);
    double normal;
    double dummy
    double area;

public:
    __device__ __host__ triangle(device_vector<double>vert1, device_vector<double>vert2, device_vector<double>vert3)
    {
        triangle.v1 = vert1;
        triangle.v2 = vert2;
        triangle.v3 = vert3;
        triangle.E1 = vert2 - vert1;
        triangle.E2 = vert3 - vert1;
        dummy = cross(obj.E2, obj.E1);%% Cross product
        triangle.Area = norm(dummy) / 2;
        triangle.Normal = dummy / norm(dummy);
    }
};

I’d like to do all of my calculations in the device. I am new to cuda and its libraries and I know I am wrong in many places but I seek your help.

Device vectors are host only. You can’t use them in device code. Your design cannot work — talonmies, Oct 23 '22 at 12:59
Thank you for your answer @talonmies . but even If I create those vectors as thrust::host_vectorvert1(3, 0); it says expected a type specifier. — Seb Seb, Oct 23 '22 at 13:07
You can’t initialise non constant variables as part of their definition within a class. Initialisation goes in the constructor — talonmies, Oct 24 '22 at 06:25
@talonmies [You can](https://stackoverflow.com/q/18811742/10107454), just not with `(...)`. Braced initializers are no problem. Just the old problem with `vector`'s initializer list constructor. — paleonix, Oct 24 '22 at 10:35
@SebSeb You not only seem to be new to CUDA, but to C++ as well. Else you would know that `vector` isn't the right container for a "vector" as it is used in 3D geometry/linear algebra. Also you would be passing vectors as references and not as values. `vert2 - vert1` wont work either, one would use a `transform` for that. Please take a look at the [Thrust examples](https://github.com/NVIDIA/thrust/tree/main/examples) before wasting your time by writing nonsense. — paleonix, Oct 24 '22 at 10:41
For GPU computing and Thrust you need to start thinking in the data-parallel model. To get good performance you normally want to use a struct of arrays (SoA) design, i.e. one `vector` for the `x` coordinates of all your vertices, one vector for all `y` coordinates, and so on. Take a look at the [weld vertices sample](https://github.com/NVIDIA/thrust/blob/main/examples/weld_vertices.cu) specifically. — paleonix, Oct 24 '22 at 10:45
Hi! Thank you for your advices @paleonix I couldn't find any examples on how to implement classes with cuda. Does it mean I should use a struct instead of a class? — Seb Seb, Oct 25 '22 at 15:40
Thrust is written in a way that doesn't depend on CUDA, you can also use e.g. OpenMP as a backend. Therefore writing classes is no different than with pure C++. You just want to have classes that capture full datasets instead of one element each for the mentioned reasons. — paleonix, Oct 25 '22 at 16:54
I am still struggling to create a class that receives the coordinates of the vertices of the triangles and does the subtraction on the GPU. if you fix my code to do it, It would be very helpful for me to understand and to further my project. @paleonix — Seb Seb, Oct 25 '22 at 17:03

paleonix · Accepted Answer · 2023-01-28T17:40:57.137

The following code is intended to show how to avoid unnecessary copies through move semantics (not Thrust specific) and initialization through clever use of Thrusts "fancy iterators" (thrust::transform_iterator and thrust::zip_iterator) in a C++ class leveraging Thrust vectors to offload computation. As our class is supposed to handle many triangles at once to utilize the resources of a modern GPU and recuperate the associated overheads, it is named Triangles. To achieve good performance in bandwidth-bound applications like this on the GPU, coalescing of global memory accesses is key. A straightforward way of achieving this is to use so-called Struct-of-Array (SoA) semantics, e.g.

struct SoA_example {
    double x[N];
    double y[N];
    double z[N];
};

instead of Array-of-Structs semantics, e.g.

struct Vertex {
    double x;
    double y;
    double z;
}
Vertex AoS_example[N];

This vocabulary somewhat clashes with the names of the C++ and Thrust containers used below, as our "array" is the thrust::device_vector while std::array is used as a "struct".

Depending on the context one could think of many other constructors than the one showed below that e.g. handle data transfer from host to device, read values from file or handle (host) input in AoS format.

OPs question defines dummy and normal as scalars which doesn't make sense mathematically. The cross-product of two vectors is another vector. I corrected this here.

The following code is untested but compiles.

#include <cstddef>

#include <array>
#include <utility>

#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/zip_function.h>

template <typename T>
struct CrossProduct {
    __host__ __device__
    T operator()(T a, T b, T c, T d) const {
        return a * b - c * d;
    }
};

template <typename T>
struct TriangleArea {
    __host__ __device__
    T operator()(T x, T y, T z) const {
        return sqrt(x * x + y * y + z * z) / 2;
    }
};

template <typename T>
struct NormalizeUsingArea {
    __host__ __device__
    T operator()(T &val_x, T &val_y, T &val_z, T area) const {
        T norm = 0.5 / area;
        val_x *= norm;
        val_y *= norm;
        val_z *= norm;
    }
};

template <typename T>
class Triangles
{
    static constexpr int x_dim = 0;
    static constexpr int y_dim = 1;
    static constexpr int z_dim = 2;
    static constexpr int n_dims = 3;

    using Container = thrust::device_vector<T>;
    using Vectors = std::array<Container, n_dims>;

    std::ptrdiff_t n_triangles{};
    Vectors v1;
    Vectors v2;
    Vectors v3;

    Vectors E1;
    Vectors E2;
    // Vectors E3;

    // Vectors dummies;
    Vectors normals;
    Container areas;
    

    // helper functions
    auto make_minus_iterator(Vectors &vs_1,
                             Vectors &vs_2,
                             int component)
    {
        return thrust::make_transform_iterator(
                    thrust::make_zip_iterator(
                        thrust::make_tuple(vs_1[component].cbegin(),
                                           vs_2[component].cbegin())),
                    thrust::make_zip_function(thrust::minus<T>{}));
    }

    template <int component>
    auto choose_crossprod_components(Vectors &vs_1,
                                     Vectors &vs_2)
    {
        static_assert(component >= x_dim && component < n_dims);
        if constexpr (component == x_dim)
        {
            return thrust::make_tuple(vs_1[y_dim].cbegin(),
                                      vs_2[z_dim].cbegin(),
                                      vs_1[z_dim].cbegin(),
                                      vs_2[y_dim].cbegin());
        }
        if constexpr (component == y_dim)
        {
            return thrust::make_tuple(vs_1[z_dim].cbegin(),
                                      vs_2[x_dim].cbegin(),
                                      vs_1[x_dim].cbegin(),
                                      vs_2[z_dim].cbegin());
        }
        if constexpr (component == z_dim)
        {
            return thrust::make_tuple(vs_1[x_dim].cbegin(),
                                      vs_2[y_dim].cbegin(),
                                      vs_1[y_dim].cbegin(),
                                      vs_2[x_dim].cbegin());
        }
    }

    template <int component>
    auto make_crossprod_iterator(Vectors &vs_1,
                                 Vectors &vs_2)
    {   
        return thrust::make_transform_iterator(
                    thrust::make_zip_iterator(choose_crossprod_components<component>(vs_1, vs_2)),
                    thrust::make_zip_function(CrossProduct<T>{}));
    }

    auto make_area_iterator(Vectors &crossproduct)
    {
        return thrust::make_transform_iterator(
                    thrust::make_zip_iterator(
                        thrust::make_tuple(crossproduct[x_dim].cbegin(),
                                           crossproduct[y_dim].cbegin(),
                                           crossproduct[z_dim].cbegin())),
                    thrust::make_zip_function(TriangleArea<T>{}));
    }

    auto make_zip_iterator(Vectors &vecs,
                           const Container &scalars)
    {
        return thrust::make_zip_iterator(
                    thrust::make_tuple(vecs[x_dim].begin(),
                                       vecs[y_dim].begin(),
                                       vecs[z_dim].begin(),
                                       scalars.cbegin()));
    }

public:
    // The following constructor is just an example based on OPs constructor.
    // Use fancy iterators to avoid unnecessary initialization to 0 of E1, E2, ...
    // Depending on the use case it might make more sense to just have the iterators as members
    // and compute E1, E2, etc on the fly when needed and get rid of their Vectors members (kernel fusion).
    Triangles(thrust::device_vector<T> &&vert1_x,
              thrust::device_vector<T> &&vert1_y,
              thrust::device_vector<T> &&vert1_z,
              thrust::device_vector<T> &&vert2_x,
              thrust::device_vector<T> &&vert2_y,
              thrust::device_vector<T> &&vert2_z,
              thrust::device_vector<T> &&vert3_x,
              thrust::device_vector<T> &&vert3_y,
              thrust::device_vector<T> &&vert3_z) :
        n_triangles{static_cast<std::ptrdiff_t>(vert1_x.size())},
        // move device_vectors with vertices into class (avoids expensive copies)
        v1{std::move(vert1_x),
           std::move(vert1_y),
           std::move(vert1_z)},
        v2{std::move(vert2_x),
           std::move(vert2_y),
           std::move(vert2_z)},
        v3{std::move(vert3_x),
           std::move(vert3_y),
           std::move(vert3_z)},
        // calculate diffs and initialize E1, E2 with them
        E1{Container(make_minus_iterator(v2, v1, x_dim),
                     make_minus_iterator(v2, v1, x_dim) + n_triangles),
           Container(make_minus_iterator(v2, v1, y_dim),
                     make_minus_iterator(v2, v1, y_dim) + n_triangles),
           Container(make_minus_iterator(v2, v1, z_dim),
                     make_minus_iterator(v2, v1, z_dim) + n_triangles)},
        E2{Container(make_minus_iterator(v3, v1, x_dim),
                     make_minus_iterator(v3, v1, x_dim) + n_triangles),
           Container(make_minus_iterator(v3, v1, y_dim),
                     make_minus_iterator(v3, v1, y_dim) + n_triangles),
           Container(make_minus_iterator(v3, v1, z_dim),
                     make_minus_iterator(v3, v1, z_dim) + n_triangles)},
        // calculate cross-products and initialize normals with them(normalize later)
        normals{Container(make_crossprod_iterator<x_dim>(E2, E1),
                          make_crossprod_iterator<x_dim>(E2, E1) + n_triangles),
                Container(make_crossprod_iterator<y_dim>(E2, E1),
                          make_crossprod_iterator<y_dim>(E2, E1) + n_triangles),
                Container(make_crossprod_iterator<z_dim>(E2, E1),
                          make_crossprod_iterator<z_dim>(E2, E1) + n_triangles)},
        // calculate areas and initialize with them
        areas(make_area_iterator(normals),
              make_area_iterator(normals) + n_triangles)
    {
        // normalize normals
        thrust::for_each_n(make_zip_iterator(normals, areas),
                           n_triangles,
                           thrust::make_zip_function(NormalizeUsingArea<double>{}));
    }
};

// expicit instantiation to find compilation errors on godbolt.com
template class Triangles<double>;

Hello @paleonix. I wonder why did you use the same container Vectors both for E1 and normals. I didn't understand because normals are double values and E1 is a vector. and If I am trying to initialize a 2D vector. How can I do it? Should I create another container like using Container2D = thrust::device_vector> and using Vectors2D = std::array>? — Seb Seb, Jan 28 '23 at 12:41
@SebSeb `normals` are also vectors, not scalars (normalized cross product of two vectors). Only `areas` are scalars. This is an error in the question which I just corrected without mentioning it, which I should improve. Thanks for the feedback. What are the dimensions of your 2D data? Are some of the known at compile time? — paleonix, Jan 28 '23 at 17:36
While 9 vectors for each collection of matrices would work, this would amount to a lot of boilerplate. I would go with a single vector of size `3 * 3 * num_matrices` and then have a device function doing the index calculation, s.t. first come all the `0,0` elements, than all `0,1` elements and so on until `3,3`. That way you get coalescing. Ideally `num_matrices` is some power of two or multiple of a device L2 cache page size (with respect to the used type). — paleonix, Jan 29 '23 at 09:06

Using thrust to handle vectors in CUDA classes?

1 Answers1