
I'm new to CUDA and the Thrust library. I'm trying to implement a function that calls a Thrust algorithm inside a for loop. Is there a way to replace this loop with a single Thrust call, or should I use a CUDA kernel to achieve this?

I have come up with code like this:

#include <cstdio>

#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// thrust functor
struct GreaterthanX
{
    const int _x;
    GreaterthanX(int x) : _x(x) {}

    __host__ __device__ bool operator()(const int &a) const
    {
        return a > _x;
    }
};

int main(void)
{
    // fill a device_vector with
    // 3 2 4 5
    // 0 -2 3 1
    // 9 8 7 6
    int row = 3;
    int col = 4;
    thrust::device_vector<int> vec(row * col);
    thrust::device_vector<int> count(row);
    vec[0] = 3;
    vec[1] = 2;
    vec[2] = 4;
    vec[3] = 5;
    vec[4] = 0;
    vec[5] = -2;
    vec[6] = 3;
    vec[7] = 1;
    vec[8] = 9;
    vec[9] = 8;
    vec[10] = 7;
    vec[11] = 6;

    // Goal: For each row, count the number of elements greater than 2. 
    // And then find the row with the max count

    // count the element greater than 2 in vec
    for (int i = 0; i < row; i++)
    {
        count[i] = thrust::count_if(vec.begin(), vec.begin() + i * col, GreaterthanX(2));
    }

    thrust::device_vector<int>::iterator result = thrust::max_element(count.begin(), count.end());
    int max_val = *result;
    unsigned int position = result - count.begin();

    printf("result = %d at position %d\r\n", max_val, position);
    // result = 4 at position 2

    return 0;
}

My goal is to find the row that has the most elements greater than 2. I'm struggling with how to do this without a loop. Any suggestions would be greatly appreciated. Thanks.

  • Thrust is great for simple computations, but it is not very efficient due to its many synchronizations, and not very flexible once the computation starts to get complex. Although you could certainly treat each row as a fixed-size-typed value (computed by only one thread), I advise you to use CUB instead for such a use case. CUB is more complex to use, though (but also fast and much more flexible). – Jérôme Richard Feb 25 '22 at 09:43
  • @JérômeRichard While this certainly still holds in general, there is a new feature in Thrust that allows avoiding some unnecessary synchronization. It's a new execution policy named `thrust::cuda::par_nosync`. – paleonix Feb 25 '22 at 14:16
  • What you want is `thrust::count_if(vec.begin() + i * col, vec.begin() + (i + 1) * col, GreaterthanX(2));`, right? – paleonix Feb 25 '22 at 14:17
  • Using Thrust, I would try to implement this using a segmented reduction, i.e. `thrust::reduce_by_key`. By using a smart iterator as "key" (maybe a transform iterator taking a counting iterator and dividing the index by `col`) this should be fairly efficient. – paleonix Feb 25 '22 at 14:25
  • Generally Thrust isn't well suited for multi-dimensional data (or at least computations that need multi-dimensional indices). There is a relatively new high-level C++17 library by NVIDIA named [MatX](https://github.com/NVIDIA/MatX), which is explicitly targeted at matrices, tensors etc. It is somewhat similar to Python's NumPy. I haven't used it yet, but it may be better for this operation. – paleonix Feb 25 '22 at 14:31
  • Indeed, this is apparently a very new feature. Previous attempts to reduce synchronizations have fallen short so far. I hope this one is the good one ;). By the way, a `thrust::reduce_by_key` will certainly be very inefficient in this case (due to a radix sort certainly being used), especially since the number of columns is 4. I had the same problem in the past and switched to CUB for this reason (I was working on dataframes). I also found out that they now use CUB internally for several methods and explicitly expose the `thrust::cub` namespace. – Jérôme Richard Feb 25 '22 at 15:58
  • Thank you @JérômeRichard for your suggestion! I will first try the thrust approach paleonix provides and see if its efficiency is good enough for me. Otherwise I will dive into CUB and explore more! – KLi2708 Feb 25 '22 at 17:16

1 Answer


Solution using Thrust

Here is an implementation using thrust::reduce_by_key in conjunction with multiple "fancy iterators".

I also took the liberty of sprinkling in some const, auto and lambdas for elegance and readability. Due to the lambdas, you will need to use the -extended-lambda flag for nvcc.
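
For reference, a compile command could look like the following; the file name is just a placeholder, and on older CUDA toolkits the flag is spelled --expt-extended-lambda:

nvcc -std=c++14 -extended-lambda -o row_count row_count.cu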

#include <cassert>
#include <cstdio>

#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/extrema.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>

int main(void)
{
    // fill a device_vector with
    // 3 2 4 5
    // 0 -2 3 1
    // 9 8 7 6
    int const row = 3;
    int const col = 4;
    thrust::device_vector<int> vec(row * col);
    vec[0] = 3;
    vec[1] = 2;
    vec[2] = 4;
    vec[3] = 5;
    vec[4] = 0;
    vec[5] = -2;
    vec[6] = 3;
    vec[7] = 1;
    vec[8] = 9;
    vec[9] = 8;
    vec[10] = 7;
    vec[11] = 6;
    thrust::device_vector<int> count(row);

    // Goal: For each row, count the number of elements greater than 2. 
    // And then find the row with the max count

    // count the element greater than 2 in vec

    // counting iterator avoids read from global memory, gives index into vec
    auto keys_in_begin = thrust::make_counting_iterator(0);
    auto keys_in_end = thrust::make_counting_iterator(row * col);
    
    // transform vec on the fly
    auto vals_in_begin = thrust::make_transform_iterator(
        vec.cbegin(), 
        [] __host__ __device__ (int val) { return val > 2 ? 1 : 0; });
    
    // discard to avoid write to global memory
    auto keys_out_begin = thrust::make_discard_iterator();
    
    auto vals_out_begin = count.begin();
    
    // transform keys (indices) into row indices and then compare
    // the divisions are one reason one might rather
    // use MatX for higher dimensional data
    auto binary_predicate = [col] __host__ __device__ (int i, int j){
        return i / col == j / col;
    };
    
    // this function returns a new end for count 
    // b/c the final number of elements is often not known beforehand
    auto new_ends = thrust::reduce_by_key(keys_in_begin, keys_in_end,
                                         vals_in_begin,
                                         keys_out_begin,
                                         vals_out_begin,
                                         binary_predicate);
    // make sure that we didn't provide too small of an output vector
    assert(thrust::get<1>(new_ends) == count.end());

    auto const result = thrust::max_element(count.begin(), count.end());
    int const max_val = *result;
    auto const position = thrust::distance(count.begin(), result);

    std::printf("result = %d at position %td\r\n", max_val, position);
    // result = 4 at position 2

    return 0;
}

Bonus solution using MatX

As mentioned in the comments, NVIDIA has released a new high-level C++17 library called MatX which targets problems involving (dense) multi-dimensional data (i.e. tensors). The library tries to unify multiple low-level libraries like cuFFT, cuSOLVER and CUTLASS behind a Python-/MATLAB-like interface. At the time of this writing (v0.2.2) the library is still in initial development and therefore probably doesn't guarantee a stable API. Due to this, the performance not being as optimized as with the more mature Thrust library, and the documentation/samples not being quite exhaustive, MatX should not be used in production code yet. While constructing this solution I actually stumbled upon a bug which was instantly fixed. So the following code will only work on the main branch and not with the current release v0.2.2, and some of the used features might not appear in the documentation yet.

A solution using MatX looks like this:

#include <iostream>
#include <matx.h>

int main(void)
{
    int const row = 3;
    int const col = 4;
    auto tensor = matx::make_tensor<int, 2>({row, col});
    tensor.SetVals({{3, 2, 4, 5},
                    {0, -2, 3, 1},
                    {9, 8, 7, 6}});
    // tensor.Print(0,0); // print full tensor

    auto count = matx::make_tensor<int, 1>({row});
    // count.Print(0); // print full count

    // Goal: For each row, count the number of elements greater than 2.
    // And then find the row with the max count

    // the kind of reduction is determined through the shapes of tensor and count
    matx::sum(count, matx::as_int(tensor > 2));

    // A single value (scalar) is a tensor of rank 0: 
    auto result_idx = matx::make_tensor<matx::index_t>();
    auto result = matx::make_tensor<int>();
    matx::argmax(result, result_idx, count);

    cudaDeviceSynchronize();
    std::cout << "result = " << result() 
              << " at position " << result_idx() << "\r\n";
    // result = 4 at position 2

    return 0;
}

As MatX employs deferred execution operators, matx::as_int(tensor > 2) is effectively fused into the reduction kernel, achieving the same effect as the thrust::transform_iterator in the Thrust solution.
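
To make the deferred execution a bit more concrete, the reduction from the code above can equivalently be written in two steps (a minimal sketch; these are the same calls as above, just split up, assuming the expression object can be stored in a local variable):

// builds a lazy expression object; no kernel is launched here
auto greater_than_2 = matx::as_int(tensor > 2);
// the row-wise reduction consumes the expression and launches the fused kernel
matx::sum(count, greater_than_2);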

Due to MatX knowing about the regularity of the problem while Thrust does not, the MatX solution could potentially be more performant than the Thrust solution. It certainly is more elegant. It is also possible to construct tensors in already allocated memory, so one can mix the two libraries, e.g. by constructing a tensor in the memory of a thrust::device_vector named vec via passing thrust::raw_pointer_cast(vec.data()) to the constructor of the tensor.
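
A minimal sketch of that interop idea (the exact make_tensor overload taking a raw device pointer is an assumption here and may differ between MatX versions):

// hypothetical sketch: wrap the device memory of an existing
// thrust::device_vector in a non-owning 2D MatX tensor view (no copy)
thrust::device_vector<int> vec(row * col);
auto tensor = matx::make_tensor<int, 2>(
    thrust::raw_pointer_cast(vec.data()), {row, col});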

paleonix
  • Thank you @paleonix for the very detailed help! I'll try this approach and use it in my project. – KLi2708 Feb 25 '22 at 17:19
  • @user18306686 I added another solution via MatX. You probably shouldn't use it right away due to the library not being mature yet, but maybe seeing its elegance helps keep it on the horizon. – paleonix Feb 28 '22 at 20:33
  • Thanks a lot (again!) for the kind help @paleonix. I've successfully used the first approach in my project and got a pretty good performance improvement. But the MatX solution is surprisingly elegant and has the potential for better performance. I'll keep my eye on it and give it a try in the future! – KLi2708 Feb 28 '22 at 23:44