Solution using Thrust
Here is an implementation using thrust::reduce_by_key
in conjunction with multiple "fancy iterators".
I also took the freedom to sprinkle in some const
, auto
and lambdas for elegance and readability. Due to the lambdas, you will need to use the -extended-lambda
flag for nvcc
.
#include <cassert>
#include <cstdio>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
int main(void)
{
// fill a device_vector with
// 3 2 4 5
// 0 -2 3 1
// 9 8 7 6
int const row = 3;
int const col = 4;
thrust::device_vector<int> vec(row * col);
vec[0] = 3;
vec[1] = 2;
vec[2] = 4;
vec[3] = 5;
vec[4] = 0;
vec[5] = -2;
vec[6] = 3;
vec[7] = 1;
vec[8] = 9;
vec[9] = 8;
vec[10] = 7;
vec[11] = 6;
thrust::device_vector<int> count(row);
// Goal: For each row, count the number of elements greater than 2.
// And then find the row with the max count
// count the element greater than 2 in vec
// counting iterator avoids read from global memory, gives index into vec
auto keys_in_begin = thrust::make_counting_iterator(0);
auto keys_in_end = thrust::make_counting_iterator(row * col);
// transform vec on the fly
auto vals_in_begin = thrust::make_transform_iterator(
vec.cbegin(),
[] __host__ __device__ (int val) { return val > 2 ? 1 : 0; });
// discard to avoid write to global memory
auto keys_out_begin = thrust::make_discard_iterator();
auto vals_out_begin = count.begin();
// transform keys (indices) into row indices and then compare
// the divisions are one reason one might rather
// use MatX for higher dimensional data
auto binary_predicate = [col] __host__ __device__ (int i, int j){
return i / col == j / col;
};
// this function returns a new end for count
// b/c the final number of elements is often not known beforehand
auto new_ends = thrust::reduce_by_key(keys_in_begin, keys_in_end,
vals_in_begin,
keys_out_begin,
vals_out_begin,
binary_predicate);
// make sure that we didn't provide too small of an output vector
assert(thrust::get<1>(new_ends) == count.end());
auto const result = thrust::max_element(count.begin(), count.end());
int const max_val = *result;
auto const position = thrust::distance(count.begin(), result);
std::printf("result = %d at position %d\r\n", max_val, position);
// result = 4 at position 2
return 0;
}
Bonus solution using MatX
As mentioned in the comments NVIDIA has released a new high-level, C++17 library called MatX which targets problems involving (dense) multi-dimensional data (i.e. tensors). The library tries to unify multiple low-level libraries like CUFFT, CUSOLVER and CUTLASS in one python-/matlab-like interface. At the point of this writing (v0.2.2) the library is still in initial development and therefore probably doesn't guarantee a stable API. Due to this, the performance not being as optimized as with the more mature Thrust library and the documentation/samples not being quite exhaustive, MatX should not be used in production code yet. While constructing this solution I actually stumbled upon a bug which was instantly fixed. So this code will only work on the main branch and not with the current release v0.2.2 and some used features might not appear in the documentation yet.
A solution using MatX looks the following way:
#include <iostream>
#include <matx.h>
int main(void)
{
int const row = 3;
int const col = 4;
auto tensor = matx::make_tensor<int, 2>({row, col});
tensor.SetVals({{3, 2, 4, 5},
{0, -2, 3, 1},
{9, 8, 7, 6}});
// tensor.Print(0,0); // print full tensor
auto count = matx::make_tensor<int, 1>({row});
// count.Print(0); // print full count
// Goal: For each row, count the number of elements greater than 2.
// And then find the row with the max count
// the kind of reduction is determined through the shapes of tensor and count
matx::sum(count, matx::as_int(tensor > 2));
// A single value (scalar) is a tensor of rank 0:
auto result_idx = matx::make_tensor<matx::index_t>();
auto result = matx::make_tensor<int>();
matx::argmax(result, result_idx, count);
cudaDeviceSynchronize();
std::cout << "result = " << result()
<< " at position " << result_idx() << "\r\n";
// result = 4 at position 2
return 0;
}
As MatX employs deferred execution operators, matx::as_int(tensor > 2)
is effectively fused into the kernel achieving the same as using a thrust::transform_iterator
in Thrust.
Due to MatX knowing about the regularity of the problem while Thrust does not, the MatX solution could potentially be more performant than the Thrust solution. It certainly is more elegant. It is also possible to construct tensors in already allocated memory, so one can mix the libraries e.g. my constructing a tensor in the memory of a thrust::vector
named vec
via passing thrust::raw_pointer_cast(vec.data())
to the constructor of the tensor.