
I'm trying to unroll an implementation of a function in order to optimize it in CUDA. Basically, I have a piece of shared memory which was originally slowing down my code, and by "unrolling" my implementation (reducing the total number of threads, with each thread doing twice the work) I was able to get substantial performance gains. I want to see if I can manage further gains with more unrolling; however, I made extensive use of tuples to make this happen, and I find that a lot of code duplication results. I'd like to cut down on that duplication.
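
For concreteness, a hypothetical sketch of the kind of 2x unrolling I mean (illustrative only, not my actual kernel):

__global__ void scale2x(float* data, float s, int n) {
    // Launched with half as many threads: each thread handles two elements.
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i < n)     data[i]     *= s;
    if (i + 1 < n) data[i + 1] *= s;
}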

Here is an example of the kind of thing that happens frequently in my code:

__device__
thrust::tuple<T,T,T,...> foo(thrust::tuple<G,G,G,...> choice_arg, ...){
    // all calls do the same thing, with very similar args as well
    T value1 = someoperation(thrust::get<0>(choice_arg),...);
    T value2 = someoperation(thrust::get<1>(choice_arg),...);
    T value3 = someoperation(thrust::get<2>(choice_arg),...);
    ...
    return thrust::make_tuple(value1, value2, value3,...);
}

Instead of writing all the boilerplate here myself, I'd just like to have a function like this:

__device__
thrust::tuple<T,T,T,...> foo(thrust::tuple<G,G,G,...> choice_arg, ...){
    return someoperation<CHOICE_ARG_LENGTH>(choice_arg,...);
}

I've seen how something like this could help, but a normal template loop won't work if I need to return a thrust::tuple. That solution would work if Thrust had thrust::tuple_cat, but they have yet to merge variadic template tuples, despite the work being done in 2014, and I can't even find any talks referencing merging the tuple_cat implementation! So is it possible to implement the behavior I'm looking for without a thrust::tuple_cat implementation on the GPU?

Note that I cannot use arrays for this: after originally using arrays, I found that switching to tuples gave me a 15% speed improvement for free, seen both in the visual profiler and in the real-world application of the algorithm. The code is very performance critical.

Krupip
  • You mention using arrays is slower. I assume you are talking about run-time looping over the array? Because replacing the tuple by an array in your example should be same in terms of performance. – havogt Feb 19 '18 at 21:05
  • @havogt No, statically allocated arrays in my application produced different code, which was slower. It didn't make sense to me, but the benchmarks don't lie; I tried this with both my actual application and the visual profiler to make sure. – Krupip Feb 19 '18 at 22:39
  • In the unrolled version? – havogt Feb 20 '18 at 06:53
  • @havogt I unrolled first and got a performance gain, then I changed the unrolling to be along tuples and got another performance boost without any other changes. – Krupip Feb 20 '18 at 14:01
  • I see... Did you try the C++11 version for the unrolled tuples? – havogt Feb 20 '18 at 14:03
  • For CUDA 9+ and C++14 you can also compile with --expt-relaxed-constexpr and use std::tuple and its related functions like std::tuple_cat (see the sketch after these comments) – dada_dave Nov 28 '18 at 12:19
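
A minimal sketch of what that last comment suggests (assuming CUDA 9+, C++14, and the --expt-relaxed-constexpr flag; the kernel name is made up and the code is untested here):

#include <cstdio>
#include <tuple>

// --expt-relaxed-constexpr lets device code call constexpr host functions;
// in C++14 that covers std::make_tuple, std::get and std::tuple_cat.
__global__ void std_tuple_kernel() {
  auto a = std::make_tuple(1, 2.f);
  auto c = std::tuple_cat(a, std::make_tuple(7.));  // std::tuple<int, float, double>
  printf("%d, %f, %f\n", std::get<0>(c), std::get<1>(c), std::get<2>(c));
}

int main() {
  std_tuple_kernel<<<1, 1>>>();
  cudaDeviceSynchronize();
}

Compile with nvcc -std=c++14 --expt-relaxed-constexpr ...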

1 Answer


If you can use CUDA 9 and C++14, you can do the following; for details see e.g. std::integer_sequence.

#include <iostream>
#include <utility>
#include <thrust/tuple.h>

template <typename T>
__device__ T some_operation(T a) {
  return a + 1;  // do something smart
}

// Expand the index pack: apply some_operation to each tuple element and
// collect the results in a new tuple.
template <typename T, std::size_t... I>
__device__ auto foo_impl(const T& t, std::index_sequence<I...>) {
  return thrust::make_tuple(some_operation(thrust::get<I>(t))...);
}

// Generate the indices 0, ..., tuple_size - 1 and forward to foo_impl.
template <typename Tuple>
__device__ auto foo(const Tuple& t) {
  return foo_impl(t,
                  std::make_index_sequence<thrust::tuple_size<Tuple>::value>());
}

__global__ void test_kernel() {
  auto result = foo(thrust::make_tuple(3., 2, 7));
  printf("%f, %d, %d\n", thrust::get<0>(result), thrust::get<1>(result),
         thrust::get<2>(result));
}

int main() {
  test_kernel<<<1, 1>>>();
  cudaDeviceSynchronize();
}

Compile with nvcc -std=c++14 ...
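
Since some_operation just adds 1 to each element, the kernel should print 4.000000, 3, 8.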


For C++11

you need to

  • provide your own implementation of index_sequence
  • use trailing return types.

Here is a working version. Disclaimer: I wrote the index_sequence as it came to mind; you may want to look at an implementation from the standard library instead.

You can probably find a lot of tutorials about index_sequence/integer_sequence on the web, e.g. on cppreference.com. The basic idea of an index_sequence is that it lets you enumerate your tuple (or array) elements. In foo, an index_sequence is made which has 0, ..., thrust::tuple_size<Tuple>::value - 1 as its template parameters. In foo_impl you capture these indices in the variadic pack and expand it to call some_operation for each of the tuple elements.

#include <iostream>
#include <thrust/tuple.h>

namespace compat {
template <size_t... Indices>
struct index_sequence {};

namespace detail {
template <size_t N, typename Seq = index_sequence<>>
struct make_index_sequence_impl;

template <size_t N, size_t... Indices>
struct make_index_sequence_impl<N, index_sequence<Indices...>> {
  using type = typename make_index_sequence_impl<
      N - 1, index_sequence<N - 1, Indices...>>::type;
};

// Base case at 0 (rather than 1) so that empty tuples also work.
template <size_t... Indices>
struct make_index_sequence_impl<0, index_sequence<Indices...>> {
  using type = index_sequence<Indices...>;
};
}

template <size_t N>
using make_index_sequence = typename detail::make_index_sequence_impl<N>::type;
}

template <typename T>
__device__ T some_operation(T a) {
  return a + 1;  // do something smart
}

// C++11 lacks return type deduction for normal functions, hence the
// trailing return type.
template <typename T, std::size_t... I>
__device__ auto foo_impl(const T& t, compat::index_sequence<I...>)
    -> decltype(thrust::make_tuple(some_operation(thrust::get<I>(t))...)) {
  return thrust::make_tuple(some_operation(thrust::get<I>(t))...);
}

template <typename Tuple>
__device__ auto foo(const Tuple& t) -> decltype(foo_impl(
    t, compat::make_index_sequence<thrust::tuple_size<Tuple>::value>())) {
  return foo_impl(
      t, compat::make_index_sequence<thrust::tuple_size<Tuple>::value>());
}

__global__ void test_kernel() {
  auto result = foo(thrust::make_tuple(3., 2, 7));
  printf("%f, %d, %d\n", thrust::get<0>(result), thrust::get<1>(result),
         thrust::get<2>(result));
}

int main() {
  test_kernel<<<1, 1>>>();
  cudaDeviceSynchronize();
}
havogt
  • This makes sense for C++14, but I'm trying to use CUDA 8 atm, and it will be a while before I can safely go over to CUDA 9. I didn't even realize trailing return types were something you could use outside of lambdas, and I'm not sure how that applies. Additionally, I looked at how other people implemented [integer_sequence](https://stackoverflow.com/a/32223343/2036035), but it doesn't even seem like the same language; I can't tell how they are doing anything, e.g. why they are using sizeof everywhere, or why size() is defined but never used. – Krupip Feb 19 '18 at 19:46
  • Ok, no problem. I will add a C++11 version with some explanation. – havogt Feb 19 '18 at 20:02
  • I upgraded to CUDA 9.1 and used your first version and it worked. – Krupip Apr 26 '18 at 13:36