I'm trying to unroll an implementation of a function to optimize it in CUDA. Basically, I have a piece of shared memory that was originally slowing down my code, and by "unrolling" my implementation (reducing the total number of threads and having each thread do twice the work) I got substantial performance gains. I want to see whether more unrolling yields further gains, but I made extensive use of tuples to get this to work, and I find a lot of code duplication happens in the process. I'd like to cut down on that duplication.
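To be concrete about what I mean by unrolling, the pattern is roughly the one below (a minimal sketch with hypothetical names, not my actual kernel): the grid is halved and each thread loads and combines two elements before touching shared memory.

__global__ void foo_unrolled(const float* in, float* out, int n)
{
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    // Each block covers 2 * blockDim.x elements, so half as many threads
    // are launched and each one does twice the work.
    int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    float a = (i < n) ? in[i] : 0.0f;
    float b = (i + blockDim.x < n) ? in[i + blockDim.x] : 0.0f;
    smem[tid] = a + b;
    __syncthreads();
    // ... the rest of the per-block work on smem ...
}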
Here is an example of the kind of thing that happens frequently in my code:
__device__
thrust::tuple<T,T,T,...> foo(thrust::tuple<G,G,G,...> choice_arg, ...){
    // all do the same thing, with very similar args as well
    T value1 = someoperation(thrust::get<0>(choice_arg), ...);
    T value2 = someoperation(thrust::get<1>(choice_arg), ...);
    T value3 = someoperation(thrust::get<2>(choice_arg), ...);
    ...
    return thrust::make_tuple(value1, value2, value3, ...);
}
Instead of writing all the boilerplate myself, I'd just like to have a function like this:
__device__
thrust::tuple<T,T,T,...> foo(thrust::tuple<G,G,G,...> choice_arg, ...){
    return someoperation<CHOICE_ARG_LENGTH>(choice_arg, ...);
}
I've seen how something like this could help, but a normal template loop won't work if I need to return a thrust::tuple. That solution would work if Thrust had thrust::tuple_cat, but the variadic-template tuples have yet to be merged, despite the work being done in 2014, and I can't find any talk referencing merging the tuple_cat implementation. So is it possible to implement the behavior I'm looking for without a thrust::tuple_cat implementation on the GPU?
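For reference, the direction I've been exploring is to expand an index pack straight into thrust::make_tuple, which would sidestep tuple_cat entirely (a minimal sketch, assuming C++14 std::index_sequence is usable in device code; on C++11 a hand-rolled index sequence would be needed, and some_op / apply_all are hypothetical stand-ins for my real operation):

#include <thrust/tuple.h>
#include <utility> // std::index_sequence, std::make_index_sequence (C++14)

// Hypothetical per-element operation standing in for someoperation.
template <typename G>
__device__ float some_op(const G& x) { return static_cast<float>(x) * 2.0f; }

// The whole pack expands into a single thrust::make_tuple call, so no
// tuple_cat is needed and we stay within thrust's 10-element tuple limit.
template <typename Tuple, std::size_t... Is>
__device__ auto apply_all_impl(const Tuple& t, std::index_sequence<Is...>)
    -> decltype(thrust::make_tuple(some_op(thrust::get<Is>(t))...))
{
    return thrust::make_tuple(some_op(thrust::get<Is>(t))...);
}

template <typename Tuple>
__device__ auto apply_all(const Tuple& t)
    -> decltype(apply_all_impl(
           t, std::make_index_sequence<thrust::tuple_size<Tuple>::value>{}))
{
    return apply_all_impl(
        t, std::make_index_sequence<thrust::tuple_size<Tuple>::value>{});
}

With something like this, each foo would collapse to return apply_all(choice_arg);, but I don't know whether this approach holds up against Thrust's fixed-arity tuples in device code, hence the question.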
Note that I cannot use arrays for this: after originally using arrays, I found that moving away from them gave me a 15% speed improvement for free, seen both in the Visual Profiler and in the real-world application of the algorithm. The code is very performance critical.