I have an algorithm in which I repeatedly do a reduce_by_key and get a vector of reduced values out, one per keyed segment. I keep track of the best output value for each segment over successive iterations. It looks like this:
thrust::device_vector<int> keys(N), values(N), outKeys(N), outValues(N), outValuesBest(N);
// Set up the keys
// Initialize outValuesBest
while (1) {
// Get some values (keys stay the same across iterations)
// Do the reduce
auto outEnd = thrust::reduce_by_key(thrust::device,
keys.begin(), keys.end(), values.begin(),
outKeys.begin(), outValues.begin(),
[] __device__(int ka, int kb) { return ka == kb; },
[] __device__(int a, int b) { return min(a, b); });
size_t nSegments = outEnd.first - outKeys.begin();
auto outValuesp = outValues.begin();
auto outValuesBestp = outValuesBest.begin();
// Update the per-segment vector of best results
thrust::for_each_n(thrust::device,
thrust::counting_iterator<size_t>(0), nSegments, [=] __device__(size_t i) {
if (outValuesp[i] < outValuesBestp[i]) {
outValuesBestp[i] = outValuesp[i];
}
});
}
As you can see, my current implementation uses a separate algorithm to compare each element of the reduced output and update the corresponding element in the best vector for each segment that improved. I would like to get rid of the second operation and the second copy of the array (and its associated bandwidth). This is a job for kernel fusion using fancy iterators in Thrust. Does anyone know of a way to fuse these two kernels?
I'm envisioning a new kind of fancy iterator to handle this. Call it a conditional_discard_iterator. It would allow me to say the following instead. The idea is that it only actually writes the output element to the underlying iterator if the predicate returns true and discards it if false.
thrust::device_vector<int> keys(N), values(N), outKeys(N), outValuesBest(N);
// Set up the keys...
// Initialize outValuesBest...
while (1) {
// Get some values...
auto OutIt = make_conditional_discard_iterator(outValuesBest.begin(),
[] __device__(int newValue, int fromOutValuesBest)
{ return newValue < fromOutValuesBest; });
auto outEnd = thrust::reduce_by_key(thrust::device,
keys.begin(), keys.end(), values.begin(),
outKeys.begin(), OutIt,
[] __device__(int ka, int kb) { return ka == kb; },
[] __device__(int a, int b) { return min(a, b); });
size_t nSegments = outEnd.first - outKeys.begin();
}
The inputs to the predicate could be a bit more general but otherwise this seems like a pretty elegant solution. I'm just hoping something like it already exists. If not in Thrust, is there in Boost or STL?
To some it is less than obvious that this question is not a duplicate of another, simply because it involves the subject of fancy output iterators.