
I have a device vector that needs to be transformed in multiple ways (e.g., producing 20 new arrays from it), with each result then reduced (summed) and the sums returned in a host vector. The code works with `thrust::transform_reduce`, but looking at nvvp it issues a lot of cudaMalloc/cudaFree/cudaMemcpy calls that slow the algorithm down. This will run in a loop, so my idea is to pre-allocate some cache memory and cudaMemcpy all the results once at the end of each iteration. What I need to make that work is a reduce that operates in place, writing into that pre-allocated memory.
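For reference, a minimal sketch of the pattern described above (the `square` functor, the `float` element type, and reusing one functor for every array are placeholder assumptions; the real code has 20 different transforms):

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <vector>

struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

// One transform_reduce per derived array. Each call internally allocates
// scratch space (cudaMalloc), reduces, copies the scalar result back to
// the host (cudaMemcpy D2H), and frees the scratch (cudaFree).
std::vector<float> reduce_all(const thrust::device_vector<float>& d_in,
                              int num_arrays /* e.g. 20 */)
{
    std::vector<float> h_sums(num_arrays);
    for (int i = 0; i < num_arrays; ++i)
        h_sums[i] = thrust::transform_reduce(d_in.begin(), d_in.end(),
                                             square{}, 0.0f,
                                             thrust::plus<float>());
    return h_sums;
}
```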

`cub::DeviceReduce::Sum` almost does it, but it seems to take its input from host memory, copy it to device temp storage, and copy the result back to the host at the end; I want to avoid all those copies. My next candidate is CUB's inclusive scan. I don't need all the partial sums, only the final one, but even so it may be faster, since it won't do any malloc/memcpy.
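As the first comment below points out, `cub::DeviceReduce` actually operates entirely on device pointers. A minimal sketch of the preallocated pattern, where the names `d_in`, `d_sums`, and the count of 20 are assumptions for illustration:

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// d_in is the device input; d_sums is a preallocated device array with
// one slot per reduction, so every result stays resident on the GPU.
void setup_and_reduce(float* d_in, float* d_sums, int num_items)
{
    // 1. Size query: with a null scratch pointer, CUB only reports how
    //    many bytes of temp storage it needs; nothing is reduced yet.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_sums, num_items);
    cudaMalloc(&d_temp, temp_bytes);   // once, outside the hot loop

    // 2. Actual reduction: input and output are both device pointers,
    //    so no H2D/D2H traffic happens here. Writing to d_sums + k keeps
    //    each of the 20 results on the GPU.
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_sums + 0, num_items);
    // ... repeat for the other transformed arrays, output to d_sums + k ...

    // 3. One batched copy per iteration instead of one per reduction:
    // cudaMemcpy(h_sums, d_sums, 20 * sizeof(float), cudaMemcpyDeviceToHost);
}
```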

Is there any way to do this reduce in place with those libraries (CUB/Thrust) to save the malloc and memcpy time, or is the way to go to write a custom kernel for it?

  • `cub::DeviceReduce` does not expect the input to be in host memory, so it should allow you to preallocate all necessary working/scratch memory needed by the algorithm and reuse it in your loop. You can achieve something similar with thrust by providing your own allocator (a sketch of that approach follows these comments). There are already questions on the `cuda` SO tag that outline both approaches, such as [this one](https://stackoverflow.com/questions/48670284/cuda9-thrust-sort-by-key-overlayed-with-h2d-copy-using-streams). Your question is arguably a duplicate of it. – Robert Crovella Dec 06 '19 at 14:51
  • So, no real in-place reduce in any of those CUDA libraries? The answer by @RobertCrovella does help with reusing preallocated memory, but it still performs the D2H copy on each reduce. I wanted an in-place reduce to avoid both the mallocs and the memcpys. – Vinicius Pavanelli Dec 07 '19 at 20:05
  • `thrust::reduce_by_key` can avoid the final D2H copy of the reduced variable. cub device reduce has no such final D2H copy: all data originates and stays on the GPU. – Robert Crovella Dec 07 '19 at 20:34
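Following the allocator approach from the linked question, a minimal sketch (the `slab_allocator` and `square` names are illustrative, the slab management is deliberately simple, and, as noted in the comments, the scalar return value still costs one small D2H copy per call):

```cpp
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cassert>
#include <cstddef>

// Hands out pieces of one preallocated device slab instead of calling
// cudaMalloc: a bump allocator that resets once all temporaries are freed.
struct slab_allocator {
    typedef char value_type;
    char*  slab     = nullptr;  // cudaMalloc'd once, before the loop
    size_t capacity = 0;
    size_t offset   = 0;
    int    live     = 0;

    char* allocate(std::ptrdiff_t num_bytes) {
        assert(offset + static_cast<size_t>(num_bytes) <= capacity);
        char* p = slab + offset;
        // Round up to keep later allocations 256-byte aligned.
        offset += (static_cast<size_t>(num_bytes) + 255) & ~size_t(255);
        ++live;
        return p;
    }
    void deallocate(char*, size_t) {
        if (--live == 0) offset = 0;  // reset when everything is freed
    }
};

struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

// Setup (once):
//   slab_allocator alloc;
//   alloc.capacity = 1 << 20;              // generous guess at scratch size
//   cudaMalloc(&alloc.slab, alloc.capacity);
float sum_of_squares(slab_allocator& alloc, const float* d_ptr, int n)
{
    // The allocator-aware policy removes the per-call cudaMalloc/cudaFree;
    // the returned scalar is still copied D2H by thrust.
    return thrust::transform_reduce(thrust::cuda::par(alloc),
                                    d_ptr, d_ptr + n,
                                    square{}, 0.0f, thrust::plus<float>());
}
```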

0 Answers