How to efficiently gather data from threads in CUDA?

Question

I have an application that solves a system of equations in CUDA. I know for sure that each thread can find up to 4 solutions, but how can I copy them back to the host?

I'm passing a huge array with enough space for every thread to store 4 solutions (4 doubles per solution), plus a second array holding the number of solutions found by each thread. However, that's a naive approach, and it is the current bottleneck of my kernel.

I'd really like to optimize this. The main problem is concatenating a variable number of solutions per thread into a single array.

It would be much easier to help if I knew more about your program. To my knowledge (it's been about a year since I last worked with CUDA, so I might be wrong), memory copies are the only way to retrieve results, and they are slow. Also, which version of CUDA are you using, and on which card? – 8bitwide, Jun 22 '12 at 00:59
The code is too big to post here. I agree that cudaMemcpy is the only way to get the results back, but I could at least avoid copying garbage. – RSFalcon7, Jun 22 '12 at 01:07

score 5 · Accepted Answer · edited May 23 '17 at 10:30

The functionality you're looking for is called stream compaction.

You probably do need to provide an array that contains room for 4 solutions per thread. Attempting to store the results directly in compact form is likely to create so many dependencies between the threads that whatever you gain by copying less data back to the host is lost to a longer kernel execution time. The exception is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into a shared output array. I think it would be safe to use atomicAdd() for this: before storing a result, the thread calls atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread stores its result using that old value as the index.
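As a rough illustration of the atomicAdd() idea, here is a minimal sketch. The `Solution` struct, the `solve()` stand-in (which pretends that only every 64th thread finds a solution), and the names `solve_and_append` and `gather_atomic` are all made up for the example and are not from the question. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One solution = 4 doubles, as described in the question.
struct Solution { double v[4]; };

// Hypothetical stand-in for the real solver: only every 64th thread
// "finds" a single solution, so results are sparse.
__device__ int solve(int tid, Solution out[4]) {
    if (tid % 64 != 0) return 0;
    out[0].v[0] = static_cast<double>(tid);
    for (int k = 1; k < 4; ++k) out[0].v[k] = 0.0;
    return 1;
}

__global__ void solve_and_append(Solution* results, unsigned int* counter,
                                 int num_threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_threads) return;

    Solution found[4];
    int n = solve(tid, found);
    for (int i = 0; i < n; ++i) {
        // atomicAdd returns the old counter value: a slot no other thread gets.
        unsigned int slot = atomicAdd(counter, 1u);
        results[slot] = found[i];
    }
}

// Runs the kernel and copies back only the slots that were actually filled.
std::vector<Solution> gather_atomic(int num_threads) {
    Solution* d_results = nullptr;
    unsigned int* d_counter = nullptr;
    // The device buffer still needs worst-case capacity (4 per thread).
    cudaMalloc(&d_results, sizeof(Solution) * 4 * num_threads);
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    int block = 256;
    int grid = (num_threads + block - 1) / block;
    solve_and_append<<<grid, block>>>(d_results, d_counter, num_threads);

    // Blocking copy on the default stream: waits for the kernel to finish.
    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_counter, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    assert(h_count <= 4u * static_cast<unsigned int>(num_threads));

    std::vector<Solution> out(h_count);
    if (h_count > 0)
        cudaMemcpy(out.data(), d_results, h_count * sizeof(Solution),
                   cudaMemcpyDeviceToHost);

    cudaFree(d_results);
    cudaFree(d_counter);
    return out;
}
```

Calling `gather_atomic(1024)` transfers only the filled slots instead of all 4096. Note that the order of the results depends on thread scheduling, so if you need to know which thread produced a solution, store the thread index alongside it.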

However, given a more common situation, where there's a fair number of results, the best solution will be to perform a compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
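A separate compaction step with thrust::copy_if could look roughly like the sketch below. It assumes the padded layout from the question (4 slots per thread plus a per-thread count); the `Solution` struct, the `solve_padded` kernel (which fakes `tid % 5` solutions per thread), the `SlotIsUsed` predicate, and `gather_compacted` are hypothetical names for illustration only. It uses the stencil overload of copy_if, with the slot index as the stencil.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>

// One solution = 4 doubles, as described in the question.
struct Solution { double v[4]; };

// Hypothetical stand-in for the real kernel: thread tid writes tid % 5
// solutions (0 to 4) into its own 4 padded slots and records the count.
__global__ void solve_padded(Solution* slots, int* counts, int num_threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_threads) return;
    int n = tid % 5;
    for (int i = 0; i < n; ++i) {
        Solution s;
        s.v[0] = static_cast<double>(tid);
        s.v[1] = static_cast<double>(i);
        s.v[2] = 0.0;
        s.v[3] = 0.0;
        slots[4 * tid + i] = s;
    }
    counts[tid] = n;
}

// Predicate on the slot index: slot j of thread t is used when j < counts[t].
struct SlotIsUsed {
    const int* counts;
    __host__ __device__ bool operator()(int slot) const {
        return (slot % 4) < counts[slot / 4];
    }
};

std::vector<Solution> gather_compacted(int num_threads) {
    thrust::device_vector<Solution> slots(4 * num_threads);
    thrust::device_vector<int> counts(num_threads);

    int block = 256;
    int grid = (num_threads + block - 1) / block;
    solve_padded<<<grid, block>>>(thrust::raw_pointer_cast(slots.data()),
                                  thrust::raw_pointer_cast(counts.data()),
                                  num_threads);

    // Worst-case sized output; only the first `n` entries end up used.
    thrust::device_vector<Solution> compact(4 * num_threads);
    SlotIsUsed pred{thrust::raw_pointer_cast(counts.data())};

    // Stencil overload: the predicate sees the slot index, not the Solution.
    auto end = thrust::copy_if(slots.begin(), slots.end(),
                               thrust::make_counting_iterator(0),
                               compact.begin(), pred);

    // Copy only the compacted prefix back to the host.
    std::vector<Solution> out(end - compact.begin());
    thrust::copy(compact.begin(), end, out.begin());
    return out;
}
```

Because copy_if is stable, the compacted solutions stay in thread order, which the atomic approach does not guarantee. The second worst-case buffer could be avoided by summing the counts first (for example with thrust::reduce) and sizing the output accordingly.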
