Assume that we have 2^10 CUDA cores and 2^20 data points. I want a kernel that processes these points and produces true/false for each of them, so I will end up with 2^20 bits. Example:
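bool f(int x) { return x % 2 != 0; }
void kernel(int* input, byte* output)
{
    tidx = thread.x ...
    // Option 1: one byte per result, written straight to the output array
    output[tidx] = f(input[tidx]);
    // ...or... Option 2: stage the results in shared memory, then reduce per block
    sharedarr[tidx] = f(input[tidx]);
    sync()
    output[blockidx] = reduce(sharedarr);
    // ...or... Option 3: OR each result into an atomic bit mask
    atomic_result |= f(input[tidx]) << tidx;
    sync(..)
    output[blockidx] = atomic_result;
}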
Thrust/CUDA has algorithms such as "partitioning" and "transformation" that provide similar alternatives.
My question is: when I write the relevant CUDA kernel with a predicate that provides the corresponding bool result,
should I use one byte for each result and store it directly in the output array, performing one step for the calculation and another step for reduction/partitioning later?
should I compact the output in shared memory, using one byte per 8 threads, and then write the result from shared memory to the output array at the end?
should I use atomic variables?
What is the best way to write such a kernel, and what is the most logical data structure for keeping the results? Is it better to use more memory and simply do more writes to main memory than to compact the results before writing them back to the result memory area?
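To make the compaction idea concrete, here is a minimal sketch of one way it might look. Note that it swaps in a different mechanism from the one described above: instead of packing 8 results per byte through shared memory, each warp packs 32 results into one 32-bit word with the warp vote intrinsic `__ballot_sync`. The names `pack_predicate` and `pack_on_device`, the block size of 256, and the `uint32_t` output layout are assumptions for illustration only, and CUDA error checking is left out for brevity.

```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Predicate from the question: true for odd values.
__host__ __device__ inline bool f(int x) { return (x % 2) != 0; }

// Each warp packs 32 predicate results into one 32-bit word.
// Assumes a 1D launch whose blockDim.x is a multiple of 32 (hypothetical setup).
__global__ void pack_predicate(const int* input, uint32_t* output, int n)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads vote false but still take part in the ballot.
    bool p = (tidx < n) && f(input[tidx]);
    // Bit k of `word` holds the result of lane k of this warp.
    unsigned word = __ballot_sync(0xffffffffu, p);
    // Lane 0 writes the packed word for the whole warp.
    if ((threadIdx.x & 31) == 0 && tidx < n)
        output[tidx / 32] = word;
}

// Hypothetical host helper: copies the input over, runs the kernel,
// and returns the packed bits (ceil(n / 32) words).
std::vector<uint32_t> pack_on_device(const std::vector<int>& h_in)
{
    int n = static_cast<int>(h_in.size());
    int words = (n + 31) / 32;
    std::vector<uint32_t> h_out(words, 0);
    if (n == 0) return h_out;

    int* d_in = nullptr;
    uint32_t* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, words * sizeof(uint32_t));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;               // assumed block size
    assert(threads % 32 == 0);             // the kernel relies on whole warps
    int blocks = (n + threads - 1) / threads;
    pack_predicate<<<blocks, threads>>>(d_in, d_out, n);

    // This copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, words * sizeof(uint32_t),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With this layout the 2^20 results would occupy 2^15 words (128 KiB) instead of 1 MiB at one byte per result, and no shared memory or atomics are involved. Whether that actually beats the plain one-byte-per-result version is exactly what I am unsure about.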