
Assume that we have 2^10 CUDA cores and 2^20 data points. I want a kernel that processes these points and produces true/false for each of them, so I will have 2^20 result bits. Example:

bool f(int x) { return x % 2; }

__global__ void kernel(const int* input, unsigned char* output)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;

    output[tidx] = f(input[tidx]);

    ...or...

    sharedarr[threadIdx.x] = f(input[tidx]);
    __syncthreads();
    output[blockIdx.x] = reduce(sharedarr);

    ...or...

    atomicOr(&atomic_result, f(input[tidx]) << threadIdx.x);
    __syncthreads();
    output[blockIdx.x] = atomic_result;
}

Thrust/CUDA provides algorithms such as partitioning and transformation that offer similar alternatives.

My question is, when I write the relevant CUDA kernel with a predicate that is providing the corresponding bool result,

  • should I use one byte per result and store the results directly in the output array, computing in one step and doing the reduction/partitioning in a separate step later?

  • should I compact the output in shared memory, using one byte for every 8 threads, and then write the result from shared memory to the output array at the end?

  • should I use atomic variables?

What's the best way to write such a kernel, and what is the most logical data structure to keep the results in? Is it better to use more memory and simply do more writes to global memory, rather than trying to compact the result before writing it back to the result area?

tomix86
phoad
    If you're interested in performance, it's very often the case (e.g. with Thrust) that by minimizing the amount of data traffic to global memory, you can maximize performance. If your other data use cases work equally well with one boolean result per byte vs. boolean results packed into bits, then the 8:1 reduction in global traffic that implies may be a significant performance factor. I think that performance questions like this are really difficult to judge in the absence of a real test case. I'm not sure that generalizations are better than testing and benchmarking yourself. – Robert Crovella Dec 24 '16 at 04:36

1 Answer


There is no tradeoff between speed and data size here: the __ballot() warp-voting intrinsic lets you pack the results efficiently.

Assuming that you can redefine output to be of uint32_t type, and that your block size is a multiple of the warp size (32), you can simply store the packed output using

output[tidx / warpSize] = __ballot(f(input[tidx]));

Note that this makes all threads of the warp store the result of __ballot() to the same location. Only one thread of the warp will succeed, but as their results are all identical, it does not matter which one does.
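A complete kernel along these lines might look as follows; the kernel name is illustrative, and the sketch assumes n and blockDim.x are multiples of the warp size (32). Note that on CUDA 9 and later, __ballot(pred) is deprecated in favor of __ballot_sync(mask, pred), e.g. __ballot_sync(__activemask(), pred).

__device__ bool f(int x) { return x % 2; }

__global__ void pack_kernel(const int* input, uint32_t* output, int n)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    if (tidx >= n) return;

    // Each lane of the warp evaluates one predicate; __ballot gathers the
    // 32 results into a single mask, with lane i's result in bit i.
    uint32_t mask = __ballot(f(input[tidx]));

    // All 32 lanes of the warp write the same value to the same word.
    output[tidx / warpSize] = mask;
}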

tera