Question
Suppose multiple work-items want to append to a global stack:
__kernel void my_kernel(__global int* stack) {
    ... do stuff ...
    push(stack, value);
    ... do stuff ...
}
After the kernel runs, stack should contain every value pushed to it. Order does not matter. What is the proper way to do this in OpenCL 1.2?
What I've tried
An obvious idea would be to use atomic_inc on a length counter to reserve the next free slot, and then write the value into that slot:
void push(__global int* stack, int val) {
    // stack[0] holds the stack length; atomic_inc returns the old length
    int idx = atomic_inc(stack) + 1;
    stack[idx] = val;
}
But I suspect that having every work-item call atomic_inc on the same memory location serializes those calls and ruins the parallelism.
A second idea would be to have each work-item write to its own slot in a temporary array with at least as many elements as there are work-items:
void push(__global int* stack, int val) {
    stack[get_global_id(0)] = val;
}
That would leave us with a sparse array of values:
[0, 0, 0, 7, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 9, 0, 0, ...]
This array could then be compacted using "stream compaction". I therefore wonder which of these ideas is more efficient, and whether there is a third option I'm not aware of.
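To make it concrete what I mean by compaction, here is a minimal sequential sketch in plain C, not an OpenCL kernel. It assumes that 0 marks an empty slot, so a pushed 0 would be lost. The names compact, dense and offsets are hypothetical and only for illustration. A parallel version would presumably replace the first loop with a parallel exclusive scan.

```c
#include <stddef.h>

/* Sequential reference for stream compaction.
   Assumption: 0 marks an empty slot, so a pushed value of 0 is dropped.
   offsets must have room for n entries, dense for every non-zero value. */
size_t compact(const int *sparse, size_t n, int *dense, size_t *offsets) {
    /* Pass 1: exclusive prefix sum over the "slot is occupied" flags.
       offsets[i] is the number of occupied slots before index i. */
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        offsets[i] = count;
        count += (sparse[i] != 0);
    }
    /* Pass 2: scatter each occupied slot to its computed offset. */
    for (size_t i = 0; i < n; ++i) {
        if (sparse[i] != 0) {
            dense[offsets[i]] = sparse[i];
        }
    }
    return count; /* number of values kept */
}
```

Applied to the sparse array above, this should leave the non-zero values packed at the front of dense, in their original relative order.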