
Question

Suppose multiple work-items want to append to a global stack:

__kernel void my_kernel(__global int* stack) {
    // ... do stuff ...
    push(stack, value);
    // ... do stuff ...
}

It is desirable that, after the kernel runs, `stack` contains every value pushed to it; order does not matter. What is the proper way to do this in OpenCL 1.2?

What I've tried

An obvious idea would be to use `atomic_inc` to reserve an index and write the value there:

void push(__global int* stack, int val) {
    // stack[0] holds the element count; atomic_inc returns the old value
    int idx = atomic_inc(stack) + 1;
    stack[idx] = val;
}

But I suspect that having all work-items call `atomic_inc` on the same memory position ruins the parallelism. An alternative idea would be to just write to a temporary array larger than the number of work-items:

void push(__global int* stack, int val) {
    stack[get_global_id(0)] = val;
}

That'd leave us with a sparse array of values:

[0, 0, 0, 7, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 9, 0, 0, ...]

This could then be compacted using stream compaction. I thus wonder which of those ideas is the most efficient, and whether there is a third option I'm not aware of.

MaiaVictor
  • Do you have any upper or lower bounds on, or any particular way to predict the number of values generated by each work-item or work-group? – pmdj May 24 '18 at 11:16
  • @pmdj there is an upper bound equal to 2x the number of work-items, as each will append up to 2 values. The lower bound is 0: eventually all work-items will append nothing (that is when the computation ends). I'd say the most likely case is each work-item appends 1 value. – MaiaVictor May 24 '18 at 15:26
  • @pmdj by the way, that array is an array of values-to-be-processed. Each value is processed by a work-item and produces 0-2 new values to be processed. So my plan is to run the kernel in parallel for each item to be processed, collect the new items, and call the kernel again on them, repeatedly until there is nothing to be processed. – MaiaVictor May 24 '18 at 15:36
  • @pmdj also, can I use this opportunity to ask a quick question? In order to do what I described, I need to read, on host, the length of the resulting array (so I can call the kernel with the right amount of work-items). Wouldn't the delay of reading 1 uint between every kernel call considerably influence the performance? I expect there will be hundreds of thousands of calls of the kernel per second. – MaiaVictor May 24 '18 at 15:58
  • Regarding the last question, from OpenCL 2.0 onwards, you can enqueue kernels *from a running kernel* - another term for this is dynamic parallelism. This avoids the roundtrip to the host. Obviously this only helps on platforms that support 2.0+. On other systems, the best you can probably do is to mask the latency by asynchronously submitting multiple kernel batches and submitting the next stage as soon as the result for any of them comes back. (Or possibly use a different GPGPU API that does support enqueue-from-kernel on those platforms.) A rough sketch of device-side enqueue follows after these comments. – pmdj May 24 '18 at 16:52
  • @pmdj would you by chance be (or know anyone who would be) available for OpenCL consulting? Sorry for asking here, but it seems like getting answers on OpenCL questions is hard and that is slowing my progress, but I wouldn't mind paying a professional for teaching and advising. – MaiaVictor May 24 '18 at 17:03
  • We do occasionally do consulting work in this area so feel free to get in touch. Contact details are in my profile. I should say it's not currently our primary speciality, and there are companies out there that specialise in exactly this type of work. I obviously don't know what will work out better for you. :-) I haven't worked with any of them directly, but I have heard of https://streamhpc.com/ - hope that helps! – pmdj May 24 '18 at 17:15
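
For reference, the device-side enqueue mentioned in the comment above could look roughly like the sketch below. This is an illustrative OpenCL C 2.0 fragment, not code from the question: the kernel and argument names are made up, and it assumes the host created a default on-device queue (CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT) and built the program with -cl-std=CL2.0.

// Illustrative only: a parent kernel enqueues the next stage itself,
// avoiding the round trip to the host between stages.
__kernel void parent_stage(__global int* next_items, __global int* next_count) {
    // ... process this stage, filling next_items and next_count ...

    // Assumption: next_count is already final when work-item 0 reads it.
    if (get_global_id(0) == 0 && *next_count > 0) {
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_WAIT_KERNEL,      // start after the parent finishes
                       ndrange_1D((size_t)(*next_count)),  // one child work-item per value
                       ^{
                           // child work-items would process
                           // next_items[get_global_id(0)] here
                       });
    }
}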

1 Answer


I can't give you a definite answer here, but I can make a few suggestions of things to try - if you have the resources, try to implement more than one of them and profile their performance on all the different types of OpenCL implementation you're planning to deploy on. You might find that different solutions perform differently on different hardware/software.

  1. Create a stack per work-group in local memory (filled either directly or by compacting after all values have been generated), then reserve space with a single global atomic add of the per-work-group count and copy the whole local stack into the global one. This means you only have one global atomic per work-group. It works better for large groups, of course. (A minimal sketch of this appears at the end of this answer.)
  2. Your biggest source of atomic contention in the naive approach will come from items in the same work-group. So you could create as many stacks as there are items per work-group, and have each item in the group submit to its "own" stack. You'll still need a compaction step afterwards to combine it all into one list. Vary the group size if you try this. I'm not sure to what extent current GPUs suffer from false sharing (atomics locking a whole cache line, not just that word), so you'll want to check that and/or experiment with different gaps between the stack counters in memory.
  3. Write all results to fixed offsets (based on global ID) in an array large enough to hold the worst case, and queue a separate compaction kernel that post-processes the result into a contiguous array.
  4. Don't bother with a compact representation of the result. Instead, use the sparse array as the input for the next stage of computation. Each work-group of that next stage can compact a fixed subset of the sparse array into local memory. When that's done, each work-item then works on one item of the compacted array. Iterate inside the kernel until all have been processed. How well this works will depend on how predictable the statistical distribution of the sparse items in the array is, and on your choice of work-group size and how much of the sparse array each work-group processes. This version also avoids the round trip to the host processor.
  5. On Intel IGPs specifically, I have heard that DirectX/OpenGL/Vulkan geometry shaders with variable number of outputs perform exceptionally well. If you can write your algorithm in the format of a geometry shader, this might be worth a try if you're targeting those devices. For nvidia/AMD, don't bother with this.

There are probably other options, but those should give you some ideas.
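
To make suggestion 1 concrete, here is a minimal sketch. It assumes each work-item produces at most one value, that out[0] holds the element count and is zero-initialized by the host, and that a work-group's worst case fits in local memory. The kernel name, MAX_PER_GROUP and the placeholder "work" are illustrative assumptions, not anything from the question.

// Minimal sketch of suggestion 1: per-work-group stack in local memory,
// flushed to the global stack with one atomic_add per work-group.
#define MAX_PER_GROUP 256   // assumption: worst case equals the work-group size

__kernel void collect(__global int* out,        // out[0] holds the element count
                      __global const int* in)
{
    __local int local_stack[MAX_PER_GROUP];
    __local int local_count;
    __local int global_base;

    if (get_local_id(0) == 0)
        local_count = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Push into the work-group's own stack: atomic contention stays local.
    int value = in[get_global_id(0)];            // placeholder for the real work
    if (value != 0)
        local_stack[atomic_inc(&local_count)] = value;
    barrier(CLK_LOCAL_MEM_FENCE);

    // One global atomic per work-group reserves a contiguous slice of the stack.
    if (get_local_id(0) == 0)
        global_base = atomic_add(&out[0], local_count) + 1;   // slot 0 is the length
    barrier(CLK_LOCAL_MEM_FENCE);

    // All work-items cooperate to copy the local stack into the reserved slice.
    for (int i = (int)get_local_id(0); i < local_count; i += (int)get_local_size(0))
        out[global_base + i] = local_stack[i];
}

The only contended global operation is the single atomic_add per work-group; everything else is local memory traffic.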

pmdj
  • Interesting ideas! (4) sounds especially nice, because it avoids the round trip, but then it doesn't know the size. The number of values to be processed starts low, increases a lot, then decreases to 0, ex: `1, 3, 7, 13, ..., 13111, 20037, 13240, ..., 6, 2, 0`. Suppose I call every kernel with enough work-items for the maximum number of values (very large), but I abort all work-items that have nothing to process (ex: `if (value_to_process[i] == 0) return;`). Do you expect hundreds of work-items that are aborted early to have a big impact on performance? – MaiaVictor May 24 '18 at 17:21
  • (And thanks for the many options, I'll be trying all those things on the next days.) – MaiaVictor May 24 '18 at 17:22
  • If the hundreds of work items that do nothing are adjacent (e.g. items 0-63, 128-255), and the work items that do important work are reasonably resource-intensive, then the hit on the wasted work items might not be too bad, but without trying it I don't really know if that'll be slower or faster than a round trip to the host. The worst case is if a work-group of e.g. 64 items has exactly 1 item in it that has to do work. – pmdj May 24 '18 at 17:52
  • I see, that makes a lot of sense. – MaiaVictor May 24 '18 at 17:55
  • You mentioned the order of the output is not important. Given the large "hump" in the distribution in the middle, you could try method number 4, but instead of each second-stage work group processing e.g. array indices 0..511, make each group take a "sample" across the whole of the previous stage's output. So if you have 64 groups of 64 items, global work item 0 could consider array indices 0, 4096, 8192, 12288, 16384, etc. for compaction, work item 1 would pick 1, 4097, 8193, etc. This should distribute the work more equally across work groups. – pmdj May 24 '18 at 17:58
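
To illustrate the sampled compaction described in the last comment (a variant of suggestion 4), here is a rough OpenCL C sketch. All names and sizes are assumptions: in particular, GROUP_WORST_CASE must be chosen so the local buffer can hold everything a single work-group may collect, and the actual processing of each value is left as a comment.

// Illustrative sketch: each work-group samples the previous stage's sparse
// output with a global stride, compacts it into local memory, then its
// work-items consume the compacted entries.
#define GROUP_WORST_CASE 512   // assumption: max non-zero values one group collects

__kernel void next_stage(__global const int* sparse,   // previous stage's sparse output
                         const int sparse_len,
                         __global int* out)            // out[0] holds the new count
{
    __local int compacted[GROUP_WORST_CASE];
    __local int compacted_count;

    if (get_local_id(0) == 0)
        compacted_count = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Strided sampling: with 64 groups of 64 items, global item 0 looks at
    // indices 0, 4096, 8192, ..., item 1 at 1, 4097, 8193, ..., and so on.
    for (int i = (int)get_global_id(0); i < sparse_len; i += (int)get_global_size(0)) {
        int v = sparse[i];
        if (v != 0)
            compacted[atomic_inc(&compacted_count)] = v;   // local atomic only
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item now consumes entries of the compacted local list.
    for (int i = (int)get_local_id(0); i < compacted_count; i += (int)get_local_size(0)) {
        // ... process compacted[i] here, pushing whatever new values it produces
        //     into out with whichever push strategy from the answer was chosen ...
    }
}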