OpenCL parallel buffer compaction barrier issue

Question

As a school project, we're 4 working on a parallel raytracer with OpenCL. It's our first project using OpenCL, so we might have some incomprehensions about it.

We're trying to implement parallel buffer compaction to remove finished rays, or rays that didn't collide with anything so the next iteration has less data to process. Basically, we've got a buffer of as many s_ray_states as needed for rendering, tracing them, getting collision data, compacting the buffer so there'd be only rays which collided with an object inside it, then shading them.

So we have a buffer uint *prefix_sum which contains the indices at which each s_ray_state must be moved to in the buffer s_ray_state *ray_states in order to reduce the number of rays which are sent to the shading kernel, and the next iterations of the trace/shade kernels.

Sadly, the ray_sort kernel below doesn't seem to be working right, we verified the input prefix_sum data, which is 100% correct, same for ray_states buffer, but we're getting unwanted data in the output.

We're launching a single workgroup ( global work size = local work size ), the rays are always moved in the buffer to a smaller index than their original. We've put barriers, and are using the s_ray_state *tmp buffer to prevent parallel executions to write on each-other's data, but it doesn't seem to work, even when removing the barriers we'd get the same result.

The both of us have been working on it for 4 days and have already asked for help from other students, but no-one seems to be able to figure out what's wrong. We may not be understanding barriers / mem fences enough to be sure this can in fact work.

We already tried making a single work item in a single work group sort the whole array, which works, and even gives better performance.

Is the code below supposed to be working ? With our understanding of OpenCL, it should be working, and we did a lot of research, but never really got any clear answer..

kernel void ray_sort(
    global read_only uint *prefix_sum,
    global read_write struct s_ray_state *ray_states,
    global read_only uint *ray_states_size,
    local read_write struct s_ray_state *tmp
)
{
    int l_size = get_local_size(0);
    int l_id = get_local_id(0);
    int group_id = -1;
    int group_nb = *ray_states_size / l_size;
    int state_id;

    while (++group_id < group_nb)
    {
        state_id = group_id * l_size + l_id;
        tmp[l_id] = ray_states[state_id];
        barrier(CLK_LOCAL_MEM_FENCE);
        if (did_hit(tmp[l_id]))
            ray_states[prefix_sum[state_id]] = tmp[l_id];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

ray_states length is ray_states_size

prefix_sum contains indices at which each ray_states element must be moved to

tmp is a local buffer of size local_work_size

local_work_size = global_work_size

did_hit() returns 1 if the ray hits an object, 0 otherwise

We're expecting the ray_states elements to be moved to the indices contained in prefix_sum

Example: Each ray_states[id] gets moved to prefix_sum[id] index in ray_states

prefix_sum: 0 | 0 | 1 | 1 | 2 | 3 | 3 | 3 | 4

did_hit(ray_states[id]): 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0

did_hit(output[id]): 1 | 1 | 1 | 1 | X | X | X | X | X

Xs can be anything

Hi elXor, i feel you need to edit the question to elaborate on " we're getting unwanted data in the output." -- i.e. " we're expecting (something specific) and getting (something specific)" -- so people can be laser-focused on the issue. — Leon Bambrick, Mar 21 '18 at 22:54
@LeonBambrick I feel like the output we're getting isn't significant for the question, we're just expecting the elements from ray_states to be re-organized to the indices contained in prefix_sum. We're just getting something that's not reorganized the way it is expected to ( elements are not moved to the right inidices ). — elXor, Mar 21 '18 at 23:02

mogu · Answer 1 · 2018-03-22T18:19:45.050

I could be completely off here, but it seems to me with did_hit(ray_states[state_id]) you are reading the same piece of global memory which you put into a local memory buffer tmp just 2 lines above. Which wouldn't be a problem, except you are using that buffer for both input and output.

The way i see it, what actually happens on the hardware is this:

    tmp[l_id] = ray_states[state_id];
    tmp[l_id] = ray_states[state_id];
    tmp[l_id] = ray_states[state_id];
    tmp[l_id] = ray_states[state_id];
    tmp[l_id] = ray_states[state_id];

       ... local-work-size times

    barrier(CLK_LOCAL_MEM_FENCE);

    if (did_hit(ray_states[state_id]))
        ray_states[prefix_sum[state_id]] = tmp[l_id];
    if (did_hit(ray_states[state_id]))
        ray_states[prefix_sum[state_id]] = tmp[l_id];
    if (did_hit(ray_states[state_id]))
        ray_states[prefix_sum[state_id]] = tmp[l_id];
    if (did_hit(ray_states[state_id]))
        ray_states[prefix_sum[state_id]] = tmp[l_id];

      ... again local-work-size times

Considering WItem parallel execution order is undefined (hardware may chose any order it wants), this will lead to random results. Can you try this instead:

    if (did_hit(tmp[l_id]))
        ray_states[prefix_sum[state_id]] = tmp[l_id];

BTW if ray_states_size is just a simple integer, you can pass it directly, by making the argument "uint ray_states_size". No need to fool around with buffers there.

EDIT1: my suggestion will only work if prefix_sum[state_id] doesn't have any duplicates in every local-work-size worth of ids, otherwise there will still be a data race. So e.g. if for both state_id-s 1 and 3 the prefix_sum[state_id] array has 0, and your local WG size is >= 4, there will be a data race.

Also, is there some really good reason you must use the same buffer for input and output ? it seems to me it'd be a lot less complicated if you had separate in/out buffers.

EDIT2: i just noticed you said "the rays are always moved in the buffer to a smaller index than their original" (sorry i missed it). That's good, but not enough - are they always moved to a smaller index then any other ray's index in the same local WG ? if yes, good, but there's still the other data race i mentioned.

I fixed the problem with did_hit() you mentioned ( both in implementation and in my question), sadly, still didn't change anything.. — elXor, Mar 22 '18 at 20:41
I did't knew you could pass by value to the kernel, even in some examples I've seen they were passing by pointer, will look into that, thanks. `prefix_sum` does have duplicates, but `did_hit(state_id)` can only return 1 for one of them, never more, never less. I can't use 2 different buffers as it would take far too much memory on the GPU we have at school. I tried with 2 buffers, it didn't work either tho.. — elXor, Mar 22 '18 at 20:51
Rays are always moved to a smaller index than their original, or not moved at all. They might be moved to the same index as other rays in the same local workgroup, but isn't the `tmp` local supposed to care about that ? — elXor, Mar 22 '18 at 20:52
the `tmp` local will give you a valid readable copy, so writing into `ray_states` afterwards is safe. But the question is what ends up written. If the `did_hit(tmp[l_id])` condition evaluates to true for any two different WItems of a local WG, can `prefix_sum[state_id]` for those WItems be the same ? If yes, you have a problem: what ends up stored is what the last of those two WItem writes, but you don't know which one executes last. — mogu, Mar 22 '18 at 23:52

OpenCL parallel buffer compaction barrier issue

1 Answers1