As a school project, we're 4 working on a parallel raytracer with OpenCL. It's our first project using OpenCL, so we might have some incomprehensions about it.
We're trying to implement parallel buffer compaction to remove finished rays, or rays that didn't collide with anything so the next iteration has less data to process.
Basically, we've got a buffer of as many s_ray_states
as needed for rendering, tracing them, getting collision data, compacting the buffer so there'd be only rays which collided with an object inside it, then shading them.
So we have a buffer uint *prefix_sum
which contains the indices at which each s_ray_state
must be moved to in the buffer s_ray_state *ray_states
in order to reduce the number of rays which are sent to the shading kernel, and the next iterations of the trace/shade kernels.
Sadly, the ray_sort
kernel below doesn't seem to be working right, we verified the input prefix_sum
data, which is 100% correct, same for ray_states
buffer, but we're getting unwanted data in the output.
We're launching a single workgroup ( global work size = local work size ), the rays are always moved in the buffer to a smaller index than their original. We've put barriers, and are using the s_ray_state *tmp
buffer to prevent parallel executions to write on each-other's data, but it doesn't seem to work, even when removing the barriers we'd get the same result.
The both of us have been working on it for 4 days and have already asked for help from other students, but no-one seems to be able to figure out what's wrong. We may not be understanding barriers / mem fences enough to be sure this can in fact work.
We already tried making a single work item in a single work group sort the whole array, which works, and even gives better performance.
Is the code below supposed to be working ? With our understanding of OpenCL, it should be working, and we did a lot of research, but never really got any clear answer..
kernel void ray_sort(
global read_only uint *prefix_sum,
global read_write struct s_ray_state *ray_states,
global read_only uint *ray_states_size,
local read_write struct s_ray_state *tmp
)
{
int l_size = get_local_size(0);
int l_id = get_local_id(0);
int group_id = -1;
int group_nb = *ray_states_size / l_size;
int state_id;
while (++group_id < group_nb)
{
state_id = group_id * l_size + l_id;
tmp[l_id] = ray_states[state_id];
barrier(CLK_LOCAL_MEM_FENCE);
if (did_hit(tmp[l_id]))
ray_states[prefix_sum[state_id]] = tmp[l_id];
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
ray_states
length is ray_states_size
prefix_sum
contains indices at which each ray_states
element must be moved to
tmp
is a local buffer of size local_work_size
local_work_size
= global_work_size
did_hit()
returns 1 if the ray hits an object, 0 otherwise
We're expecting the ray_states
elements to be moved to the indices contained in prefix_sum
Example: Each ray_states[id]
gets moved to prefix_sum[id]
index in
ray_states
prefix_sum: 0 | 0 | 1 | 1 | 2 | 3 | 3 | 3 | 4
did_hit(ray_states[id]): 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0
did_hit(output[id]): 1 | 1 | 1 | 1 | X | X | X | X | X
X
s can be anything