I am working on a path tracer using vulkan compute shaders. I implemented a tree representing a bounding volume hierachy. The idea of the BVH is to minimize the amount of objects a ray intersection test needs to be performed on.
#1 Naive Implementation
My first implementation is very fast, it traverses the tree down to a single leaf of the BVH tree. However, the ray might intersect multiple leaves. This code then leads to some triangles not being rendered (although they should).
int box_index = -1;
for (int i = 0; i < boxes_count; i++) {
// the first box has no parent, boxes[0].parent is set to -1
if (boxes[i].parent == box_index) {
if (intersect_box(boxes[i], ray)) {
box_index = i;
}
}
}
if (box_index > -1) {
uint a = boxes[box_index].ids_offset;
uint b = a + boxes[box_index].ids_count;
for (uint j = a; j < b; j++) {
uint triangle_id = triangle_references[j];
// triangle intersection code ...
}
}
#2 Multi-Leaf Implementation
My second implementation accounts for the fact that multiple leaves might be intersected. However, this implementation is 36x slower than implementation #1 (okay, I miss some intersection tests in #1, but still...).
bool[boxes.length()] hits;
hits[0] = intersect_box(boxes[0], ray);
for (int i = 1; i < boxes_count; i++) {
if (hits[boxes[i].parent]) {
hits[i] = intersect_box(boxes[i], ray);
} else {
hits[i] = false;
}
}
for (int i = 0; i < boxes_count; i++) {
if (!hits[i]) {
continue;
}
// only leaves have ids_offset and ids_count defined (not set to -1)
if (boxes[i].ids_offset < 0) {
continue;
}
uint a = boxes[i].ids_offset;
uint b = a + boxes[i].ids_count;
for (uint j = a; j < b; j++) {
uint triangle_id = triangle_references[j];
// triangle intersection code ...
}
}
This performance difference drives me crazy. It seems only having a single statement like if(dynamically_modified_array[some_index])
has a huge impact on performance. I suspect that the SPIR-V or GPU compiler is no longer able to do its optimization magic? So here are my questions:
Is this indeed an optimization problem?
If yes, can I transform implementation #2 to be better optimizable? Can I somehow give optimization hints?
Is there a standard way to implement BVH tree queries in shaders?