Traversal of Bounding Volume Hierachy in Shaders

Question

I am working on a path tracer using vulkan compute shaders. I implemented a tree representing a bounding volume hierachy. The idea of the BVH is to minimize the amount of objects a ray intersection test needs to be performed on.

#1 Naive Implementation

My first implementation is very fast, it traverses the tree down to a single leaf of the BVH tree. However, the ray might intersect multiple leaves. This code then leads to some triangles not being rendered (although they should).

int box_index = -1;

for (int i = 0; i < boxes_count; i++) {
    // the first box has no parent, boxes[0].parent is set to -1
    if (boxes[i].parent == box_index) {
        if (intersect_box(boxes[i], ray)) {
            box_index = i;
        }
    }
}

if (box_index > -1) {
    uint a = boxes[box_index].ids_offset;
    uint b = a + boxes[box_index].ids_count;

    for (uint j = a; j < b; j++) {
        uint triangle_id = triangle_references[j];
        // triangle intersection code ...
    }
}

#2 Multi-Leaf Implementation

My second implementation accounts for the fact that multiple leaves might be intersected. However, this implementation is 36x slower than implementation #1 (okay, I miss some intersection tests in #1, but still...).

bool[boxes.length()] hits;
hits[0] = intersect_box(boxes[0], ray);

for (int i = 1; i < boxes_count; i++) {
    if (hits[boxes[i].parent]) {
        hits[i] = intersect_box(boxes[i], ray);
    } else {
        hits[i] = false;
    }
}

for (int i = 0; i < boxes_count; i++) {
    if (!hits[i]) {
        continue;
    }

    // only leaves have ids_offset and ids_count defined (not set to -1)
    if (boxes[i].ids_offset < 0) {
        continue;
    }

    uint a = boxes[i].ids_offset;
    uint b = a + boxes[i].ids_count;

    for (uint j = a; j < b; j++) {
        uint triangle_id = triangle_references[j];
        // triangle intersection code ...
    }
}

This performance difference drives me crazy. It seems only having a single statement like if(dynamically_modified_array[some_index]) has a huge impact on performance. I suspect that the SPIR-V or GPU compiler is no longer able to do its optimization magic? So here are my questions:

Is this indeed an optimization problem?
If yes, can I transform implementation #2 to be better optimizable? Can I somehow give optimization hints?
Is there a standard way to implement BVH tree queries in shaders?

jns · Accepted Answer · 2021-11-25T19:52:25.397

After some digging, I found a solution. Important to understand is that the BVH tree does not exclude the possibility that one needs to evaluate all leaves.

Implementation #3 below, uses hit and miss links. The boxes need to be sorted in a way that in the worst case all of them are queried in the correct order (so a single loop is enough). However, links are used to skip nodes which don't need to be evaluated. When the current node is a leaf node, the actual triangle intersections are performed.

hit link ~ which node to jump to in case of a hit (green below)
miss link ~ which node to jump to in case of a miss (red below)

Image taken from here. The associated paper and source code is also on Prof. Toshiya Hachisuka's page. The same concept is also described in this paper referenced in the slides.

#3 BVH Tree with Hit and Miss Links

I had to extend the data which is pushed to the shader with the links. Also some offline fiddling was required to store the tree correctly. At first I tried using a while loop (loop until box_index_next is -1) which resulted in a crazy slowdown again. Anyway, the following works reasonably fast:

int box_index_next = 0;

for (int box_index = 0; box_index < boxes_count; box_index++) {
    if (box_index != box_index_next) {
        continue;
    }

    bool hit = intersect_box(boxes[box_index], ray);
    bool leaf = boxes[box_index].ids_count > 0;

    if (hit) {
        box_index_next = boxes[box_index].links.x; // hit link
    } else {
        box_index_next = boxes[box_index].links.y; // miss link
    }

    if (hit && leaf) {
        uint a = boxes[box_index].ids_offset;
        uint b = a + boxes[box_index].ids_count;

        for (uint j = a; j < b; j++) {
            uint triangle_id = triangle_references[j];
            // triangle intersection code ...
        }
    }
}

This code is about 3x slower than the fast, but flawed implementation #1. This is somewhat expected, now the speed depends on the actual tree, not on the gpu optimization. Consider, for example, a degenerate case where triangles are aligned along an axis: a ray in the same direction might intersect with all triangles, then all tree leaves need to be evaluated.

Prof. Toshiya Hachisuka proposes a further optimization for such cases in his sildes (page 36 and onward): One stores multiple versions of the BVH tree, spatially sorted along x, -x, y, -y, z and -z. For traversal the correct version needs to be selected based on the ray. Then one can stop the traversal as soon as a triangle from a leaf is intersected, since all remaining nodes to be visited will be spatially behind this node (from the ray point of view).

Once the BVH tree is built, finding the links is quite straightforward (some python code below):

class NodeAABB(object):

    def __init__(self, obj_bounds, obj_ids):
        self.children = [None, None]
        self.obj_bounds = obj_bounds
        self.obj_ids = obj_ids

    def split(self):
        # split recursively and create children here
        raise NotImplementedError()

    def is_leaf(self):
        return set(self.children) == {None}

    def build_links(self, next_right_node=None):
        if not self.is_leaf():
            child1, child2 = self.children

            self.hit_node = child1
            self.miss_node = next_right_node

            child1.build_links(next_right_node=child2)
            child2.build_links(next_right_node=next_right_node)

        else:
            self.hit_node = next_right_node
            self.miss_node = self.hit_node

    def collect(self):
        # retrieve in depth first fashion for correct order
        yield self
        if not self.is_leaf():
            child1, child2 = self.children
            yield from child1.collect()
            yield from child2.collect()

After you store all AABBs in an array (which will be sent to the GPU) you can use hit_node and miss_node to look up the indices for the links and store them as well.

Can you tell me how you calculated the links.x and links.y during the tree construction? — Fox1942, May 11 '20 at 16:38
I added some sample code, hope that helps. In my case links is an ivec2. — jns, May 12 '20 at 19:26
In case of defining the next_right_node of a node, how can you determine the position in the array? I can't see where it is. — Fox1942, May 19 '20 at 15:43
I did not write that in the code above. In build_links the respective node objects are stored as reference. Once this is done put all nodes in an array with collect(), afterwards you can determine the index, e.g.: all_my_nodes.index(some_node.hit_node) — jns, May 20 '20 at 20:47

Traversal of Bounding Volume Hierachy in Shaders

1 Answers1

Linked