I have been implementing various node-based binary search trees using C-ish C++ code. When benchmarking them I have noticed surprisingly large performance variations, both across compilers and in response to small code changes.
When I focused on insertion and removal in a tree that allowed duplicates (as a C++ std::multiset<int>
would), I found that almost all the time is spent zig-zagging down the tree's left and right pointers in operations like "find" and "lower_bound" rather than the conceptually "expensive" rebalancing steps that occur after inserts and deletes.
So I began to focus on one case in particular: lower bound.
// Node is a binary tree node. It has the
// usual left and right links and an
// integral key.
struct Node {
  int key;
  Node* links[2];
};

// LowerBound returns the first node in
// the tree rooted at "x" whose key is
// not less than "key", or null if there
// is no such key.
Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    bool x_gte = !(x->key < key);
    lower = x_gte ? x : lower;
    x = x->links[!x_gte];
  }
  return lower;
}
A few points and observations:
- I am on an AMD Ryzen 9 5900X 12-Core. My understanding is that conditional move (`cmov`) instructions are faster on AMD than on Intel (my understanding was wrong, see Peter Cordes' comment on this post, but I find that when I spot-check results on my 8-year-old Intel laptop the code that is faster on AMD is faster on Intel too).
- I am running Linux. I've turned off hyperthreading and boost mode, and set the CPU scaling governor to "performance" using this script I wrote. The performance numbers are stable, with little variation.
- The code above is the end of several optimization iterations. I have a benchmark (code here) that exercises various tree sizes, allocating nodes in an array in either random or ascending-by-key order, then writes a key access pattern to another array, and runs through them repeatedly. The key access patterns are either ascending or random. In larger trees, code that uses branches, rather than `cmov` or similar, is often much slower.
- One key optimization seems to be using an array of links (`Node* links[2]`) in the node instead of explicit `left` and `right` pointers. With explicit fields gcc is very quick to switch to branchy code, which is slower. With the `links` array gcc will index it as I have written.
- In fact, when I use gcc's profile-guided optimization it still switches to branch-based code, for a 1.5x to 2x performance loss.
- In all cases, except for very tiny trees where branchy code can win, clang generates faster code for this function.
With the code above on godbolt we can see clang generating the following:
LowerBound(Node*, int):
xorl %eax, %eax
testq %rdi, %rdi
je .LBB0_3
.LBB0_1: # =>This Inner Loop Header: Depth=1
xorl %ecx, %ecx
cmpl %esi, (%rdi)
setl %cl
cmovgeq %rdi, %rax
movq 8(%rdi,%rcx,8), %rdi
testq %rdi, %rdi
jne .LBB0_1
.LBB0_3:
retq
while gcc is doing worse:
LowerBound(Node*, int):
xorl %eax, %eax
testq %rdi, %rdi
je .L5
.L4:
cmpl %esi, (%rdi)
setl %dl
cmovge %rdi, %rax
movzbl %dl, %edx
movq 8(%rdi,%rdx,8), %rdi
testq %rdi, %rdi
jne .L4
ret
.L5:
ret
The gcc variant is roughly 2x slower on my machine (the geomean of the timings for tree heights 1 to 18). Can this be explained in a simple way? I notice that clang clears `%ecx` first, then sets `%cl`, then uses `%ecx`, whereas gcc sets `%dl` and then moves it to `%edx` before using `%rdx`.

gcc's approach is logically equivalent but much slower in practice. Can it be improved?