
I've got a performance-critical binary decision tree, and I'd like to focus this question on a single line of code. The code for the binary tree iterator is below, along with the results from running performance analysis against it.

        public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
        {
0.2%        ScTreeNode node = RootNodes[rootIndex].TreeNode;

24.6%       while (node.BranchData != null)
            {
0.2%            BranchNodeData b = node.BranchData;
0.5%            node = b.Child2;
12.8%           if (inputs[b.SplitInputIndex] <= b.SplitValue)
0.8%                node = b.Child1;
            }

0.4%        return node;
        }

BranchData is a field, not a property. I did this to prevent the risk of it not being inlined.

The BranchNodeData class is as follows:

public sealed class BranchNodeData
{
    /// <summary>
    /// The index of the data item in the input array on which we need to split
    /// </summary>
    internal int SplitInputIndex = 0;

    /// <summary>
    /// The value that we should split on
    /// </summary>
    internal float SplitValue = 0;

    /// <summary>
    /// The node's children
    /// </summary>
    internal ScTreeNode Child1;
    internal ScTreeNode Child2;
}

As you can see, the while loop / null check is a massive hit on performance. The tree is massive, so I would expect searching for a leaf to take a while, but I'd like to understand the disproportionate amount of time spent on that one line.

I've tried:

  • Separating the null check from the while - it's the null check that's the hit.
  • Adding a boolean field to the object and checking against that; it made no difference. It doesn't matter what's being compared, it's the comparison that's the issue (a sketch of this variant is below).
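
The boolean-field variant looked roughly like this (a reconstruction rather than the exact code; the `IsBranch` field name is invented here):

    public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
    {
        ScTreeNode node = RootNodes[rootIndex].TreeNode;

        // IsBranch is a hypothetical bool field on ScTreeNode, set to true
        // whenever BranchData is assigned, so the loop compares a bool
        // instead of a reference.
        while (node.IsBranch)
        {
            BranchNodeData b = node.BranchData;
            node = b.Child2;
            if (inputs[b.SplitInputIndex] <= b.SplitValue)
                node = b.Child1;
        }

        return node;
    }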

Is this a branch prediction issue? If so, what can I do about it? If anything?

I won't pretend to understand the CIL, but I'll post it for anyone who does, so they can try to scrape some information from it.

.method public hidebysig
instance class OptimalTreeSearch.ScTreeNode GetNodeForState (
    int32 rootIndex,
    float32[] inputs
) cil managed
{
    // Method begins at RVA 0x2dc8
    // Code size 67 (0x43)
    .maxstack 2
    .locals init (
        [0] class OptimalTreeSearch.ScTreeNode node,
        [1] class OptimalTreeSearch.BranchNodeData b
    )

    IL_0000: ldarg.0
    IL_0001: ldfld class [mscorlib]System.Collections.Generic.List`1<class OptimalTreeSearch.ScRootNode> OptimalTreeSearch.ScSearchTree::RootNodes
    IL_0006: ldarg.1
    IL_0007: callvirt instance !0 class [mscorlib]System.Collections.Generic.List`1<class OptimalTreeSearch.ScRootNode>::get_Item(int32)
    IL_000c: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.ScRootNode::TreeNode
    IL_0011: stloc.0
    IL_0012: br.s IL_0039
    // loop start (head: IL_0039)
        IL_0014: ldloc.0
        IL_0015: ldfld class OptimalTreeSearch.BranchNodeData OptimalTreeSearch.ScTreeNode::BranchData
        IL_001a: stloc.1
        IL_001b: ldloc.1
        IL_001c: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.BranchNodeData::Child2
        IL_0021: stloc.0
        IL_0022: ldarg.2
        IL_0023: ldloc.1
        IL_0024: ldfld int32 OptimalTreeSearch.BranchNodeData::SplitInputIndex
        IL_0029: ldelem.r4
        IL_002a: ldloc.1
        IL_002b: ldfld float32 OptimalTreeSearch.BranchNodeData::SplitValue
        IL_0030: bgt.un.s IL_0039

        IL_0032: ldloc.1
        IL_0033: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.BranchNodeData::Child1
        IL_0038: stloc.0

        IL_0039: ldloc.0
        IL_003a: ldfld class OptimalTreeSearch.BranchNodeData OptimalTreeSearch.ScTreeNode::BranchData
        IL_003f: brtrue.s IL_0014
    // end loop

    IL_0041: ldloc.0
    IL_0042: ret
} // end of method ScSearchTree::GetNodeForState

Edit: I decided to do a branch prediction test. I added an identical if within the while, so we have

while (node.BranchData != null)

and

if (node.BranchData != null)

inside that. I then ran performance analysis against that, and it took six times longer to execute the first comparison than it did to execute the second comparison, which always returned true. So it looks like it is indeed a branch prediction issue - and I'm guessing there's nothing I can do about it?!
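
For reference, the modified method looked roughly like this (a reconstruction, not the exact test code):

    public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
    {
        ScTreeNode node = RootNodes[rootIndex].TreeNode;

        while (node.BranchData != null)      // this comparison dominates the profile
        {
            if (node.BranchData != null)     // identical comparison, always true here
            {
                BranchNodeData b = node.BranchData;
                node = b.Child2;
                if (inputs[b.SplitInputIndex] <= b.SplitValue)
                    node = b.Child1;
            }
        }

        return node;
    }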

Another Edit

The above result would also occur if node.BranchData had to be loaded from RAM for the while check - it would then already be cached for the if statement.


This is my third question on a similar topic. This time I'm focusing on a single line of code.

  • Please show the implementation of the `BranchNode` property. Please try to replace `node.BranchData != null` with `ReferenceEquals(node.BranchData, null)`. Does it make any difference? – Daniel Hilgarth May 15 '13 at 09:21
  • Are you sure that the 24% is for the condition expression and not for the while statement that it is part of? – Rune FS May 15 '13 at 09:22
  • Another test: Try to re-write your while loop like this: `while(true) { /* current body */ if(node.BranchData == null) return node; }`. Does it change anything? – Daniel Hilgarth May 15 '13 at 09:24
  • I wasn't asking for the `BranchNodeData` class. I was asking for the `BranchNode` property. – Daniel Hilgarth May 15 '13 at 09:25
  • @Daniel I've added the BranchNodeData class. I've tried the while (true) and separated out the if. It made performance marginally worse. I'll try the ReferenceEquals and see if that changes anything. – Will Calderwood May 15 '13 at 09:26
  • If your tree is extremely sparse (i.e. the majority of nodes don't have `BranchData`) then wouldn't it be logical for the `null` check to take up most of the execution time? – Jon May 15 '13 at 09:27
  • @WillCalderwood: As I said, I wasn't asking for that class. I was asking for the property. – Daniel Hilgarth May 15 '13 at 09:27
  • @RuneFS It's explicitly not the while that's the problem; it's the null check. – Will Calderwood May 15 '13 at 09:27
  • A little optimization would be the following: `while(true) { BranchNodeData b = node.BranchData; if(ReferenceEquals(b, null)) return node; node = b.Child2; if (inputs[b.SplitInputIndex] <= b.SplitValue) node = b.Child1; }` This would retrieve `node.BranchData` only once. – Daniel Hilgarth May 15 '13 at 09:28
  • @DanielHilgarth It's not a property, it's a field to avoid the risk of it not being inlined. – Will Calderwood May 15 '13 at 09:28
  • @Jon It's a deep tree, I don't understand the disproportionate amount of time compared to everything else in the loop. – Will Calderwood May 15 '13 at 09:34
  • Please add the number of times the two lines with the biggest time consumption are executed in total. – Daniel Hilgarth May 15 '13 at 09:42
  • I'm guessing it's normal, because your code actually mainly does null checks. This entirely depends on your data (i.e. whether the value is actually null or not), but if (say) 95% of values are null, then the result wouldn't be surprising. – ken2k May 15 '13 at 09:49
  • @DanielHilgarth On my last run the null check was executed 62,372 times, and the other if was executed 32,448 times – Will Calderwood May 15 '13 at 09:52
  • @DanielHilgarth I tried your optimisation. On a few test runs the average execution time was 8233ms compared to 7205ms for the above. – Will Calderwood May 15 '13 at 09:55
  • @DanielHilgarth I just tried your other suggestion, ReferenceEquals(node.BranchData, null), which I wasn't holding out much hope for. A massive performance improvement! As I said before, the code above took 7205ms, and with ReferenceEquals it takes 1660ms. Any ideas why?! Edit: Ignore that, doing some more tests. – Will Calderwood May 15 '13 at 10:02
  • @DanielHilgarth Sorry, ignore that last comment, my mistake. The ReferenceEquals code takes 7779ms – Will Calderwood May 15 '13 at 10:07
  • http://stackoverflow.com/questions/14278595/null-pointer-test-performance Jon Skeet stated that a nullity check is a very cheap operation. Maybe a stupid point, but have you tried enabling "Optimize Code" in the build section of the project's properties? – codingadventures May 15 '13 at 10:08
  • @WillCalderwood: The number of executions perfectly matches the results: The null check takes double the total time the if takes. And it is executed twice as much. So I guess there is really nothing you can do. – Daniel Hilgarth May 15 '13 at 10:09
  • @JohnField All optimised, running in release mode outside of the debugger. – Will Calderwood May 15 '13 at 10:09
  • @DanielHilgarth Any ideas why the ifs are so slow, is this a branch prediction issue? – Will Calderwood May 15 '13 at 10:11
  • I don't think they are slow. Each if takes about 0.03 milliseconds: 8000ms total execution time * 24% for the while loop condition: The condition takes 1920 milliseconds. It is executed 62,000 times: 1920 / 62,000 = 0.03 milliseconds per single execution of an if. – Daniel Hilgarth May 15 '13 at 10:14
  • @DanielHilgarth those times were running my timed test rather than performance analysis which I run on the full application. In my timed test that line is only executed 1,560 times. So we're looking at over 1ms per execution. See my update at the bottom of the question regarding branch prediction. – Will Calderwood May 15 '13 at 10:30
  • About your latest addition: The second if has potentially been removed by the optimizer because it is always true. In any case, putting it inside the while means that it is not executed the same number of times the condition in the while loop is executed. – Daniel Hilgarth May 15 '13 at 10:32
  • @WillCalderwood: With an attached debugger huge performance drops are to be expected. – Daniel Hilgarth May 15 '13 at 10:33
  • @DanielHilgarth I don't run any of my timed tests with the debugger attached. It looks to me like Hans Passant has probably hit the nail on the head. – Will Calderwood May 15 '13 at 10:38
  • @WillCalderwood: I meant to say *profiler*, sorry. – Daniel Hilgarth May 15 '13 at 10:39
  • This can be a normal situation if this function is called only a few times (i.e. 10-20). How many times is each line of code called? The better way would be to check the real disassembly of the function (not the IL). – Nickolay Olshevsky May 15 '13 at 09:32
  • This probably should be a comment. – ken2k May 15 '13 at 09:40
  • The function was called 99,857 times in that performance test run. What's the best way to get a disassembly from the IL? – Will Calderwood May 15 '13 at 09:41
  • As far as I remember that should be possible via disassembly view – Nickolay Olshevsky May 15 '13 at 09:43
  • A little unrelated to your specific question, but might be valuable. Our experience with MASSIVE trees in .net (> 12GiB) is not a very fond one. Whenever we had full garbage collection cycles, our service would completely stall for several seconds, very often. We rewrote our tree to use Lists of structs (internally dynamic arrays), with references becoming indexes to the Lists (we used several Lists to overcome the 2GiB limit that existed on older versions of .net), where a range of bits in the index acted as a selector index to the desired sublist. This COMPLETELY removed the GC pauses. =) –  May 21 '13 at 20:24
  • @sgorozco Thanks for that. I've already gone a similar route to help with cache locality. I'm pre-creating massive arrays of structs and use indexes instead of references. Effectively the same as your list solution except I can't grow them. I thought about lists but decided I'd be fine with a fixed size and trying to grow an xGB list might be painful. I achieved a 20% speed increase going that route. – Will Calderwood May 21 '13 at 21:03
  • @WillCalderwood - You're welcome. Glad to hear about your speed increase! =) –  May 21 '13 at 21:22

3 Answers


The tree is massive

By far the most expensive thing a processor ever does is not executing instructions; it is accessing memory. The execution core of a modern CPU is many times faster than the memory bus. It's a problem related to distance: the further an electrical signal has to travel, the harder it gets to deliver that signal to the other end of the wire without it being corrupted, and the only cure for that problem is to make it go slower. That's a big problem with the wires that connect the CPU to the RAM in your machine; you can pop the case and see the wires.

Processors have a counter-measure for this problem: they use caches, buffers that store a copy of the bytes in RAM. An important one is the L1 cache, typically 16 kilobytes for data and 16 kilobytes for instructions. It is small, allowing it to be close to the execution engine. Reading bytes from the L1 cache typically takes 2 or 3 CPU cycles. Next up is the L2 cache, bigger and slower. Upscale processors also have an L3 cache, bigger and slower yet. As process technology improves, those buffers take less space and automatically become faster as they get closer to the core, a big reason why newer processors are better and how they manage to use an ever-increasing number of transistors.

Those caches are however not a perfect solution. The processor will still stall on a memory access if the data is not available in one of the caches. It cannot continue until the very slow memory bus has supplied the data. Losing a fat hundred CPU cycles is possible on a single instruction.

Tree structures are a problem; they are not cache friendly. Their nodes tend to be scattered throughout the address space. The fastest way to access memory is by reading from sequential addresses. The unit of storage for the L1 cache is a 64-byte line. In other words, once the processor reads one byte, the next 63 are very fast since they'll already be present in the cache.

That makes an array by far the most efficient data structure. It is also the reason that the .NET List<> class isn't a linked list at all; it uses an array for storage. The same goes for other collection types, like Dictionary, which is structurally not remotely similar to an array, but is internally implemented with arrays.

So your while() statement is very likely to be suffering from CPU stalls because it is dereferencing a pointer to access the BranchData field. The next statement is very cheap because the while() statement already did the heavy lifting of retrieving the value from memory. Assigning the local variable is cheap; a processor uses a buffer for writes.

This is not otherwise a simple problem to solve. Flattening your tree into arrays is very likely to be impractical, not least because you typically cannot predict in which order the nodes of the tree will be visited. A red-black tree might help; it isn't clear from the question. So a simple conclusion to draw is that it is already running as fast as you can hope for. And if you need it to go faster, then you'll need better hardware with a faster memory bus. DDR4 is going mainstream this year.
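
For what it's worth, here is a rough sketch of what such a flattened, index-based layout could look like. This is not code from the answer or the question (the type and field names are invented); the question's author does mention in the comments that a variant of this approach - arrays of structs with indexes instead of references - gave roughly a 20% speedup:

    public struct FlatNode
    {
        public int   SplitInputIndex;   // index of the input to split on
        public float SplitValue;        // threshold to split on
        public int   Child1;            // index of child 1, or -1 for a leaf
        public int   Child2;            // index of child 2, or -1 for a leaf
    }

    public sealed class FlatTree
    {
        private readonly FlatNode[] nodes;   // every node in one contiguous block

        public FlatTree(FlatNode[] nodes) { this.nodes = nodes; }

        // Walks from the node at rootIndex down to a leaf and returns its index.
        public int GetLeafIndex(int rootIndex, float[] inputs)
        {
            int i = rootIndex;
            while (nodes[i].Child1 >= 0)     // branch node?
            {
                FlatNode n = nodes[i];
                i = inputs[n.SplitInputIndex] <= n.SplitValue ? n.Child1 : n.Child2;
            }
            return i;                        // index of the leaf node
        }
    }

Whether this wins anything still depends on the access pattern; the structs are contiguous, but a deep, unpredictable walk will still miss the cache on most hops.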

Hans Passant
  • Thanks for that... A very interesting read. As my other linked question stated, I was hoping to find a better data structure. I guess I keep looking down that path. – Will Calderwood May 15 '13 at 10:39
  • I separated out the LeafData and BranchData from the node itself to save memory. I'm guessing I'd get a performance boost by incorporating the branch data into the node itself as that'll ensure it's all in the same area of memory. Is this correct? – Will Calderwood May 15 '13 at 11:00
  • Maybe. They are very likely to already be adjacent in memory, and thus in the cache, since you allocated one after the other. With the GC heap compacting algorithm otherwise having an unpredictable effect on that. Best to not let me guess at this; *measure* so you know a fact. – Hans Passant May 15 '13 at 11:21
  • Caches are tricky. Red-black trees can be cache-inefficient. Try custom-allocating your nodes so you know they are all contiguous (near each other, at least) in memory. Will only one thread be accessing the tree at once? Multiple threads accessing nearby areas of memory (and updating them) can cause cache issues too. – David May 15 '13 at 13:38
  • Also, the suggestion of another data structure may help. What is the purpose of this tree? For example, a trie can grow very large but can be extremely cache-efficient in some cases - it really depends on the sort of data and what you're trying to do with it. (My comment about RB trees is due to some research about memory allocators, where pointer lookups in a table/tree/data structure need to be as fast as possible; some allocators store pointer lookups in a trie not a red-black tree for this reason.) – David May 15 '13 at 13:41
  • @DavidM As you say, caches are tricky. Especially with a tree as each layer obviously increases in size exponentially. I'm working on a new method of machine learning for complex datasets. The tree is part of a feedback loop and continually grows and updates itself which is why it gets massive. I'm writing it to run with as many threads as the machine can provide. I'm trying to work out if I can create some kind of index to lookup the nodes - can't quite figure it out though. – Will Calderwood May 15 '13 at 14:33
  • Threads don't solve this problem. It gives you more cores, but you still have only one memory bus. – Hans Passant May 15 '13 at 14:58
  • Maybe using a B-tree would limit the height of the tree, so you would need to access fewer pointers, as each node is a single structure that can be stored efficiently in the cache. See also [this question](http://stackoverflow.com/questions/647537/b-tree-faster-than-avl-or-redblack-tree). – MatthieuBizien May 18 '13 at 22:54
  • A deep explanation with a wide range of related information, as usual. +1 – Tigran May 21 '13 at 18:59
  • +1 Great answer. Besides the memory access benefits explained by @HansPassant, for a massive tree, modelling the tree nodes as a `List` of structs (a dynamic array) also has an extremely good performance benefit regarding garbage collection. .NET optimizes such a List and stores the structs directly in the dynamic array's memory, relieving the garbage collector from the task of tracking potentially millions of node references. This had a *dramatic* performance increase in one of our services. –  May 21 '13 at 20:32
  • If you know the access pattern to the tree, and it follows the 80/20 rule (80% of the accesses are always on the same 20% of the nodes), a self-adjusting tree like a splay tree might also prove faster. http://en.wikipedia.org/wiki/Splay_tree – Jens Timmerman May 22 '13 at 12:37

To complement Hans' great answer about memory cache effects, I'll add a discussion of virtual-to-physical memory address translation and of NUMA effects.

On a computer with virtual memory (all current computers), when doing a memory access, each virtual memory address must be translated to a physical memory address. This is done by the memory management hardware using a translation table. This table is managed by the operating system for each process, and it is itself stored in RAM. For each page of virtual memory there is an entry in this translation table mapping a virtual page to a physical page. Remember Hans' discussion about memory accesses being expensive: if each virtual-to-physical translation needed a memory lookup, every memory access would cost twice as much. The solution is to have a cache for the translation table, which is called the translation lookaside buffer (TLB for short). TLBs are not large (12 to 4,096 entries), and the typical page size on the x86-64 architecture is only 4 KB, which means that there is at most 16 MB directly accessible with TLB hits (it is probably even less than that; Sandy Bridge has a TLB size of 512 entries). To reduce the number of TLB misses, you can have the operating system and the application work together to use a larger page size, like 2 MB, leading to a much larger memory space accessible with TLB hits. This page explains how to use large pages with Java, which can greatly speed up memory accesses.
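
For reference - this is not from the answer, just a sketch of what requesting large pages looks like from C# on Windows - the allocation goes through VirtualAlloc with MEM_LARGE_PAGES, and the process needs the "Lock pages in memory" privilege or the call fails:

    using System;
    using System.Runtime.InteropServices;

    static class LargePages
    {
        const uint MEM_COMMIT      = 0x00001000;
        const uint MEM_RESERVE     = 0x00002000;
        const uint MEM_LARGE_PAGES = 0x20000000;
        const uint PAGE_READWRITE  = 0x04;

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern UIntPtr GetLargePageMinimum();

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
                                          uint flAllocationType, uint flProtect);

        // Allocates 'bytes' rounded up to a multiple of the large-page size
        // (2 MB on typical x86-64 machines).
        public static IntPtr Allocate(ulong bytes)
        {
            ulong pageSize = GetLargePageMinimum().ToUInt64();
            if (pageSize == 0)
                throw new NotSupportedException("Large pages are not supported.");

            ulong size = (bytes + pageSize - 1) / pageSize * pageSize;
            IntPtr p = VirtualAlloc(IntPtr.Zero, (UIntPtr)size,
                                    MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES,
                                    PAGE_READWRITE);
            if (p == IntPtr.Zero)
                throw new InvalidOperationException(
                    "VirtualAlloc failed: " + Marshal.GetLastWin32Error());
            return p;
        }
    }

Managed objects cannot be placed in such a region, so in .NET this only helps if the data is stored in unmanaged memory (for example as a flat block of structs accessed through pointers).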

If your computer has multiple sockets, it probably has a NUMA architecture. NUMA means Non-Uniform Memory Access. In these architectures, some memory accesses cost more than others. As an example, with a 2-socket computer with 32 GB of RAM, each socket probably has 16 GB of RAM. On this example computer, local memory accesses are cheaper than accesses to the memory of another socket (remote accesses are 20 to 100% slower, maybe even more). If, on such a computer, your tree uses 20 GB of RAM, at least 4 GB of your data is on the other NUMA node, and if remote accesses are 50% slower, NUMA slows down your memory accesses by 10%. In addition, if you only have free memory on a single NUMA node, all the processes needing memory on the starved node will be allocated memory from the other node, whose accesses are more expensive. Even worse, the operating system could think it is a good idea to swap out part of the memory of the starved node, which would cause even more expensive memory accesses. This is explained in more detail in The MySQL “swap insanity” problem and the effects of the NUMA architecture, where some solutions are given for Linux (spreading memory accesses on all NUMA nodes, biting the bullet on remote NUMA accesses to avoid swapping). I can also think of allocating more RAM to a socket (24 and 8 GB instead of 16 and 16 GB) and making sure your program is scheduled on the larger NUMA node, but this needs physical access to the computer and a screwdriver ;-).
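
As a crude way of keeping the process on one NUMA node's cores from managed code, the processor affinity mask can be set at startup. This is a sketch only; the mask value below is an assumption and depends entirely on the machine's topology (check it first with a tool such as Coreinfo):

    using System;
    using System.Diagnostics;

    static class NumaPin
    {
        static void Main()
        {
            // Assume logical processors 0-7 belong to the first NUMA node.
            // 0xFF = binary 1111_1111, i.e. allow only those eight CPUs.
            Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)0xFF;
        }
    }

Memory allocated by a thread is normally taken from that thread's local node first, so pinning the process also tends to keep its allocations local.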

jfg956

This is not an answer per se but rather an emphasis on what Hans Passant wrote about delays in the memory system.

Really high-performance software - such as computer games - is not only written to implement the game itself; it is also adapted so that code and data structures make the most of the cache and memory systems, i.e. treat them as a limited resource. When I deal with cache issues I typically assume that the L1 will deliver in 3 cycles if the data is present there. If it isn't and I have to go to L2, I assume 10 cycles. For L3, 30 cycles, and for RAM memory, 100.

There is an additional memory-related action that - if you need to use it - imposes an even greater penalty, and that is a bus lock. Bus locks are called critical sections if you use the Windows NT functionality. If you use a home-grown variety you might call it a spinlock. Whatever the name, it synchronizes down to the slowest bus-mastering device in the system before the lock is in place. The slowest bus-mastering device might be a classic 32-bit PCI card connected at 33 MHz, which is one hundredth of the frequency of a typical x86 CPU (at 3.3 GHz). I assume not less than 300 cycles to complete a bus lock, but I know they can take many times that long, so if I see 3000 cycles I won't be surprised.

Novice multi-threading software developers will use bus locks all over the place and then wonder why their code is slow. The trick - as with everything that has to do with memory - is to economize on accesses.
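
Not from the answer, but to illustrate the "economize on accesses" point in C#: the sketch below pays for one bus-locking instruction per batch instead of one per item by accumulating in a local first.

    using System.Threading;

    class SharedCounter
    {
        private long total;

        // Expensive: one bus-locking instruction (lock xadd) per matching item.
        public void CountNaive(int[] items)
        {
            foreach (int item in items)
                if (item > 0)
                    Interlocked.Increment(ref total);
        }

        // Cheaper: accumulate in a local and pay for a single locked add at the end.
        public void CountBatched(int[] items)
        {
            long local = 0;
            foreach (int item in items)
                if (item > 0)
                    local++;
            Interlocked.Add(ref total, local);
        }
    }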

Olof Forshell