Micro optimisations iterating through a tree in C#

Question

I'm working on a massive number crunching project. I've been optimising everything since the start as I knew it would be important. Doing performance analysis my code spends almost 40% of it's life in one function - the binary tree iterator.

        public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
        {
0.2%        ScTreeNode node = RootNodes[rootIndex].TreeNode;

24.6%       while (node.BranchData != null)
            {
0.2%            BranchNodeData b = node.BranchData;
0.5%            node = b.Child2;
12.8%           if (inputs[b.SplitInputIndex] <= b.SplitValue)
0.8%                node = b.Child1;
            }

0.4%        return node;
        }

Do any C# optimisation experts have any tips for optimising this further? All comparisons are floats. I know that in theory it shouldn't matter, but I'm using fields rather than properties so ensure optimisation. A small saving here could shave days off the process.

Please no replies saying "These optimisations don't matter in the real world" - because in this instance they do. :-)

Edit: I've updated the code to what I've got now following the comments below, and added in the performance analysis output for each line of code. As you can see, the main killer is the null check - why? I tried using a boolean flag IsLeaf on the node instead of the null check, but it was a equal performance hit on that line.

The code for branch node object is as follows:

public sealed class BranchNodeData
{
    /// <summary>
    /// The index of the data item in the input array on which we need to split
    /// </summary>
    internal int SplitInputIndex = 0;

    /// <summary>
    /// The value that we should split on
    /// </summary>
    internal float SplitValue = 0;

    /// <summary>
    /// The nodes children
    /// </summary>
    internal ScTreeNode Child1;
    internal ScTreeNode Child2;
}

Another Edit: Yet more thinking here... I was wondering why the line

BranchNodeData b = node.BranchData;

was registering 0.2% of execution and the null comparison line was registering 17.7%. I'm guessing this is a branch prediction fail? While that comparison being hit multiple times, and almost always returning true, it makes it very hard for CPU to predict when it's going to return false. I'm not very clued up on the low level workings of a CPU, but is this likely be the case?

One idea is to prefetch the states (into a dictionary) and update them when data changes.In this way, when you will need to get the state for a node the operation will be done in O(1) — Bogdan M., May 07 '13 at 09:57
Just thoughts from a C++ perspective... I don't know how much is possible in C#, but when you walk through a tree you are already losing cache locality. And if all your objects are scattered through memory, locality is gone. If your tree is dense, consider storing it as an array or use a memory pool (if you can do that in C#). You'll also be a victim of branch prediction here. Not sure if you can do much about that. — paddy, May 07 '13 at 10:00
One _minor_ thing might be to store the `node.BranchData` in a temporary variable rather than load the field three times per while-loop iteration. — Chris Sinclair, May 07 '13 at 10:01
Try replacing the `if` with `node = node.BranchData.Children[(1 + Math.Sign(inputs[node.BranchData.SplitInputIndex] - node.BranchData.SplitValue))/2];` — Sergey Kalinichenko, May 07 '13 at 10:01
Thanks for the feedback. I've tried the two easy changes. My code: 16.9m iterations a second. Chris Sinclair: 17.0m p/s. dasblinkenlight: 13.5m p/s. A possible slight improvement with Chris's suggestion. I've not compared the IL yet though. I'm not 100% sure what you're saying Bogdan. Paddy, cache locality is something I've never considered before. I'll have a think, but it sounds tricky to maintain with the tree structure, which is something I require. — Will Calderwood, May 07 '13 at 11:08
@WillCalderwood Since this is a binary tree (I think) would it be possible to ditch the `Children` collection and just have direct references to `Child1` and `Child2` and avoid the collection lookup? — Chris Sinclair, May 10 '13 at 18:22
I'll have a play Chris. As it's a constant index within an array, I'm guessing it might be optimised out anyway. — Will Calderwood, May 11 '13 at 15:03
Chris, I did what you said. It did indeed yield a 15% speed improvement. Thanks. — Will Calderwood, May 14 '13 at 17:02
I realise this is probably an annoying question, but I have to ask: You *are* timing the release build run without a debugger attached, are you? Otherwise timings can be way off. — Matthew Watson, May 14 '13 at 17:54
Whilst many (including myself) quote "Premature optimization is the root of all evil", The full quote states that 97% of the time we shouldn't be concerned with it, but we *should be mindful of the critical 3%". If a small win here shaves days off your running-time, I think it's safe to say this falls in to that 3% :) — Moo-Juice, May 14 '13 at 18:59
What is your BranchNodeData? A value type or a reference type? Is it implementing IEquatable? Is it overriding Equals or GetHashCode? — giacomelli, May 14 '13 at 19:02
BranchNodeData is a reference type. Not implementing or overriding IEquatable, Equals or GetHashCode. I'll add the code for it into the question. — Will Calderwood, May 14 '13 at 19:04

score 3 · Answer 1 · answered May 14 '13 at 22:17

3

Just some code rewrite. It might help because it avoids at least two jumps.

public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
{

    ScTreeNode node = RootNodes[rootIndex].TreeNode;

    while (node.BranchData != null)
    {
        BranchNodeData b = node.BranchData;
        node = b.Child2;
        if (inputs[b.SplitInputIndex] <= b.SplitValue))
            node = b.Child1;
    }

    return node;

}

answered May 14 '13 at 22:17

Zoran Horvat

10,924
3
31
43

Thanks for that. I ran my latest performance test. Search for every leaf in a 1GB tree 100 times. My code took 5190ms, yours took 4827ms. It's small, but progress. :-) – Will Calderwood May 14 '13 at 23:04

score 0 · Answer 2 · answered May 14 '13 at 22:31

BranchNodeData looks like a reference type. its only 0.2% of your runtime because its just making a pointer to the data that already exists, not actually copying or assigning anything.

You're probably getting such a hit on the null check because the CLR is having to do a cast in order to check the sealed class you've pasted in. Checking for nullity there isn't necessarily what you're after. There are a ton of ways to modify that class to give you a boolean to check against that wouldn't require as much computing power. I'd honestly go the route of having that be something that your ScTreeNode class can provide.

I guess you didn't read this bit " I tried using a boolean flag IsLeaf on the node instead of the null check, but it was a equal performance hit on that line." ;-) — Will Calderwood, May 14 '13 at 22:35

score 0 · Answer 3 · answered May 22 '13 at 07:58

Given the points made in the other answer about caching, but not in relation to the null check, try ordering the references to the BranchNodeData fields so that the first reference allows all of the following fields to be loaded into the cache.

That is, I assume the Jitter, or the CPU, is not smart enough to load "backwards" to cache SplitInputIndex, SplitValue and Child1 when Child2 is referenced first in the current code.

So either change the order of the fields in the BranchNodeData class, or change the set; if ... overwrite; to an if ... else.

Micro optimisations iterating through a tree in C#

3 Answers3

Linked