
While trying to learn Swift 2.2, I am facing a serious performance drop when allocating many small objects (fundamentally, a BST of 262144 elements). My current benchmark is a Java 1.8.0_74 build of an old snippet that I wrote some years ago, which, on my 2012 Retina MacBook Pro, executes in 59 seconds (59036178 microseconds). The issue I can observe via Instruments is that I get literally dozens of swift_retain and swift_release calls per iteration, and I am not sure how to avoid them:

import Foundation
import Darwin;

public class BinarySearchTree<T : Comparable> {
    private var _value : T?;

    private var _leftTree : BinarySearchTree<T>?;
    private var _rightTree : BinarySearchTree<T>?;

    public init(value : T) {
        _value = value;
    }

    var value : T? {
        get {
            return self._value;
        }
        set {
            self._value = newValue;
        }
    }

    var leftTree : BinarySearchTree<T>? {
        get {
            return self._leftTree;
        }
        set {
            self._leftTree = newValue;
        }
    }

    var rightTree : BinarySearchTree<T>? {
        get {
            return self._rightTree;
        }
        set {
            self._rightTree = newValue;
        }
    }

    public func add(newValue : T) -> BinarySearchTree<T> {
        var navigator : BinarySearchTree<T>?;
        var subtree : BinarySearchTree<T>?;

        var done : Bool?;

        done = false;
        navigator = self;

        while (!done!) {
            if (newValue < navigator?.value) {
                subtree = navigator?.leftTree;
                if (subtree != nil) {
                    navigator = subtree;
                } else {
                    let newNode = BinarySearchTree<T>(value: newValue);
                    navigator!.leftTree = newNode;
                    done = true;
                }
            } else if (newValue > navigator?.value) {
                subtree = navigator?.rightTree;
                if (subtree != nil) {
                    navigator = subtree;
                } else {
                    let newNode = BinarySearchTree<T>(value: newValue);
                    navigator?.rightTree = newNode;
                    done = true;
                }
            } else {
                done = true;
            }
        }
        return self;
    }
} /* cut remove/search methods */

And this is the code I wrote for the test run:

let count : Int32 = 262144;
let base : Int32 = 65536;
let target : Int32 = count + 1;

var info = mach_timebase_info(numer:0, denom:0);
var timebase = mach_timebase_info(&info);
let numer = UInt64(info.numer);
let denom = UInt64(info.denom);
let norm = UInt64(numer/denom);

let check1 = (mach_absolute_time() * norm);

var root = BinarySearchTree<Int32>(value:base);

for var loop in 0 ... count-1 {
    if (loop % 1000 == 0) {
        print(loop);
    }
    root = root.add(loop);
}

let check2 = (mach_absolute_time() * norm);
print("Creation phase microseconds: [" + String((check2 - check1) / 1000) + "]");

I tried searching for this specific swift_retain/swift_release issue with no luck, and I am not sure how to proceed. Thanks, everyone.

IridiumFX
  • Are you sure that the allocation is the real problem? You insert numbers in increasing order, which makes the BST degenerate to a linked list and makes the look-up slow. What happens if you insert random numbers instead? – Martin R Apr 17 '16 at 13:17
  • Hi Martin, I know it degenerates quite quickly, but that's just a test loop to check memory allocation. Worst case, a list of 262k elements would still be no match for any Mac from the last several years. I used classes instead of structs (bad, I know) in order to avoid unnecessary memory copies. Needless to say, -Ofast and a Release build. – IridiumFX Apr 17 '16 at 13:25
  • 1
    I ran your code with `root = root.add(Int32(arc4random_uniform(UInt32(count))))` instead, and it completed in about 0.5 seconds (Release configuration). The number of allocations is roughly the same! – That's why I think that the linear lookup is the problem, not the allocation. – Martin R Apr 17 '16 at 13:26
  • That's interesting indeed. Even taking into account that duplicate values would be skipped in the add method, it really moves the problem in a different direction. But can pointer/reference navigation really be the issue for such a small dataset? For the Java version it surely isn't. – IridiumFX Apr 17 '16 at 13:43
  • Did you try the same in Java, i.e. inserting 262144 increasing numbers into a BST with the same algorithm? What time did it take (compared to 262144 random numbers)? – Martin R Apr 17 '16 at 14:03
  • Yes, the swift code is a straight port from my older Java code. With the original version it took 59 seconds to complete the full 262144 inserts – IridiumFX Apr 17 '16 at 14:09
  • You claim that the Swift code is too slow, compared to Java. My question is: What time did it take to insert 262144 increasing numbers in *Java*? – Martin R Apr 17 '16 at 14:14
  • The for loop from 0 to 262144, inserting the increasing numbers, takes on average 59 seconds. Trying with Random.nextInt(count), the time goes down to 0.13 seconds. As I see from Kenneth Bruno's version, probably my understanding of Swift optionals is the issue here. – IridiumFX Apr 17 '16 at 14:24

2 Answers


The issue, as you note, is the retain/release (though it isn't really; retain/release is insignificant next to the power of... ummm, we'll get there at the end). This isn't really related to allocations. You're not allocating extra objects; you're just retaining them briefly and then releasing them. I'll start with Kenneth's code, which optimizes out a lot of performance problems in the original, but still has this issue. (I'm not considering the recursive code because it crashes in your current use case. It does dodge some redundant retains, though.)

It is worth saying that Kenneth's code is good and is generally the way you should do things (as you'll see even more as we go along).

First note: when you mentioned -Ofast, that's for ObjC, not Swift. The flag for Swift is just -O. You also want -whole-module-optimization, but that doesn't really help anything here.

One more little thing, then we'll get to it. Mark classes final any time you can. This makes sure there's no dynamic dispatch. It doesn't matter very much here compared to the retain/release, but hey, take the easy stuff.
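
For example, applied to the tree class here, that's a one-word change (everything else stays exactly as it was):

public final class BinarySearchTree<T : Comparable> {
    // ... body unchanged ...
}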

Does 30% off sound good?

OK, now a big one, and it's a trick. I find that I can knock out about 30% of the time (from ~6min to ~4min for the full import) by rewriting this:

guard let subtree = navigator.leftTree else {
    navigator.leftTree = BinarySearchTree<T>(value: newValue)
    break
}
navigator = subtree
continue

As this:

let subtree = navigator.leftTree
if subtree == nil {
    navigator.leftTree = BinarySearchTree(value: newValue)
    break
}
navigator = subtree!
continue

This is something to be very careful with. It turns out to be faster in this case, but it may not be as fast across other inputs. It may not stay as fast across changes to the optimizer (the SIL generation is a bit weird, and I suspect it may actually be a mistake because it seems to double-retain navigator in the second case, but only after the if has succeeded). But it does seem to currently be faster. (EDIT: The Swift team was surprised by this finding, and there is now a bug opened against it. Do not expect this to work in the future.)

How about 85%? How does that sound?

But like you said, couldn't we avoid all this with structs? But it'd be insanely expensive to copy the whole tree every time we touch it. Of course we could dramatically improve that with copy-on-write like Array uses. But COW is pretty complicated. If only there were a way to reuse the existing stuff. What if we use Array?

private struct Node<Element: Comparable> {
    let value: Element
    var leftIndex = -1 // Ugly, but ~25% faster than using Int? in my tests
    var rightIndex = -1
    init(_ value: Element) { self.value = value }
}

// This works exactly the same if you make it a `final class`. Your choice.
public struct BinarySearchTree<Element: Comparable> {
    private var storage: [Node<Element>] = []

    init(value: Element) { storage.append(Node(value)) }

    public mutating func add(newValue: Element) {
        if storage.isEmpty {
            storage.append(Node(newValue))
        }

        var index = 0

        while (true) {
            let node = storage[index]
            if (newValue < node.value) {
                if node.leftIndex < 0 {
                    storage.append(Node(newValue))
                    storage[index].leftIndex = storage.count - 1 // Don't use node here; remember value types!
                    break
                }
                index = node.leftIndex
                continue
            } else if (newValue > node.value) {
                if node.rightIndex < 0 {
                    storage.append(Node(newValue))
                    storage[index].rightIndex = storage.count - 1
                    break
                }
                index = node.rightIndex
                continue
            } else {
                break
            }
        }
    }
}

This takes ~45s to run on my system. Of course this makes delete a bit more complicated. You'd either have to accept "leaked" memory (possibly with periodic repacking), or you'd need to maintain a freelist. But a freelist wouldn't be too hard to add.
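
A freelist here can be as simple as an array of recycled slot indices. Roughly something like the following sketch (NodeStorage, allocate, and free are names I'm making up for illustration; they're not part of the code above):

// Sketch only: the storage side of a freelist, reusing the Node type above.
// Removing a node never shifts the array (that would invalidate every index
// stored in other nodes); the slot is just recorded for reuse.
private struct NodeStorage<Element: Comparable> {
    var nodes: [Node<Element>] = []
    var freeIndices: [Int] = []

    // Store a new node, reusing a freed slot if one is available,
    // and return the index it ended up at.
    mutating func allocate(value: Element) -> Int {
        if !freeIndices.isEmpty {
            let index = freeIndices.removeLast()
            nodes[index] = Node(value)
            return index
        }
        nodes.append(Node(value))
        return nodes.count - 1
    }

    // "Delete" a node by recycling its slot.
    mutating func free(index: Int) {
        freeIndices.append(index)
    }
}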

Let's try a 99.97% improvement with no changes to add().

And of course it's important to remember that this is a nearly pathological case for a BST. Even if you were often handed data in order, you'd be better off applying a shuffle prior to inserting it, even including the cost of the shuffle. For example, using shuffleInPlace (and counting its time), inserting exactly the same values:

var values = Array(0 ... count - 1)
values.shuffleInPlace()
for (loop, value) in values.enumerate() {
    if (loop % 1000 == 0) {
        print(loop)
    }
    root.add(value)
}
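
(shuffleInPlace isn't in the Swift 2 standard library; I'm assuming the usual Fisher-Yates extension here, along these lines:)

extension MutableCollectionType where Index == Int {
    // Fisher-Yates shuffle in place. Needs arc4random_uniform, so
    // import Darwin (or Foundation) somewhere in the file.
    mutating func shuffleInPlace() {
        if count < 2 { return }
        for i in 0..<count - 1 {
            let j = Int(arc4random_uniform(UInt32(count - i))) + i
            guard i != j else { continue }
            swap(&self[i], &self[j])
        }
    }
}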

This takes us from 45s to about 0.1s. (Kenneth's version and my "!" version are both about 0.2s under this metric; I'd probably use Kenneth's solution, with final added. Even your original code, which has a lot of inefficiencies that Kenneth fixed, only takes 0.5s with this change. Remember, the Kenneth-optimized version with in-order add was 6 minutes on my system.)

It's worth it to shuffle before inserting. If you get things over time, it could be worth it to batch them up and shuffle them before inserting. If the tree changes over time, it'd be worth checking if it's gotten too deep and rebuilding it periodically. Keeping the tree depth reasonable overwhelms every other optimization. Clever ways of working around Swift memory management couldn't touch this one change.
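
If you want a cheap way to check "has it gotten too deep," a depth probe against Kenneth's class is only a few lines (a sketch; maxDepth is my name for it, and being recursive it will itself choke on a badly degenerate tree, which is exactly the signal that it's time to rebuild):

extension BinarySearchTree2 {
    // Deepest path from this node down. For n nodes, a healthy tree is
    // near log2(n); anything approaching n means it has degenerated.
    var maxDepth: Int {
        let left = leftTree?.maxDepth ?? 0
        let right = rightTree?.maxDepth ?? 0
        return 1 + max(left, right)
    }
}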

Fix the algorithm. Everything else is peanuts in comparison.

Rob Napier

I simplified your code, removing some Optionals and your getters/setters, because they were unnecessary and could contribute to slow code.

I profiled both your code and mine and got this result on the same data set of random elements:

1000 elements:

Yours: Creation phase microseconds: [28680771]

Mine : Creation phase microseconds: [8564279]

10000 elements:

Yours: Creation phase microseconds: [426233689]

Mine : Creation phase microseconds: [126725800]

Here is my code:

public class BinarySearchTree2<T : Comparable> {
  public init(value : T) {
    self.value = value
  }

  var value : T
  var leftTree : BinarySearchTree2<T>?
  var rightTree : BinarySearchTree2<T>?

  public func add(newValue : T) -> BinarySearchTree2<T> {
    var navigator = self

    while (true) {
      if (newValue < navigator.value) {
        guard let subtree = navigator.leftTree else {
          navigator.leftTree = BinarySearchTree2<T>(value: newValue)
          break
        }
        navigator = subtree
        continue
      }
      if (newValue > navigator.value) {
        guard let subtree = navigator.rightTree else {
          navigator.rightTree = BinarySearchTree2<T>(value: newValue)
          break
        }
        navigator = subtree
        continue
      }
      break
    }
    return self
  }
} /* cut remove/search methods */

Edit:

I also did a more balanced tree test, where I created a data set of 1001 sequential elements, removed the middle element, used a Fisher-Yates shuffle to randomize the order, initialized the root with the middle element, and ran both versions. Here are my results:

Yours: Creation phase microseconds: [27648219]

Mine: Creation phase microseconds: [8332361]
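
The exact setup code isn't shown here, but as a sketch (assuming values 0...1000 with 500 as the middle element, and a shuffleInPlace extension like the one sketched in the answer above):

let middle: Int32 = 500
var values = Array(Int32(0)...Int32(1000)).filter { $0 != middle } // 1001 values minus the middle
values.shuffleInPlace()                                            // Fisher-Yates, as above

let balancedRoot = BinarySearchTree2<Int32>(value: middle)         // root seeded with the middle element
for value in values {
    balancedRoot.add(value)
}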

Edit 2:

I switched the add() method to use recursion with significant gains in speed:

Before (my original code): Creation phase microseconds: [8088804]

After : Creation phase microseconds: [1179398]

Here is the new code:

public class BinarySearchTree3<T : Comparable> {
  public init(value : T) {
    self.value = value
  }

  let value : T
  var leftTree : BinarySearchTree3<T>?
  var rightTree : BinarySearchTree3<T>?

  public func add(newValue : T) {
    if (newValue < self.value) {
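      // Optional chaining: if leftTree is nil, this call evaluates to nil (Void?),
      // so the new node is created here instead. Same trick for rightTree below.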
      if self.leftTree?.add(newValue) == nil {
        self.leftTree = BinarySearchTree3<T>(value: newValue)
      }
      return
    }
    if (newValue > self.value) {
      if self.rightTree?.add(newValue) == nil {
        self.rightTree = BinarySearchTree3<T>(value: newValue)
      }
      return
    }
  }
} /* cut remove/search methods */
  • Thanks for the code. Now it certainly gets faster; I understand that I still have much to experiment with. – IridiumFX Apr 17 '16 at 14:28
  • `Optionals` are very powerful but they do come at some cost: testing and unwrapping take up resources. Used judiciously, they can make your code much safer. The same goes for computed properties. –  Apr 17 '16 at 14:34
  • I see, and I think that other than giving a good example, your explanation will surely make life easier for anyone like me trying to jump from Java to Swift. The Swift version is still 5 times slower than the Java one, but I am sure that's because it is not really possible to transcode without embracing Swift's language peculiarities. – IridiumFX Apr 17 '16 at 14:42
  • Switching your `add()` method to recursion showed significant gains; see my new edit. –  Apr 17 '16 at 15:10
  • That's reducing the number of runs, I presume. The Java recursive version stack overflows, and BinarySearchTree3 segfaults with code 11, which I bet is the same issue. For completeness, I also observed another funny behaviour: with both my version and your BinarySearchTree2, Activity Monitor correctly gives 100% CPU time to the app, but the CPU inspector indicates that I have 4 cores at 25%. I am starting to think that the issue could be at an OS/scheduler level. – IridiumFX Apr 17 '16 at 15:20
  • I don't get a segfault with a test set size of 1000 or even 10000. In addition I'm running this in a playground and the number of times that a new `BinarySearchTree3` is created looks correct. –  Apr 17 '16 at 15:25
  • Running with the full 262144 set, I see that Java dies at 29000 while Swift survives up to 52000. – IridiumFX Apr 17 '16 at 15:42