
First of all, I assume I've missed something major in thinking about this, but I still wanted to post it to see whether I really did miss anything. On with it...

I have a pretty write-heavy binary tree (about 50/50 between writes and reads), and on the way home today I was thinking about ways to optimize it, especially to make the writes faster. This is what I came up with.

Consider that the operation add(T, x), which adds x to tree T, first performs find(T, x) to see whether x already exists; if it doesn't, find returns the parent node so that we can attach x as one of the parent's empty leaves.

What if we add a hash table as an intermediate cache for the add operation? When we call add(T, x), all that really happens is that x is hashed and inserted into the hash map M. And that's it. The optimization takes place when, somewhere else, we call find(T, x): the search of the tree reaches a leaf node, since x isn't inserted into the tree yet (it only exists in the hash map M). We then hash x and compare it against the keys in M to see whether it is supposed to be in the tree. If it's found in M, we add it to the tree and remove it from M.

This would eliminate the find(T, x) step inside add(T, x) and reduce it to add(M, x), which is O(1), and then (ab)use the find(T, x) that is performed the first time we look the node up to insert it into the tree.
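The deferred-insert idea above can be sketched roughly as follows (a minimal illustration in Python, not the asker's .NET code; the class and method names are made up for this sketch):

```python
# Sketch of the deferred-insert scheme: add(T, x) only records x in a
# hash set; the actual tree insert is deferred until the first
# find(T, x) walks to a leaf and misses.

class Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


class CachedBST:
    def __init__(self):
        self.root = None
        self.pending = set()  # the hash-map cache M of not-yet-inserted keys

    def add(self, key):
        # O(1): just remember the key; no tree traversal yet.
        self.pending.add(key)

    def find(self, key):
        # Walk the tree as usual.
        node = self.root
        while node is not None:
            if key == node.key:
                return True
            node = node.left if key < node.key else node.right
        # Miss in the tree: check the cache M and promote on a hit.
        if key in self.pending:
            self.pending.discard(key)
            self._insert(key)
            return True
        return False

    def _insert(self, key):
        # Ordinary BST insert, only paid on the first lookup of the key.
        if self.root is None:
            self.root = Node(key)
            return
        node = self.root
        while True:
            if key < node.key:
                if node.left is None:
                    node.left = Node(key)
                    return
                node = node.left
            else:
                if node.right is None:
                    node.right = Node(key)
                    return
                node = node.right
```

Note that every read still pays the full tree walk; the cache only moves the walk from add time to the first find.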

thr
  • This will require you to maintain 2 structures, and apply a sufficient hashing algorithm to avoid collisions in large structures. Have you had a look at this additional overhead? – Adriaan Stander Dec 07 '09 at 20:00
  • @astander: no I have not, which is why I'm asking here to see if it has any obvious flaws, and I could just use the built-in hash table of my platform of choice (.NET), which should be good enough. And it's not hard to detect a collision (if I try to insert a value whose hash already exists in M, I compare the two values, and if they are not equal I do a normal binary insert for the second, colliding element) – thr Dec 07 '09 at 20:04
  • 1
    Also, you probably want to check the complexity of N adds and N finds (assuming 50/50) rather than a single add or find. I think you are delaying the work rather than optimizing the work. – Kathy Van Stone Dec 07 '09 at 20:06
  • @Kathy: maybe, but since the compare operation on the keys (strings in this case) is quite heavy (and I could use a very simple and fast hash algorithm, because collisions don't matter), I have a feeling that I could save quite a bit, especially on larger trees. – thr Dec 07 '09 at 20:10
  • Beware of "collisions don't matter". With the algorithm you describe, your hashtable is not allowed to lose bindings, so you cannot use fixed-length buckets as are traditionally used in caches. You must have extensible buckets that degenerate into O(n) association lists when there are many collisions. – Pascal Cuoq Dec 07 '09 at 20:18
  • You could try using a counting bloom filter to determine if something is in the set. That supports additions, removals, and constant space. – Martin Dec 07 '09 at 20:54
  • A bloom filter is also O(1) for add, remove and find – Martin Dec 07 '09 at 20:54
  • After implementing normal binary trees (I don't have time to change the RB-tree implementation tonight), one of which is a default implementation and one of which is the cached version that uses a HashSet (.NET) to cache elements, my results show about a ~20% speedup on 50/50 reads/writes. – thr Dec 07 '09 at 21:10
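Martin's comments suggest a counting Bloom filter as the cache's membership structure, since it supports add, remove, and find in O(1) with constant space. A minimal sketch of that idea (the sizes and hash scheme here are illustrative choices, not anything from the thread):

```python
# Minimal counting Bloom filter sketch: each of k hash positions holds
# a counter instead of a bit, so removals are supported. Membership
# tests may return false positives, never false negatives.

import hashlib


class CountingBloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.k = hashes
        self.counts = [0] * size

    def _indexes(self, item):
        # Derive k indexes from one digest via double hashing.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.counts[i] += 1

    def remove(self, item):
        # Only safe for items that were previously added.
        for i in self._indexes(item):
            if self.counts[i] > 0:
                self.counts[i] -= 1

    def __contains__(self, item):
        return all(self.counts[i] > 0 for i in self._indexes(item))
```

Note the trade-off versus the hash map M in the question: the filter cannot store the keys themselves, so a positive answer still requires somewhere to fetch the actual key from before inserting it into the tree.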

2 Answers


Why not use a hashtable for everything and omit the binary tree entirely?

It all depends why you were using binary trees in the first place. If you chose binary trees to enhance sharing, you lose with the hashtable caches because hashtables are not shared.

The caches do not make comparing two maps any easier either.

EDIT:

If the operations that take advantage of the specificities of trees are rare (you mention taking advantage of the fact that RB trees are sorted) and if, on the other hand, you often look up a key that has recently been added, or replace the value of a key that has recently been added, a small-size cache implemented with another structure may make sense. You can also consider using a hashtable representation with the occasional conversion to a tree.

The additional complexity of this cache layer may mean that you do not gain any time in practice though, or not enough to repay the technical debt of having an ad-hoc data structure like this.

Pascal Cuoq
  • Well I use the binary tree because it's sorted. The tree (RB-tree in this case) is used as a sorted index that is updated frequently. – thr Dec 07 '09 at 20:02

If what you need is a structure with O(1) inserts and approximately O(n) amortized ordered iteration, I had the same problem:

Key-ordered dict in Python

The answer (keeping a hash and a partially sorted list, and using a partially-sorted-structure-friendly sort like TimSort) worked very well in practice in my case.
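The hash-plus-partially-sorted-list approach can be sketched like this (a rough Python illustration under my own naming, not LeMiz's actual code; Python's built-in sort is TimSort, which is cheap on mostly-sorted input):

```python
# Sketch: O(1) inserts into a dict plus an append-only key list;
# ordered iteration re-sorts the list lazily, and TimSort exploits the
# fact that only the recently appended suffix is out of order.

class KeyOrderedDict:
    def __init__(self):
        self._data = {}
        self._keys = []     # sorted prefix + unsorted recent appends
        self._dirty = False

    def __setitem__(self, key, value):
        if key not in self._data:
            self._keys.append(key)  # O(1) amortized
            self._dirty = True
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]      # O(1)

    def ordered_items(self):
        if self._dirty:
            self._keys.sort()       # TimSort: fast on partially sorted data
            self._dirty = False
        return [(k, self._data[k]) for k in self._keys]
```

The amortized cost of ordered iteration depends on how many inserts happened since the last iteration, which matches the "partially-sorted-structure-friendly sort" point above.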

LeMiz