
I've always been concerned about the space usage of radix trees, but I haven't found any helpful discussion of it.

Now let's say we have a radix tree implementation like Linux's radix-tree.c, which takes an integer and uses every 6 bits to index into the next level of the tree. I can easily think of cases where the radix tree's space usage is far more than a binary search tree's. Please correct me if I'm wrong:

Use case: (0,1,1,1,1), (1,1,1,1,1), (2,1,1,1,1), ..., (63,1,1,1,1).

Here, just for convenience, I use (a,b,c,d,e) to represent a 30-bit integer key, with each element standing for a 6-bit value; a is the MSB and e is the LSB.
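To make that concrete, here is a minimal sketch of the indexing scheme I have in mind (names and node layout are made up for illustration; this is not the actual radix-tree.c code):

    #include <stddef.h>

    #define BITS_PER_LEVEL 6
    #define SLOTS  (1 << BITS_PER_LEVEL)   /* 64 children per node */
    #define HEIGHT 5                       /* 30-bit key, 6 bits per level */

    struct node {
        void *slots[SLOTS];   /* child pointers; in a leaf, value pointers */
    };

    /* Walk the tree, consuming the key 6 bits at a time, chunk a first. */
    static void *lookup(struct node *root, unsigned key)
    {
        struct node *n = root;
        for (int level = HEIGHT - 1; n != NULL && level > 0; level--)
            n = n->slots[(key >> (level * BITS_PER_LEVEL)) & (SLOTS - 1)];
        return n != NULL ? n->slots[key & (SLOTS - 1)] : NULL;
    }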

Radix Tree:

For this use case, the radix tree will have a height of 5, and each key takes 4 nodes of its own, because the keys diverge at the root and land in 64 different subtrees. So there will be (5 - 1) * 64 + 1 = 257 nodes.

Each node contains 2^6 = 64 pointers, so the tree is going to use 257 * 64 * 4 bytes = 65,792 bytes, about 64 KB.
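Spelling the arithmetic out (under my assumptions of 4-byte pointers and one 64-slot pointer array per node):

    nodes    = 1 root + 64 keys * 4 nodes/key = 257
    pointers = 257 * 64                       = 16,448
    bytes    = 16,448 * 4                     = 65,792 (about 64 KB)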

Binary Search Tree:

We only care about how many keys there are. In this case there are 64 keys.

Assume each BST node uses 3 pointers; then it is going to use 64 * 3 * 4 bytes = 768 bytes.
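For concreteness, the node layout I'm assuming (left, right, and parent pointers; key and value fields are ignored here, just as they were for the radix tree):

    struct bst_node {
        struct bst_node *left;     /* 4 bytes each with 32-bit pointers */
        struct bst_node *right;
        struct bst_node *parent;
    };
    /* 64 nodes * 3 pointers * 4 bytes = 768 bytes */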

Comparison:

It looks like the radix tree is very space inefficient: it uses about 85 times the space of the binary search tree (65,792 / 768 ≈ 86) for the same set of keys! I don't understand why it is used even in the Linux kernel.

Am I missing something? Thanks.

Jun

3 Answers


You asked for space complexity, so let's work it out.

If we consider a non-null pointer at the leaf to be a value of interest, then it is not hard to prove by contradiction that the worst case is a fully populated tree with one value per leaf node.

If branching is N-way (in your use case 64) and height is H (in your use case 5), there are N^(H-1) leaf nodes in this tree, storing an equal number of values. The total number of nodes is

1 + N + N^2 + ... + N^(H-1) = (N^H - 1) / (N - 1)

So the storage requirement measured in pointers is N times this amount:

(N^H - 1) * [N / (N - 1)]

This yields a storage efficiency of

(N^H - 1) * [N / (N - 1)]
-------------------------
        N^(H-1)

This is the total number of pointers divided by the count of valid data pointers.

As N gets bigger, this approaches N. In your example use case (N = 64, H = 5), it works out to about 65.02. So we can say the storage complexity is O(NV), where V is the number of data values to be stored.
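As a quick sanity check, this throwaway snippet just evaluates the ratio derived above (nothing more than the formula in code):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double N = 64.0, H = 5.0;
        double pointers = (pow(N, H) - 1.0) * N / (N - 1.0);  /* total pointers */
        double values   = pow(N, H - 1.0);                    /* one per leaf  */
        printf("pointers per value = %.2f\n", pointers / values);  /* ~65.02 */
        return 0;
    }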

Though we got here with first-principles analysis, it makes total sense. Storage for the leaf level of the complete tree dominates the rest by a factor of nearly N. The size of that storage is NV.

Of course the advantage of trees with enormous branching factors like this (and e.g. B-trees in databases) is that fewer node traversals are needed to get to the right leaf.

Moreover, when each traversal is a single array lookup as in the radix tree, you can't get much faster.

In your use case, a perfectly balanced binary search tree would require up to 30 comparisons, with the attendant branches flushing the pipeline. Compared to 5 array indexing operations, this could be much slower. Array indexing tends to be faster than comparison because it's non-branching code. But even if the two cost the same, a binary tree would need only 2^5 = 32 elements to cause the same amount of indexing work as a radix tree containing 2^30 elements.
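For contrast, here is what the per-level work looks like on the BST side (a sketch with made-up names, not any particular implementation); each level costs a comparison and a data-dependent branch, whereas each radix tree level is a plain array index:

    #include <stddef.h>

    struct bt_node {
        unsigned key;
        struct bt_node *left, *right;
    };

    /* Up to 30 hard-to-predict branches for 2^30 keys. */
    static struct bt_node *bt_find(struct bt_node *n, unsigned key)
    {
        while (n != NULL && n->key != key)
            n = (key < n->key) ? n->left : n->right;
        return n;
    }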

To generalize this, a binary tree of 2^H elements requires the same lookup effort as a radix tree capable of holding N^(H-1) elements, if key comparisons and array indexing operations have the same cost.

As others have said, if the index bits in the top levels of the tree tend toward a few common prefixes (e.g. because they are the top bits of addresses in the same VM space), the worst-case storage behavior of the radix tree does not occur.

Gene
  • Space complexity should be a function with the number of keys as a variable. In your case I understand you calculated the total number of nodes as ((N^H - 1) * [N / (N - 1)]) / N^(H-1), but N, being the branching factor of the radix tree, should be a given constant. What matters is the variable V in your comment, right? I don't understand how you drew the conclusion that space complexity is proportional to V. Can you clarify a little bit? – Jun Dec 18 '13 at 06:04
  • Complexity is expressed in whatever relevant independent variables exist in the problem. If you choose to make N a constant, then big-O erases it from the problem and you have O(V), which is not helpful. Dependence on V is explained above. The worst case has one value per leaf node. O(NV) says that in this worst case you'll use about N times as much storage for the index as for the values (non-null pointers) themselves. The more precise expression is "approaching NV as N gets larger." There is no hidden constant factor. – Gene Dec 18 '13 at 13:54

Radix trees are used a lot for holding long strings with common/shared prefixes. In that case the radix tree will be much more economical.

For the sort of data you're specifying, it's a different story.

Edit

A good example of long strings with shared prefixes is storing all file names, with full paths, on your computer. With such data a radix tree will be more economical than the alternatives and very fast at finding whether a file name exists or not. It might even be faster in some cases than a hash table.

Look at these 2 files:

  • "c:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\include\streambuf"
  • "c:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\include\string"

Their shared prefix is: "c:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\include\str", which is stored only once.
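As a rough illustration (a made-up helper, not code from any radix tree library), this measures how much of the two paths the tree gets to share:

    #include <stdio.h>

    /* Length of the longest common prefix of two strings. */
    static unsigned common_prefix(const char *a, const char *b)
    {
        unsigned i = 0;
        while (a[i] != '\0' && a[i] == b[i])
            i++;
        return i;
    }

    int main(void)
    {
        const char *p1 = "c:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\include\\streambuf";
        const char *p2 = "c:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\include\\string";
        /* only the divergent suffixes ("eambuf" and "ing") need their own nodes */
        printf("shared prefix: %u bytes\n", common_prefix(p1, p2));
        return 0;
    }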

egur
  • Arbitrary strings make no difference in this case. You can convert the use case I posted into "aaaaa", "bbbbb", ... "zzzzz", and it will still take much more space than a BST. But I agree that if the use cases share a lot of prefixes it will get much better. – Jun Dec 13 '13 at 21:34
  • See updated answer with example. BTW, "aaaaa" and "bbbbb" don't have a common prefix. – egur Dec 13 '13 at 22:10
  • I understand what you are saying. But in general we cannot ensure that the keys share common prefixes. I'm talking about arbitrary cases, where a key can be any value, uniformly distributed among all possible values. – Jun Dec 13 '13 at 23:12
  • @Jun - then use a different container like map/set or unordered_map/set – egur Dec 14 '13 at 09:36

The radix tree in Linux originally appeared as a data structure to support the page cache, where such distributions of keys (file offsets) are uncommon.

(FWIW, the initial variant used a splay tree, but Linus said no :)

The radix tree is wide and shallow, so a lookup in it accesses comparatively few different cache lines, which is, obviously, quite beneficial for performance.

It also has the property that locality in page cache accesses means locality in radix tree node accesses, something alternative designs such as hash tables do not provide.

chill
  • So before using a radix tree we always need to presume that the keys will mostly share the same prefix? It sounds like the radix tree is not really a general-purpose container. – Jun Dec 17 '13 at 07:16
  • @Jun, well, like any data structure, radix trees have their advantages and drawbacks. There's no universally applicable "best container"; depending on concrete needs, one or another may be most suitable. This is not unusual: radix or counting sort may not be the "best" sorting method either, yet be best for a concrete usage. Or in the case of on-disk storage, B-trees are generally considered preferable to binary trees. Or even for in-memory containers, check https://code.google.com/p/cpp-btree/ . – chill Dec 17 '13 at 07:31