What factors should I take into account when choosing between a hash table and a balanced binary tree to implement a set or an associative array?
- https://stackoverflow.com/questions/4128546/advantages-of-binary-search-trees-over-hash-tables – Ciro Santilli OurBigBook.com Sep 25 '17 at 20:57
11 Answers
This question cannot be answered, in general, I fear.
The issue is that there are many kinds of hash tables and balanced binary trees, and their performance characteristics vary widely.
So, the naive answer is: it depends on the functionality you need. Use a hash table if you do not need ordering and a balanced binary tree otherwise.
For a more elaborate answer, let's consider some alternatives.
Hash Table (see Wikipedia's entry for some basics)
- Not all hash tables use a linked list as a bucket. A popular alternative is to use a "better" bucket, for example a binary tree, or another hash table (with another hash function), and so on.
- Some hash tables do not use buckets at all: see open addressing (it comes with its own issues, obviously).
- There is something called linear (incremental) re-hashing (a quality-of-implementation detail), which avoids the "stop-the-world-and-rehash" pitfall. Basically, during the migration phase you only insert into the "new" table, and you also move one "old" entry into the "new" table per insertion. Of course, the migration phase means double look-ups, etc. A sketch of the idea follows this list.
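As a rough illustration of that migration phase (the class name, sizing policy, and migration rate below are illustrative assumptions, not any particular library's design), an incrementally re-hashing chained table might look like this in Python:

    class IncrementalHashTable:
        """Toy chained hash table that migrates one old bucket per insert."""

        def __init__(self, capacity=8):
            self.old = None                           # table being drained, or None
            self.new = [[] for _ in range(capacity)]
            self.next_bucket = 0                      # next old bucket to migrate
            self.count = 0

        def _bucket(self, table, key):
            return table[hash(key) % len(table)]

        def insert(self, key, value):
            # Start a migration instead of a stop-the-world rehash.
            if self.old is None and self.count > 2 * len(self.new):
                self.old, self.new = self.new, [[] for _ in range(4 * len(self.new))]
                self.next_bucket = 0
            self._bucket(self.new, key).append((key, value))  # insert only into "new"
            self.count += 1
            if self.old is not None:                  # move one "old" bucket over
                for k, v in self.old[self.next_bucket]:
                    self._bucket(self.new, k).append((k, v))
                self.old[self.next_bucket] = []
                self.next_bucket += 1
                if self.next_bucket == len(self.old):
                    self.old = None                   # migration finished

        def lookup(self, key):
            tables = (self.new,) if self.old is None else (self.new, self.old)
            for table in tables:                      # the double look-up during migration
                for k, v in self._bucket(table, key):
                    if k == key:
                        return v
            raise KeyError(key)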
Binary Tree
- Re-balancing is costly; you may consider a skip list (also better for multi-threaded access) or a splay tree.
- A good allocator can "pack" nodes together in memory (better caching behavior), even though this does not alleviate the pointer-chasing issue.
- B-trees and their variants also offer "packing".
Let's not forget that O(1) is an asymptotic complexity. For few elements, the constant factor usually matters more, performance-wise; this is especially true if your hash function is slow. A quick, unscientific way to get a feel for this is sketched below.
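For instance, in Python one could time a dict lookup against a plain linear scan over a handful of entries (the numbers vary wildly by machine and runtime; the point is only that at tiny sizes the asymptotic gap need not dominate):

    import timeit

    small = {str(i): i for i in range(8)}
    pairs = list(small.items())

    def scan(key):
        # O(n) linear scan -- asymptotically "worse", but with a small
        # constant factor and only 8 elements to look at.
        for k, v in pairs:
            if k == key:
                return v

    print(timeit.timeit(lambda: small["5"], number=1_000_000))
    print(timeit.timeit(lambda: scan("5"), number=1_000_000))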
Finally, for sets, you may also wish to consider probabilistic data structures, like Bloom filters; a minimal sketch follows.
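A minimal Bloom filter sketch (the bit-array size m and hash count k below are arbitrary illustrative choices, not tuned values):

    import hashlib

    class BloomFilter:
        """Probabilistic set membership: false positives possible,
        false negatives never."""

        def __init__(self, m=1024, k=4):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8)

        def _positions(self, item):
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    f = BloomFilter()
    f.add("apple")
    print("apple" in f)  # True
    print("pear" in f)   # almost certainly False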

- @ProfVersaggi: Actually, that's not even true; some hash tables handle duplicates poorly, but some do well. I advise you to read Joaquín M López Muñoz's [entries on the topic](http://bannalia.blogspot.fr/2014/01/a-better-hash-table.html). He authored, and maintains, Boost MultiIndex. – Matthieu M. Jan 17 '14 at 07:58
Hash tables are generally better if there isn't any need to keep the data in any sort of sequence. Binary trees are better if the data must be kept sorted.

- While they don't maintain sorted order, hash tables that maintain insertion order are somewhat trivial. – Jan 31 '11 at 00:09
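For example, Python's built-in dict has guaranteed insertion-order preservation since Python 3.7 (and collections.OrderedDict provided it before that):

    d = {}
    for word in ["banana", "apple", "cherry"]:
        d[word] = len(word)

    print(list(d))  # ['banana', 'apple', 'cherry'] -- insertion order, not sorted order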
- That's not so easy. I'm afraid of a couple of things: 1. hash tables have bad worst-case performance (O(n)); 2. in order to resize the hash table, I've got to rehash everything, which is pretty expensive. This question is about how to avoid such points and to be informed about the other _issues_ I'm missing. – peoro Jan 31 '11 at 00:22
- pst: Maintaining insert order is possible with almost any 'black-box' collection; to what extent can one maintain sort order with a hash table better than with a 'black-box'? – supercat Jan 31 '11 at 00:24
- @peoro: O(n) is not practically possible unless someone knows your implementation details and deliberately wants to break you. Even considering the resize operation (which happens at reasonable intervals), a hash table costs much less than a balanced tree. – Haozhun Jan 31 '11 at 14:53
- @peoro: To amplify Gene's point, if each resize on a hash table doubles the size (pretty typical), then immediately after a resize, half the items will have been rehashed once, a quarter of them twice, an eighth of them three times, 1/16 of them four times, etc., so no matter how big the table gets, the average number of rehashes will be less than two. The point about degenerate hashing situations is a good one, though. For example, if one makes a Dictionary of a struct type without overriding GetHashCode, and many keys have the same value in the first field, performance will be bad. – supercat Jan 31 '11 at 18:11
- @peoro: rehashing is usually not necessary (the hash can be stored alongside the data), growing the hash table can be done "online" (migration phase), and the O(n) worst case can be alleviated by the hash table structure (and there are many to choose from). – Matthieu M. Feb 01 '11 at 07:46
- @MatthieuM.: The O(n) worst case for a hash table is unavoidable unless the number of different values an element might take is limited. Otherwise, there's no way any hash table can achieve better than O(n) behavior if the elements it is given are all indistinguishable except by an equality-check operation (i.e. they all have the same hash code, and no ranking comparisons, partial-bit scans, or other such operations are available). – supercat Jun 11 '12 at 17:23
- @supercat: you're right, of course. However, a "simple" hash table is just an array of linked lists, and linked lists have a tendency to degenerate quickly, while if you use an array of open-addressed hash tables you have less chance of hitting that worst case. This is what I meant by *alleviated*; not that it was completely out of the picture, but that by choosing a different structure you can greatly diminish the chances of it happening. – Matthieu M. Jun 12 '12 at 08:20
- @MatthieuM.: Some hash tables are prone to degenerate into linear performance if the number of items gets close to the table size, no matter how good the hash function is. Effective algorithm design can reduce that danger. A hash table is going to be doomed to O(n) performance if the hash function is bad, though, no matter what sort of algorithm the table tries to use. – supercat Jun 12 '12 at 13:35
- In C++, std::map is a red-black tree, and std::unordered_map is the hash table. I usually prefer std::map since I am guaranteed never to run into trouble with re-hashing, which can e.g. destroy hard real-time systems. – Erik Alapää Apr 01 '15 at 08:50
A worthy point on a modern architecture: a hash table will usually, if its load factor is low, have fewer memory reads than a binary tree. Since memory accesses tend to be rather costly compared to burning CPU cycles, the hash table is often faster.
In the following, the binary tree is assumed to be self-balancing, like a red-black tree, an AVL tree, or a treap.
On the other hand, if you need to rehash everything in the hash table when you decide to extend it, that can be a costly operation, although its cost is amortized. Binary trees do not have this limitation.
Binary trees are easier to implement in purely functional languages.
Binary trees have a natural sort order and a natural way to walk the tree for all elements.
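That natural walk is just an in-order traversal; a sketch with a minimal hand-rolled node type (not any particular library's):

    from dataclasses import dataclass

    @dataclass
    class Node:
        key: int
        left: "Node | None" = None
        right: "Node | None" = None

    def in_order(node):
        # Yields keys in ascending order; the leftmost node is the
        # minimum, the rightmost the maximum.
        if node is not None:
            yield from in_order(node.left)
            yield node.key
            yield from in_order(node.right)

    root = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5)))
    print(list(in_order(root)))  # [1, 2, 3, 4, 5, 6]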
When the load factor in the hash table is low, you may be wasting a lot of memory space; but with two pointers per node, binary trees tend to take up more space.
Hash tables are nearly O(1) (depending on how you handle the load factor) vs. binary trees' O(log n).
Trees tend to be the "average performer". There is nothing they do particularly well, but nothing they do particularly badly either.

Hash tables have faster lookups:
- You need a hash function that produces an even distribution of keys (otherwise you'll collide a lot and have to fall back on something other than the hash, like a linear search); a sketch of that degenerate case follows the lists below.
- Hash tables can use a lot of empty space. You may reserve 256 entries but only need 8 (so far).
Binary trees:
- Deterministic. O(log n) I think...
- Don't need extra space like hash tables can
- Must be kept sorted. Adding an element in the middle means moving the rest around.
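To make the degenerate-distribution point concrete, here is a deliberately pathological Python sketch (the BadKey class is contrived, of course):

    class BadKey:
        """Every instance hashes to the same value, forcing collisions."""
        def __init__(self, n):
            self.n = n
        def __hash__(self):
            return 42                    # a worst-case hash function
        def __eq__(self, other):
            return self.n == other.n

    table = {BadKey(i): i for i in range(1000)}
    # All 1000 keys collide, so each lookup degenerates into a linear
    # scan over the colliding entries -- O(n) despite being a hash table.
    print(table[BadKey(999)])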

- What do you mean when you say that binary trees are deterministic? Hash tables are deterministic as well. Also, operations on binary trees are O(h) where h is the height. If it's a *balanced* binary tree, then h=O(log(n)). – Daniel Egeberg Jan 31 '11 at 00:11
- Not true! Hash tables can "miss". For instance, if you have an array of 10 and use a phone number to index into it (with a modulo, for instance), you could get a hash that points you to the first element of the array. However, if, when the array was built, 9 other numbers with the same hash were used first, you actually have to go all the way to the last element. In a binary search you are guaranteed O(log n) no matter what. !DISCLAIMER! It all depends on how you build up your hash sort/search. There are many ways... – whitey04 Jan 31 '11 at 00:14
- Adding an element in the middle *does not* mean moving the rest around. It's a linked data structure, not an array (maybe you are confusing binary search trees with binary search, which are two very different things). All operations are O(log(n)); if adding/removing in the middle meant moving the rest, it would be O(n). – MAK Jan 31 '11 at 04:27
- It all depends on how you implement it... Using a linked tree is a good way to bypass the insertion problem of a binary search. However, the binary search (with a tree underneath it or not) will always return a result in O(log n). A hash can't, unless the input key is 1:1 with the generated hash. – whitey04 Jan 31 '11 at 04:57
A binary search tree requires a total order relationship among the keys. A hash table requires only an equivalence or identity relationship with a consistent hash function.
If a total order relationship is available, then a sorted array has lookup performance comparable to binary trees, worst-case insert performance in the order of hash tables, and less complexity and memory use than both.
The worst-case insertion complexity for a hash table can be kept at O(1) (or O(log K), with K the number of elements sharing a hash) if it's acceptable to increase the worst-case lookup complexity to O(K), or to O(log K) if the elements can be sorted.
Invariants for both trees and hash tables are expensive to restore if the keys change, but less expensive than the O(n log n) of re-sorting an array.
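A sketch of the sorted-array alternative using Python's standard bisect module:

    import bisect

    keys = [3, 7, 9, 14, 21]            # kept sorted at all times
    values = ["c", "g", "i", "n", "u"]

    def lookup(key):
        i = bisect.bisect_left(keys, key)   # O(log n) binary search
        if i < len(keys) and keys[i] == key:
            return values[i]
        raise KeyError(key)

    def insert(key, value):
        i = bisect.bisect_left(keys, key)   # O(log n) to find the slot,
        keys.insert(i, key)                 # but O(n) to shift elements over
        values.insert(i, value)

    insert(10, "j")
    print(lookup(10))  # 'j'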
These are factors to take into account in deciding which implementation to use:
- Availability of a total order relationship.
- Availability of a good hashing function for the equivalence relationship.
- A priori knowledge of the number of elements.
- Knowledge about the rate of insertions, deletions, and lookups.
- Relative complexity of the comparison and hashing functions.

- "A binary search tree requires a total order relationship among the keys. A hash table requires only an equivalence or identity relationship with a consistent hash function." This is misleading. A binary search tree could always just use the same keys as the hash table: hash values. It is not a restriction on cases when trees may be used, compared to hash tables. – rlibby Feb 10 '11 at 19:43
- @rlibby Though most implementations of hash keys by default use types on which a total order is defined (integers or pointers), only equivalence is required if you provide your own hashes. So, in general, you cannot use a binary search tree upon hash keys, because you don't know what the hashes are, where they came from, or much less if they support a total order relationship. – Apalala Feb 18 '11 at 14:17
- But if I'm understanding your suggestion correctly, then such a hash value also cannot be used in a hash table. Surely if it *can* be used in a hash table then it can *also* be used in a tree set. If it can be used in a table, then it must map to some index in the table. One could use the function that generates this index to generate keys for the tree set. – rlibby Feb 20 '11 at 23:02
- @rlibby A hash table requires that elements that are equal have the same hash, but it doesn't require that elements that are different have different hashes. If different elements have the same hash, then there's no total-order relationship. – Apalala Nov 12 '12 at 04:08
- If different elements often have the same hashes, the hash table would be very slow anyway. It is indeed possible to store a linked list on each node of a binary tree, just like a linked list on each entry of a hash table. – SOFe Mar 08 '19 at 06:01
If you only need to access single elements, hash tables are better. If you need a range of elements, you really have no option but an ordered structure such as a binary tree (a sketch follows).
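For illustration, a range query over ordered keys (sketched here with bisect over a sorted list; a balanced tree supports the same query in O(log n + k)):

    import bisect

    keys = [1, 4, 6, 9, 12, 15]       # an ordered structure

    def range_query(lo, hi):
        # All keys in [lo, hi]. A plain hash table cannot answer this
        # without scanning every entry, since hashing scatters key order.
        i = bisect.bisect_left(keys, lo)
        j = bisect.bisect_right(keys, hi)
        return keys[i:j]

    print(range_query(4, 12))  # [4, 6, 9, 12]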

To add to the other great answers above, I'd say:
Use a hash table if the amount of data will not change (e.g. storing constants); but if the amount of data will change, use a tree. This is because, in a hash table, once the load factor threshold has been reached, the table must resize, and the resize operation can be very slow.

- The worst-case time for adding an element to a hash table is O(n) because of the resize, but if a hash table doubles in size each time, the fraction of additions that require a rehash will drop as the table size increases. The average number of rehash operations per element will never exceed two, no matter how big the table gets. – supercat Jan 31 '11 at 18:15
- If the hash table size is *doubling*, then I'd be surprised if the number of collisions decreased, because hash tables work best (i.e. a low number of collisions) when the size of the table is prime. Also, if you're asking the system to give you twice as much memory each time you resize, you'll quickly run out of memory (or slow the system down if the system rearranges its memory to give you the amount of contiguous memory you're asking for). – David Weiser Jan 31 '11 at 19:00
- Doubling is a common strategy, but it isn't required. What is required is exponential growth. You can pick a smaller exponent if you like; it will just mean that the average number of rehash operations will be higher. In any case, the amortized cost of n inserts in a table with exponential growth is O(n), while self-balancing binary search trees cost O(n*log(n)). – rlibby Feb 10 '11 at 19:38
One point that I don't think has been addressed is that trees are much better for persistent data structures, that is, immutable structures. A standard hash table (i.e. one that uses a single array of linked lists) cannot be updated non-destructively without copying the whole table. One situation in which this is relevant is if two concurrent functions both have a copy of a hash table and one of them changes the table (if the table is mutable, that change will be visible to the other one as well). Another situation would be something like the following:
    def bar(table):
        # some intern stuck this line of code in
        table["hello"] = "world"
        return table["the answer"]

    def foo(x, y, table):
        z = bar(table)
        if "hello" in table:
            raise Exception("failed catastrophically!")
        return x + y + z

    important_result = foo(1, 2, {
        "the answer": 5,
        "this table": "doesn't contain hello",
        "so it should": "be ok"
    })
    # catastrophic failure occurs
With a mutable table, we can't guarantee that the table a function call receives will remain that table throughout its execution, because other function calls might modify it.
So, mutability is sometimes not a pleasant thing. Now, a way around this would be to keep the table immutable, and have updates return a new table without modifying the old one. But with a hash table this would often be a costly O(n) operation, since the entire underlying array would need to be copied. On the other hand, with a balanced tree, a new tree can be generated with only O(log n) nodes needing to be created (the rest of the tree being identical).
This means that an efficient tree can be very convenient when immutable maps are desired.
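A sketch of that path-copying idea (using a minimal unbalanced BST for brevity; a real persistent map would also rebalance, e.g. as a red-black tree):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        key: str
        value: object
        left: "Node | None" = None
        right: "Node | None" = None

    def insert(node, key, value):
        # Returns a NEW tree. Only the nodes on the search path are
        # copied (O(log n) of them in a balanced tree); all other
        # subtrees are shared, untouched, with the old version.
        if node is None:
            return Node(key, value)
        if key < node.key:
            return Node(node.key, node.value, insert(node.left, key, value), node.right)
        if key > node.key:
            return Node(node.key, node.value, node.left, insert(node.right, key, value))
        return Node(key, value, node.left, node.right)  # replace the value

    old = insert(insert(None, "the answer", 5), "so it should", "be ok")
    new = insert(old, "hello", "world")  # 'old' still does not contain "hello"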

If you'll have many slightly different instances of sets, you'll probably want them to share structure. This is easy with trees (if they're immutable or copy-on-write). I'm not sure how well you can do it with hash tables; it's at least less obvious.

In my experience, hash tables are always faster because trees suffer too much from cache effects.
To see some real data, you can check the benchmark page of my TommyDS library: http://tommyds.sourceforge.net/
There you can compare the performance of the most common hash table, tree, and trie libraries.

One point to note concerns traversal and the minimum and maximum items. Hash tables don't support any kind of ordered traversal, or access to the minimum or maximum item. If these capabilities are important, a binary tree is the better choice.
