20

I am curious to know what reasoning could tip the scales towards using a self-balancing tree to store items rather than a hash table.

I see that hash tables cannot maintain insertion order, but I could always maintain a linked list on top to record the insertion-order sequence.
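
For example, a minimal sketch of that idea using Java's LinkedHashMap, which is exactly a hash table with a doubly-linked list threaded through its entries (the keys and values here are arbitrary):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class InsertionOrderDemo {
        public static void main(String[] args) {
            // LinkedHashMap = hash table + linked list through the entries,
            // so iteration follows insertion order.
            Map<String, Integer> map = new LinkedHashMap<>();
            map.put("banana", 2);
            map.put("apple", 5);
            map.put("cherry", 1);
            map.forEach((k, v) -> System.out.println(k + " -> " + v));
            // prints banana, apple, cherry -- the order they were inserted
        }
    }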

I see that for a small number of values there is the added cost of the hash function, but I could always cache the hash value together with the key for faster lookups.
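
Something like this hypothetical key wrapper (java.lang.String does essentially the same thing internally by caching its hashCode):

    // Hypothetical wrapper that computes the hash once and reuses it,
    // so repeated lookups never re-run the hash function.
    final class CachedKey {
        final byte[] data;
        final int hash;           // computed once at construction time

        CachedKey(byte[] data) {
            this.data = data.clone();
            this.hash = java.util.Arrays.hashCode(data);
        }

        @Override
        public int hashCode() {
            return hash;          // constant-time, no re-hashing
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof CachedKey
                && java.util.Arrays.equals(data, ((CachedKey) o).data);
        }
    }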

I understand that hash tables are more difficult to implement than the straight-forward implementation of a red-black tree, but in a practical implementation wouldn't one be willing to go the extra mile for the trouble?

I see that with hash tables it is normal for collisions to occur, but with open-addressing techniques like double hashing that allow the keys to be stored in the hash table itself, hasn't the problem been reduced to the point where it no longer tips the balance in favor of red-black trees for such implementations?
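
To be concrete, this is the double-hashing probe sequence I mean; h1, h2 and the table size are placeholders, and a real implementation would also need insertion, deletion handling and resizing:

    // Double hashing: the i-th probe for a key is (h1 + i * h2) mod m.
    // h2 must never be 0 and should be co-prime with the table size m
    // (e.g. by keeping m prime) so the probe sequence visits every slot.
    static int probe(int h1, int h2, int i, int m) {
        return Math.floorMod(h1 + i * h2, m);
    }

    // Lookup in an open-addressed table where the keys live in the array itself.
    static Integer lookup(Object[] keys, Object key) {
        int m = keys.length;
        int h1 = Math.floorMod(key.hashCode(), m);
        int h2 = 1 + Math.floorMod(key.hashCode(), m - 1);   // never 0
        for (int i = 0; i < m; i++) {
            int slot = probe(h1, h2, i, m);
            if (keys[slot] == null) return null;              // empty slot: not present
            if (keys[slot].equals(key)) return slot;          // found in the table itself
        }
        return null;                                          // whole table probed, not present
    }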

I am curious whether I am missing a disadvantage of hash tables that still makes red-black trees a viable data structure in practical applications (like filesystems, etc.).

Qantas 94 Heavy
  • 15,750
  • 31
  • 68
  • 83
Vaibhav Bajpai
  • 16,374
  • 13
  • 54
  • 85

6 Answers

21

Here is what I can think of:

  1. There are kinds of data which cannot be hashed (or are too expensive to hash), and therefore cannot be stored in hash tables.
  2. Trees keep data in the order you need (sorted), not insertion order. You can't (effectively) do that with a hash table, even if you run a linked list through it (see the sketch after this list).
  3. Trees have better worst-case performance.
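
Regarding point 2, a minimal sketch of the difference, assuming Java's TreeMap (a red-black tree) and HashMap:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class SortedVsHashed {
        public static void main(String[] args) {
            Map<String, Integer> tree = new TreeMap<>();   // red-black tree: keys kept sorted
            Map<String, Integer> hash = new HashMap<>();   // hash table: no meaningful key order
            for (String k : new String[] {"pear", "apple", "mango", "kiwi"}) {
                tree.put(k, k.length());
                hash.put(k, k.length());
            }
            System.out.println(tree.keySet());  // [apple, kiwi, mango, pear] -- always sorted
            System.out.println(hash.keySet());  // some hash-dependent order
        }
    }
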
unbeli
  • 29,501
  • 5
  • 55
  • 57
  • 1. If you can't produce a hash key, how do you produce a key to determine where the node is placed in the tree? If you can produce a key for the node placement, why can't you use that for the hash key? 2. Why can't you effectively do this with a hash table + linked list? Can you provide more explanation? Remember that a linked list is essentially just a tree optimized for ordering. 3. Trees have a best case of log(N). Hashing is always constant. Collisions have the same effect on both structures. How can trees have better worst-case performance? – Jake Sep 12 '10 at 21:00
  • 4
    @Jake 1- by comparing elements. Having order on something doesn't mean you can hash something. 2- Because. 3- There is nothing to collide in trees. – unbeli Sep 12 '10 at 21:05
  • 1. The item you're comparing is the key. All binary data can be hashed by some method, and all data in a computer is binary data. I suppose it could get expensive if the keys are huge, but that would likely have the same effect on the comparison function of the tree. 3. Search trees have ordered nodes. You can only order a list if the list has a set of keys in some form on which to order the elements. If two nodes have the same key, that is a collision. – Jake Sep 12 '10 at 21:21
  • 2
    1- not true. 3- not true about the keys: you don't need any keys for the search tree. If you insert an entry, which is equal to an existing entry, there is no performance impact, contrary to the hash collision. It is your homework to understand why. – unbeli Sep 13 '10 at 19:46
7

Storage allocation is another consideration. Every time a hash table fills past its load factor, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. Balanced trees, on the other hand, don't suffer from this issue at all.
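
For instance, when the size is known up front, the rehashing cost can usually be avoided by pre-sizing; a sketch assuming Java's HashMap and its default 0.75 load factor:

    import java.util.HashMap;
    import java.util.Map;

    public class Presized {
        public static void main(String[] args) {
            int expected = 1_000_000;
            // Capacity chosen so that 'expected' entries stay below the default
            // 0.75 load factor, so no resize/rehash happens while filling the map.
            Map<Integer, String> map = new HashMap<>((int) (expected / 0.75f) + 1);
            for (int i = 0; i < expected; i++) {
                map.put(i, "value-" + i);
            }
            System.out.println(map.size());
        }
    }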

Odrade
  • 7,409
  • 11
  • 42
  • 65
3

Just wanted to add:

  • Balanced binary trees have a predictable lookup time [O(log n)], independent of the input data. Many times that is important when your application needs to estimate its response times [hash tables may have unpredictable response times]. Remember that for smaller n, as in most common use cases, the performance difference of an in-memory lookup is hardly going to matter; the bottleneck of the system is going to be elsewhere, and sometimes you just want to make the system simpler to debug and analyze.

  • Trees are generally more memory efficient than hash tables, and much simpler to implement without any analysis of the distribution of input keys, possible collisions, etc.

jayadev
  • 3,576
  • 25
  • 13
2

In my humble opinion, self-balancing trees work pretty well as Academic topics. And I do not know of anything that can be qualified as a "straight-forward implementation of a red-black tree".

In the real world, the memory wall makes them far less efficient than they are on paper.

With this in mind, hash tables are decent alternatives, especially if you don't practice them the Academic style (forget about the table size constraint and you magically resolve the table resize issue and almost all collision issues).

In a word: keep it simple. If that's simple for you then that's simple for your computer.

Pierre
  • 35
  • 1
  • 2
    Could you explain how you can "forget about the table size constraint?" Are you just suggesting we ignore it, or do you mean something else? – Odrade Jul 16 '10 at 14:24
  • 3
    Do you count the C++ standard library implementation as an academic topic? or that of Java and .NET containers? Just so you know, std::map is mostly implemented as a balanced binary search tree; and gnu libstdc++ implements std::map as a Red & Black tree, IIRC. – Fingolfin Dec 02 '13 at 23:08
  • 1
    "keep it simple. If that's simple for you then that's simple for your computer." That is not very good advice, think Fibonacci Heap - highly complex and very efficient. Same goes for, for example, SSS* game tree search or Thorup path finding... – PawelP Nov 30 '14 at 21:35
  • 1
    @Odrade -- I'd think he means make your table so large that collisions are very unlikely and also don't worry about how full or empty it is (resizing to save memory). – Alex Leibowitz Sep 06 '19 at 14:23
1

I think that if you want to query for a range of keys instead of a single key, a self-balancing tree structure will perform better than a hash table structure.
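
For example, a sketch of such a range query assuming Java's TreeMap (a red-black tree under the hood); with a hash table you would have to scan every entry or probe every candidate key instead:

    import java.util.TreeMap;

    public class RangeQuery {
        public static void main(String[] args) {
            TreeMap<Integer, String> byId = new TreeMap<>();
            for (int id : new int[] {3, 17, 42, 56, 77, 91}) {
                byId.put(id, "record-" + id);
            }
            // All keys in [10, 60): the tree locates the start in O(log N),
            // then walks the range in order -- no per-key hashing needed.
            System.out.println(byId.subMap(10, 60));  // {17=record-17, 42=record-42, 56=record-56}
        }
    }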

Jegeyom
  • 11
  • 2
  • This sounds a lot like an opinion. Do you have any data / reference to support your statement? – Floris Mar 14 '14 at 19:53
  • 1
    Querying a range of keys in a hash table may imply knowing exactly where there are keys and where there are not. You would need to repeat the hashing operation for every possible key that you believe is in the query interval. With trees, you can just locate the beginning of the interval and then traverse the tree in order until the end of your query interval. – Cesar Sep 08 '14 at 15:09
1

A few reasons I can think of:

  1. Trees are dynamic (the space complexity is N), whereas hash tables are often implemented as fixed-size arrays, which means they will often be initialized with size K, where K > N, so even if you have only 1 element in a hashmap, you might still have 100 empty slots that take up memory. Another effect of this is:

  2. Increasing the size of an array-based hash table is costly (O(N), since every element must be re-hashed into the new array), whereas a tree can grow one node at a time: O(1) for the insertion itself plus O(log N) to locate the insertion point.

  3. Elements in a tree can be gathered in sorted order (using, e.g., an in-order traversal, as sketched after this list). Thereby you often get a sorted list as a free perk with trees.
  4. Trees can have better worst-case performance than a hashmap depending on how the hashmap is implemented (e.g. a hashmap with chaining will have O(N) worst case, whereas self-balanced trees can guarantee O(log N) worst case for all operations).
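
A minimal hand-rolled sketch of point 3: in-order traversal of a plain (unbalanced) BST yields sorted output; a real red-black tree adds rebalancing on insert, which this omits:

    import java.util.ArrayList;
    import java.util.List;

    class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    public class InOrder {
        // Plain BST insert (no rebalancing -- a red-black tree would recolor/rotate here).
        static Node insert(Node root, int key) {
            if (root == null) return new Node(key);
            if (key < root.key) root.left = insert(root.left, key);
            else if (key > root.key) root.right = insert(root.right, key);
            return root;
        }

        // In-order traversal visits keys in ascending order: sorted output for free.
        static void inOrder(Node n, List<Integer> out) {
            if (n == null) return;
            inOrder(n.left, out);
            out.add(n.key);
            inOrder(n.right, out);
        }

        public static void main(String[] args) {
            Node root = null;
            for (int k : new int[] {42, 7, 19, 88, 3}) root = insert(root, k);
            List<Integer> sorted = new ArrayList<>();
            inOrder(root, sorted);
            System.out.println(sorted);  // [3, 7, 19, 42, 88]
        }
    }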

Both self-balanced trees and hashmaps have a worst-case efficiency of O(log N) in the best worst-case (assuming that the hashmap handles collisions well), but hashmaps can have better average-case performance (often close to O(1)), whereas trees are always O(log N). This is because even though a hashmap can locate the insertion index in O(1), it has to account for hash collisions (more than one element hashing to the same array index), and thus at best degrades to a self-balanced tree per bucket (as in the Java implementation of HashMap); that is, each bucket of the hashmap can itself be implemented as a self-balanced tree storing all elements that hashed to that array cell.
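
To illustrate the collision point, a hypothetical key type whose hashCode sends every instance to the same bucket; the map still returns correct results, but every operation now searches within that single overfull bucket:

    import java.util.HashMap;
    import java.util.Map;

    public class CollisionDemo {
        // Hypothetical key type with a deliberately terrible hash function:
        // every key lands in the same bucket of the hash table.
        static final class BadKey {
            final int id;
            BadKey(int id) { this.id = id; }
            @Override public int hashCode() { return 42; }   // constant: all keys collide
            @Override public boolean equals(Object o) {
                return o instanceof BadKey && ((BadKey) o).id == id;
            }
        }

        public static void main(String[] args) {
            Map<BadKey, Integer> map = new HashMap<>();
            for (int i = 0; i < 10_000; i++) map.put(new BadKey(i), i);
            // Still correct, but every get/put now searches one overfull bucket
            // (a linked list, or a balanced tree in recent JDKs).
            System.out.println(map.get(new BadKey(1234)));  // 1234
        }
    }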

Daniel Valland
  • 1,057
  • 4
  • 21
  • 45