20

When implementing a Hashtable using an array, we inherit the constant time indexing of the array. What are the reasons for implementing a Hashtable with a Binary Search Tree since it offers search with O(logn)? Why not just use a Binary Search Tree directly?

Michael
  • 791
  • 2
  • 12
  • 32
  • 1
    possible duplicate of [Advantages of Binary Search Trees over Hash Tables](http://stackoverflow.com/questions/4128546/advantages-of-binary-search-trees-over-hash-tables) – Abdullah Shoaib Apr 10 '14 at 18:50
  • 13
    Abdullah, I particularly asked about implementing a hashtable using a binary search tree. Not Hashtables vs BSTs. – Michael Apr 10 '14 at 18:52
  • 2
    What is a Hashtable with Binary Search Trees? Are the keys hashed and stored in array as well as arranged in a tree? Or are the elements in each bucket stored as a tree rather than as a list? – Miserable Variable Apr 10 '14 at 18:57
  • 2
    What is the motivation of the question? Have you encountered a supposed hash table implemented with a BST? Or maybe one that's using a BST on each bucket to make searching through the collisions faster? – Adrian McCarthy Apr 10 '14 at 20:12
  • @MiserableVariable For all intents and purposes, it's implementing a hashtable using a BST. To the end user, it looks/works just like a "real" hashtable, albeit with slightly different performance considerations. – Joel B Apr 05 '17 at 19:47
  • @AdrianMcCarthy I had someone mention to me that certain hashtable implementations are actually done as BSTs. I'd be curious to know if this was actually the case, or if they were mistaken. – Joel B Apr 05 '17 at 19:47
  • @Joel B: The question remains unclear. If you replace the array of buckets with a binary search tree, I don't think that would be called a hash table, but some other sort of associative map. And the differences in lookup cost and memory can be substantial rather than slight. If you replace lists of collisions with BSTs of collisions, it seems unlikely to have a substantial effect on performance and may have a negative effect on memory usage. – Adrian McCarthy Apr 05 '17 at 20:00

4 Answers

27

If the elements don't have a total order (i.e. "greater than" and "less than" are not defined for all pairs, or are not consistent between elements), you can't compare all pairs, thus you can't use a BST directly - but nothing's stopping you from indexing the BST by the hash value. Since this is an integral value, it obviously has a total order (although you'd still need to resolve collisions, that is, have a way to handle elements with the same hash value).
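
As a sketch of that idea, here is a minimal, unbalanced BST indexed by the hash value, with a per-node list to resolve collisions. All names are my own, and Python's built-in `hash` stands in for whatever hash function such a table would use:

```python
# Sketch only: a map for keys that lack a total order, stored in an
# unbalanced BST ordered by hash(key). Keys with equal hashes share
# one node and are disambiguated by a linear scan of that node's list.

class _Node:
    def __init__(self, h):
        self.h = h            # hash value: the BST ordering key
        self.pairs = []       # (key, value) pairs whose hash == h
        self.left = None
        self.right = None

class HashBST:
    def __init__(self):
        self.root = None

    def put(self, key, value):
        h = hash(key)
        parent, node = None, self.root
        while node is not None and node.h != h:
            parent, node = node, (node.left if h < node.h else node.right)
        if node is None:                      # no node with this hash yet
            node = _Node(h)
            if parent is None:
                self.root = node
            elif h < parent.h:
                parent.left = node
            else:
                parent.right = node
        for i, (k, _) in enumerate(node.pairs):   # collision resolution
            if k == key:
                node.pairs[i] = (key, value)      # overwrite existing key
                return
        node.pairs.append((key, value))

    def get(self, key):
        h = hash(key)
        node = self.root
        while node is not None and node.h != h:
            node = node.left if h < node.h else node.right
        if node is not None:
            for k, v in node.pairs:
                if k == key:
                    return v
        raise KeyError(key)
```

Since ordering only ever compares hash values, the element type itself never needs to be orderable; the price is that an in-order traversal yields hash order, which is an arbitrary element order.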

However, one of the biggest advantages of a BST over a hash table is the fact that the elements are in order - if we order it by hash value, the elements will have an arbitrary order instead, and this advantage would no longer be applicable.

As for why one might consider implementing a hash table using a BST instead of an array, it would:

  • Not have the disadvantage of needing to resize the array - with an array, you typically mod the hash value with the array size and resize the array if it gets full, reinserting all elements, but with a BST, you can just directly insert the unchanging hash value into the BST.

    This might be relevant if we want any individual operation to never take more than a certain amount of time (which could very well happen if we need to resize the array), with the overall performance being secondary, but there might be better ways to solve this problem.

  • Have a reduced risk of hash collisions since you don't mod with the array size and thus the number of possible hashes could be significantly bigger. This would reduce the risk of getting the worst-case performance of a hash table (which is when a significant portion of the elements hash to the same value).

    What the actual worst-case performance is would depend on how you're resolving collisions. This is typically done with linked lists, for O(n) worst-case performance. But we can also achieve O(log n) performance with BSTs (as is done in Java's hash table implementation when the number of elements with some hash is above a threshold) - that is, have your hash table array where each element points to a BST in which all elements have the same hash value.

  • Possibly use less memory - with an array you'd inevitably have some empty indices, but with a BST, these simply won't need to exist. Although this is not a clear-cut advantage, if it's an advantage at all.

    If we assume we use the less common array-based BST implementation, this array will also have some empty indices, and it would also require occasional resizing, but that resizing is simply a memory copy, as opposed to needing to reinsert all elements with updated hashes.

    If we use the typical pointer-based BST implementation, the added cost for the pointers would seemingly outweigh the cost of having a few empty indices in an array (unless the array is particularly sparse, which tends to be a bad sign for a hash table anyway).
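
To make the bucket-level variant concrete, here is a hedged sketch (class and method names are my own, and it is far simpler than Java's real treeified HashMap): an array of buckets where each bucket chains its collisions in a BST keyed on the full, un-modded hash value.

```python
# Sketch only: array-of-buckets hash table in which each bucket resolves
# collisions with an unbalanced BST keyed on the full hash, rather than
# a linked list. A bucket holding k entries is then searched in roughly
# O(log k) comparisons instead of O(k).

class _HNode:
    def __init__(self, h):
        self.h = h                 # full hash value, the BST ordering key
        self.pairs = []            # (key, value) pairs whose hash == h
        self.left = self.right = None

class TreeChainedTable:
    def __init__(self, capacity=8):
        self.buckets = [None] * capacity   # each slot: a BST root or None

    def put(self, key, value):
        h = hash(key)
        i = h % len(self.buckets)
        parent, node = None, self.buckets[i]
        while node is not None and node.h != h:
            parent, node = node, (node.left if h < node.h else node.right)
        if node is None:                   # first entry with this full hash
            node = _HNode(h)
            if parent is None:
                self.buckets[i] = node
            elif h < parent.h:
                parent.left = node
            else:
                parent.right = node
        for j, (k, _) in enumerate(node.pairs):
            if k == key:                   # same hash and same key: overwrite
                node.pairs[j] = (key, value)
                return
        node.pairs.append((key, value))

    def get(self, key, default=None):
        h = hash(key)
        node = self.buckets[h % len(self.buckets)]
        while node is not None and node.h != h:
            node = node.left if h < node.h else node.right
        if node is not None:
            for k, v in node.pairs:
                if k == key:
                    return v
        return default
```

Java's HashMap differs in important ways - red-black trees, a treeify threshold, fallback key comparison - so this illustrates the idea, not that implementation.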

But, since I haven't personally heard of this being done, presumably the benefits are not worth degrading the expected cost of operations from O(1) to O(log n).

Typically the choice is indeed between using a BST directly (without hash values) and using a hash table (with an array).

Bernhard Barker
  • 54,589
  • 14
  • 104
  • 138
  • 3
    While I see where you're coming from, I think either benefit (no-resize insertion or less memory), though perhaps not both, can be achieved more effectively with ordinary hash tables: If O(n) resizing is prohibitive, one can avoid it by having two tables and moving a small constant number of entries on *each modification*. Likewise, if memory is at a premium, an open addressing hash table can be *very* tightly packed at decent (but non-optimal) performance, such that each BST node is several times larger than an array slot and the load factor exceeds 90% –  Apr 10 '14 at 22:43
  • what is total order? Still didn't get it can you update with an example? – abhimanyuaryan Sep 15 '18 at 18:41
  • @AbhimanyuAryan See: [What is total order - explanation please](https://math.stackexchange.com/q/239118) and [How can you explain partial order and total order in simple terms?](https://www.quora.com/How-can-you-explain-partial-order-and-total-order-in-simple-terms) I'm not sure, however, whether there exist collections for which no total order is possible (the above mostly relates to whether a specific order is a total order) - not that this is particularly relevant, as, when you define an ordering, you tend to do that for a reason. – Bernhard Barker Sep 15 '18 at 19:48
  • is there any BST implementation of Hash Table instead of Linked List implementation. I'm confused to represent key value pairs on BST. Do you have any resource where I can algorithm or code? – abhimanyuaryan Sep 15 '18 at 22:18
  • 2
    @AbhimanyuAryan As mentioned in my answer, Java's HashMap ([code here](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java)) uses a BST instead of a linked-list in some cases. Although that's a lot of code - I'd probably recommend trying to understand it primarily via the comments (and maybe [this post I linked to](https://stackoverflow.com/questions/30164087/how-does-java-8s-hashmap-degenerate-to-balanced-trees-when-many-keys-have-the-s) can help). – Bernhard Barker Sep 15 '18 at 22:31
  • Why would a hashtable not have order? don't you need to specify data type of key and value during initialisation? and elements of same data type can always be ordered. – Abhishek Choudhary Mar 25 '22 at 07:55
2

Pros:

  1. Potentially use less space because we don't allocate a large array
  2. Can iterate through the keys in order, which is sometimes useful

Cons:

  1. You'd have O(log N) lookup time, which is worse than the expected O(1) for a chained hash table.
bluepanda
  • 382
  • 2
  • 13
  • 3
    Worst case for a chained hash table is O(N). It occurs when every value results in a collision. Assuming you replace the array with either a balanced binary tree or a trie, that arguably results in a constant time lookup because the number of bits in the key is fixed and is not directly dependent upon the number of items in the hash. – dgatwood Oct 17 '16 at 18:29
0

Since the requirement of a Hash Table is O(1) lookup, it's not a Hash Table if it has logarithmic lookup times. Granted, since collisions are an issue with the array implementation (though rarely a serious one in practice), using a BST could offer benefits in that regard. Generally, though, it's not worth the tradeoff - I can't think of a situation where you wouldn't want guaranteed O(1) lookup time when using a Hash Table.

Alternatively, there is the possibility of an underlying structure that guarantees logarithmic insertion and deletion via a BST variant, where each index in the array has a reference to the corresponding node in the BST. A structure like that could get somewhat complex, but it would guarantee O(1) lookup and O(log n) insertion/deletion.
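
One way to sketch that hybrid (all names hypothetical): pair an ordinary hash table with an ordered index over the keys. Here a sorted Python list maintained with `bisect` stands in for the BST, so the O(log n) insertion bound is only approximated (`insort`'s list shifting is O(n)), but the O(1) lookup path never touches the tree:

```python
import bisect

# Sketch only: a dict gives O(1) expected lookup, while a separate
# ordered index (a sorted list here, standing in for a balanced BST)
# supports in-order traversal; inserts and deletes pay the extra cost
# of maintaining the index.
class IndexedMap:
    def __init__(self):
        self._table = {}      # key -> value: the O(1) lookup path
        self._ordered = []    # sorted keys: the BST stand-in

    def insert(self, key, value):
        if key not in self._table:
            # insort's binary search is O(log n); the list shift behind it
            # is O(n), which a real BST would avoid.
            bisect.insort(self._ordered, key)
        self._table[key] = value

    def delete(self, key):
        if key in self._table:
            del self._table[key]
            self._ordered.pop(bisect.bisect_left(self._ordered, key))

    def lookup(self, key):
        return self._table[key]            # O(1): no tree walk at all

    def in_order(self):
        return list(self._ordered)         # sorted key traversal
```

The design point matches the paragraph above: lookups bypass the ordered structure entirely, and only mutations pay for keeping it consistent.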

Phillip Carter
  • 4,895
  • 16
  • 26
  • 1
    You're just arguing semantics regarding what it's called and whether it can exist, not actually answering **why** you'd want to use it. – Bernhard Barker Apr 10 '14 at 22:35
  • 1
    The time is logarithmic in the length of the hash key (which is constant), not logarithmic in the number of elements in the hash. So it would still be a constant-time algorithm. – dgatwood Oct 17 '16 at 18:32
  • @dgatwood, I think you're mistaken. How is it logarithmic in the length of the hash key? How can you find which node in the tree should you return just by looking the hash value of a key? Why are we using binary search tree, then? Let's use Binary tree if no search is necessary. I think it's O(logn) where n is the number of element in the tree not the length of the hash key. – Baskaya May 02 '18 at 20:32
  • 1
    If I understand the original question correctly, it is about storing the buckets themselves in a binary tree based on the bits in the hash result. Of course, the reason I would consider doing something like that would be to save memory by not storing the buckets that aren't in use, whereas the approach described, I believe, uses either the same amount of memory or more memory, depending on how you interpret it, so I don't see the benefit of the approach described in the question versus a simple indexed lookup, but that's a separate discussion. – dgatwood May 03 '18 at 19:03
0

I found this looking to see if anyone had done it. I guess maybe not.

I came up with an idea this morning of implementing a binary tree as an array consisting of rows stored by index. Row 1 has 1 node, row 2 has 2, row 3 has 4 (yes, powers of two). The advantage of this structure is that a bit shift and an addition or subtraction can be used to walk the tree, instead of using extra memory to store bi- or uni-directional references.

This would allow you to rapidly search for a hash value based on some sort of hashable input, to discover whether the value exists in some other store, or to search for a hash collision (or partial collision). I can't think of many other uses for it, but for these it would be phenomenally fast. Very likely a lot of the rotation operations would happen entirely in CPU cache and be written out in nice linear blobs to main memory.

Its main utility would be with sorting input values of a random nature. If the blobs in the array were two parts, like a hash, and an identifier for another store, you could do the comparisons very fast and insert very fast to discover where an item bearing a hash value is kept in another location (like the UUID of a filesystem node or maybe even the filename, or other short identifiable string).

I'll leave it to others to dream of other ways to use it but I'm using it for a graph theoretic proof of work search table for identifying partial collisions for a variant of Cuckoo Cycle.

I am just now working on the walk formula, and here it is:

i = index of array element

Walk Up (go to parent):

(i >> 1) - ((i + 1) % 2)

(Obviously you probably need to test if i is zero)

Walk Left (down and left):

(i << 1) + 2

(this and the next would also need to test against 2^depth of the structure, so it doesn't walk off the edge and fall back to the root)

Walk Right (down and right):

(i << 1) + 1

As you can see, each walk is a short formula based on the index: a bit shift and an addition for going left or right; a bit shift, addition, modulus and subtraction for ascending. Two instructions to move down, four to move up (in assembler, or as above in C-style operator notation).
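
The three walks can be checked mechanically. Note that in C-family precedence, `+`, `-` and `%` bind tighter than the shifts, so the formulas need parentheses when written as real code. The sketch below uses the 0-rooted layout the answer implies (the children of node i sit at 2i + 1 and 2i + 2) and asserts that walking up inverts both downward walks:

```python
# Checking the index walks for the dense array layout described above.
# The answer labels 2i + 2 as the "left" child and 2i + 1 as the
# "right" child; the parent formula must invert both.

def walk_right(i):                 # down and right: (i << 1) + 1
    return (i << 1) + 1

def walk_left(i):                  # down and left: (i << 1) + 2
    return (i << 1) + 2

def walk_up(i):                    # parent; equivalent to (i - 1) >> 1
    assert i != 0, "the root has no parent"
    return (i >> 1) - ((i + 1) % 2)

# Walking up must return every node's child walk to its origin.
for i in range(1024):
    assert walk_up(walk_left(i)) == i
    assert walk_up(walk_right(i)) == i
```

As the comment in `walk_up` notes, the simpler `(i - 1) >> 1` gives the same parent for both children, since integer shift discards the low bit.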

edit: I can see from further commentary that slashing the insert time definitely would be of benefit. But I don't think that a conventional vector-based binary tree would provide nearly as much benefit as a dense version, where all the nodes are in a contiguous array: when searched, it naturally travels linearly through memory, which should help reduce cache misses and thus cut search latency significantly, since accessing memory randomly carries a latency hit compared to streaming through blocks sequentially.

https://github.com/calibrae-project/bast/blob/master/pkg/bast/bast.go

This is the current state of my WiP implementation of what I am calling a Bifurcation Array Search Tree. For the purpose of fast insert/delete and a not-horribly-slow search through a sorted collection of hashes, I think this would be of considerable benefit in cases where a lot of data is coming and going through the structure - or, more to the point, in more realtime applications.

Louki Sumirniy
  • 106
  • 1
  • 1
  • 6