
How can it be O(1)? If you have an empty hash table, it is obviously constant, but as the number of elements increases and collisions start, wouldn't the complexity increase as well?

I mean, search time will increase for tables with more elements, so how is this constant?

Carlo V. Dango
  • The collision rate depends on the fill factor. That is a constant you choose. (basically the big O) – joop Apr 11 '14 at 10:17
  • @joop What do you mean by fill factor? –  Apr 11 '14 at 10:41
  • The (number of items) / (number of Buckets) – joop Apr 11 '14 at 11:22
  • But number of items is not constant, so fill factor is not really a constant –  Apr 11 '14 at 12:37
  • Normally, you _design_ your HT for a certain capacity. In cases where you _really_ don't know the size in advance, you can just choose a size and grow/shrink the table when needed. (this will effectively lead to O(log n) behaviour.) – joop Apr 11 '14 at 13:06
  • Related: ["Why are hash table expansions usually done by doubling the size?"](http://stackoverflow.com/questions/2369467/why-are-hash-table-expansions-usually-done-by-doubling-the-size). – David Cary Aug 14 '16 at 23:49
  • Related: ["For what kind of data are hash table operations O(1)?"](http://cs.stackexchange.com/questions/477/for-what-kind-of-data-are-hash-table-operations-o1) – David Cary Aug 15 '16 at 05:33
  • Is this a duplicate of https://stackoverflow.com/questions/2771368/can-hash-tables-really-be-o1 ? – David Cary Mar 30 '18 at 21:22

1 Answer


The extremely simple toy "static" hash table implementations often used to introduce students to hash tables do have the problems you pointed out.

In practice, many popular hash table implementations are "dynamic" -- rather than letting the table fill up and the hash chains grow longer and longer, leading to the problems you pointed out, during an insert() they pro-actively recognize when there are "too many" collisions and then do something about it. Often they don't bother trying to figure out why there are so many collisions -- i.e., they don't know or care whether the real problem is (a) or (c) -- they go ahead and do both (b) and (d):

There are two different ways that a hash table can get lots of collisions, and both of them have known solutions:

  • (a) If there are a lot of data items and not very many buckets, there will be collisions. A (separate chaining) hash table with 100 slots, after trying to insert 150 items, will inevitably have dozens of collisions. (Because of the birthday effect, you'll probably get dozens of collisions long before that.) To avoid this problem,

    • (b) resize the table: allocate a table with more buckets and re-insert every item, keeping the fill factor below some constant bound.

  • (c) With the particular items it needs to store, the particular hash function currently in use "coincidentally" sends many items to the same bucket, many more than one would expect from the birthday effect. To avoid this problem,

    • (d) switch to a different randomly-picked hash function from a universal family of hash functions. For any particular set of data -- even if the data has been carefully selected so the current hash function sends most of the data to the same bucket -- most of the other hash functions in a universal family will spread that particular set of data out much more evenly.
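
To make (b) and (d) concrete, here is a minimal sketch -- my own toy code, not taken from any particular library, with all names invented -- of a separate-chaining table whose insert() grows the bucket array whenever the fill factor passes a chosen constant, and re-seeds its hash function on every resize to approximate picking a new member of a universal family:

```python
import random

class DynamicHashTable:
    """Toy separate-chaining hash table that does (b) and (d) on insert."""
    MAX_FILL = 0.75  # fill factor bound -- a constant we choose

    def __init__(self, buckets=8):
        self._buckets = [[] for _ in range(buckets)]
        self._count = 0
        self._seed = random.getrandbits(64)  # selects one hash function

    def _index(self, key):
        # Mixing a random seed into the hash approximates choosing a
        # random member of a universal family of hash functions.
        return hash((self._seed, key)) % len(self._buckets)

    def insert(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update existing key
                return
        bucket.append((key, value))
        self._count += 1
        if self._count / len(self._buckets) > self.MAX_FILL:
            self._resize()  # (b): more buckets, so chains stay short

    def _resize(self):
        old = self._buckets
        self._buckets = [[] for _ in range(2 * len(old))]
        self._seed = random.getrandbits(64)  # (d): new hash function
        for bucket in old:
            for key, value in bucket:
                self._buckets[self._index(key)].append((key, value))

    def lookup(self, key):
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)
```

After many inserts the fill factor never exceeds MAX_FILL, so the expected chain length -- and hence the expected lookup cost -- stays bounded by a constant.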

In order to prevent the problems you pointed out, practical in-RAM hash table implementations have code in the insert() function that occasionally triggers some "extra work": a resize-by-copying-all-entries, which requires O(n+k) time, to do (b) and (d). That happens so rarely that n inserts require a total time of O(n), so the average insert time is O(1).
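
As a back-of-envelope check on that amortized claim (a sketch under the assumption that the table doubles its bucket count whenever the fill factor exceeds 0.75): counting one unit of work per insert, plus one unit per entry copied during a resize, shows the total stays within a small constant multiple of n:

```python
def total_insert_cost(n, initial_buckets=8, max_fill=0.75):
    """Units of work for n inserts into a table that doubles its
    bucket count whenever the fill factor exceeds max_fill:
    1 unit per insert, plus `count` units each time every entry
    is copied into the larger table."""
    buckets, count, cost = initial_buckets, 0, 0
    for _ in range(n):
        count += 1
        cost += 1
        if count / buckets > max_fill:
            buckets *= 2
            cost += count  # resize: re-insert every item so far
    return cost
```

Because the resize points form a geometric series, the copy work sums to O(n) overall, so the amortized cost per insert is O(1).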

Many hash table implementations -- implementations of separate chaining, linear probing, etc. -- resize often enough that hash table lookups require O(1) comparisons on average, although they have no guarantees on the worst-case hash-table lookup time.

A few hash table implementations -- implementations of hopscotch hashing, cuckoo hashing, dynamic perfect hashing, etc. -- do even more "extra work", in the form of (b) and repeating (d) as many times as necessary, to guarantee that hash table lookups require O(1) comparisons even in the worst case.
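
To illustrate where that worst-case guarantee comes from, here is a toy cuckoo-hashing sketch -- my own simplification, with invented names; real implementations differ. Every key has exactly two candidate slots, so lookup() inspects at most two entries, while insert() does the "extra work" of evicting entries and, when an eviction chain runs too long, repeating (d) by re-picking both hash functions:

```python
import random

class CuckooTable:
    """Toy cuckoo hash table: each key has exactly two candidate
    slots, so lookup inspects at most two entries -- O(1) worst case."""

    def __init__(self, slots=16):
        self._slots = [None] * slots
        self._seeds = (random.getrandbits(64), random.getrandbits(64))
        self._count = 0  # number of stored keys

    def _positions(self, key):
        n = len(self._slots)
        return (hash((self._seeds[0], key)) % n,
                hash((self._seeds[1], key)) % n)

    def lookup(self, key):
        # Worst case: exactly two slot inspections.
        for pos in self._positions(key):
            entry = self._slots[pos]
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def insert(self, key, value):
        entry = (key, value)
        for _ in range(32):  # bounded eviction chain
            p0, p1 = self._positions(entry[0])
            for pos in (p0, p1):
                if self._slots[pos] is None or self._slots[pos][0] == entry[0]:
                    if self._slots[pos] is None:
                        self._count += 1
                    self._slots[pos] = entry
                    return
            # Both candidate slots taken: evict the occupant of p0
            # and try to re-place it on the next iteration.
            entry, self._slots[p0] = self._slots[p0], entry
        # Eviction chain too long: grow and repeat (d) with new functions.
        self._rehash()
        self.insert(*entry)

    def _rehash(self):
        old = [e for e in self._slots if e is not None]
        self._slots = [None] * (2 * len(self._slots))
        self._seeds = (random.getrandbits(64), random.getrandbits(64))
        self._count = 0
        for k, v in old:
            self.insert(k, v)
```

The insert-side work (evictions and occasional rehashes) is what buys the unconditional two-probe lookup; its cost is O(1) only in expectation, which is the trade-off the answer describes.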

David Cary