
When implementing a hash table using a good hash function (one where the probability of any two elements colliding is 1/m, where m is the number of buckets), it is well known that the average-case running time for looking up an element is Θ(1 + α), where α is the load factor. The worst-case running time is O(n), though, if all the elements end up in the same bucket.
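
To make the chaining model concrete, here is a minimal Python sketch (my own illustration, not from any cited source). The point is that a lookup scans exactly one bucket, so its cost is proportional to that bucket's length:

```python
# Minimal separate-chaining hash table. A lookup scans one bucket,
# so its cost is 1 (to hash) plus the length of that bucket's chain,
# which is Theta(1 + alpha) on average.
class ChainedHashTable:
    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def insert(self, key, value):
        self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def lookup(self, key):
        # The length of this scan is what the "expected worst-case"
        # analysis below is really about.
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None
```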

I was recently doing some reading on hash tables and found this article, which claims (on page 3) that if α = 1, the expected worst-case complexity is Θ(log n / log log n). By "expected worst-case complexity" I mean the expected value of the maximum amount of work you'll have to do for a lookup when the elements are distributed by a uniform hash function. This is different from the actual worst case, since the worst-case behavior (all elements in the same bucket) is extremely unlikely to actually occur.

My question is the following: the author seems to suggest that varying the value of α can change the expected worst-case complexity of a lookup. Does anyone know of a formula, table, or article somewhere that discusses how changing α changes the expected worst-case runtime?

templatetypedef
  • The general analysis is complicated off-the-cuff (for me at least). I'm not sure of any references solving this problem in particular, so there might be localised simplifications. In any case, note that the maximum list length is `L = max_x l(x)`, where `l(x)` is the length of the list in the slot that `x` is placed into (same notation as there). `l(x)` is a sum of Bernoulli trials, i.e. `l(x) ~ Bin(n, 1/m)` has a Binomial distribution. The maximum of all of these random variables, of which there are `n`, is thus the nth order statistic. – davin May 14 '12 at 09:50
  • There are plenty of formulae for discrete order-statistic distributions around the internet, from which you can put together some (rather ugly) formulae, or calculate tables for various values. – davin May 14 '12 at 09:53
  • I’m unclear on the difference between expected case (= average case) and expected worst-case. Could you expound on that? – Konrad Rudolph May 14 '12 at 13:21
  • @KonradRudolph - Sure! Think of it this way. Suppose you use a good hash function to distribute n elements into m bins. The expected runtime for a lookup is the average number of objects you'd expect to see in any one bucket (this is n/m). The expected worst-case runtime is the average number of elements in the bucket containing the maximum number of objects. One way to get this value would be to run lots of trials of distributing n elements into m bins and counting how many elements are in the most filled bucket after each trial (see the sketch after these comments). – templatetypedef May 14 '12 at 14:59
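
The experiment described in the last comment is easy to run. Here is a short Python sketch (my own, with illustrative names) that estimates the expected maximum bucket size by repeated random trials:

```python
import random

def expected_worst_case(n, m, trials=100):
    """Estimate the expected maximum bucket size when n items are
    dropped uniformly at random into m buckets."""
    total = 0
    for _ in range(trials):
        counts = [0] * m
        for _ in range(n):
            counts[random.randrange(m)] += 1
        total += max(counts)
    return total / trials

# With alpha = n/m = 1 the estimate grows roughly like log n / log log n.
for n in (10**3, 10**4, 10**5):
    print(n, expected_worst_case(n, n))
```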

2 Answers


For fixed α, the expected worst time is always Θ(log n / log log n). However, if you make α a function of n, then the expected worst time can change. For instance, if α = Θ(n) (that's the case where you have a fixed number of hash buckets), then the expected worst time is Θ(n).

In general the distribution of items into buckets is approximately a Poisson distribution: the odds of a random bucket having i items are α^i e^{-α} / i!. The worst case is just the maximum of m close-to-independent observations of this distribution. (Not entirely independent, but fairly close to it.) The maximum of m observations tends to be a value whose odds of happening are about 1 in m. (More precisely the distribution is given by a Beta distribution, but for our analysis 1/m is good enough.)
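
To see the Poisson approximation in action, here is a short Python check (my own sketch): it throws n items into m = n buckets and compares the empirical fraction of buckets holding exactly i items against α^i e^{-α} / i!:

```python
import math
import random
from collections import Counter

n = m = 100_000          # load factor alpha = n/m = 1
alpha = n / m

counts = [0] * m
for _ in range(n):
    counts[random.randrange(m)] += 1

occupancy = Counter(counts)  # i -> number of buckets holding exactly i items
for i in range(6):
    empirical = occupancy.get(i, 0) / m
    poisson = alpha**i * math.exp(-alpha) / math.factorial(i)
    print(i, round(empirical, 4), round(poisson, 4))
```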

As you head into the tail of the Poisson distribution, the growth of the i! term dominates everything else, so the cumulative probability of everything above a given i is smaller than the probability of i itself. So to a good approximation you can figure out the expected value by solving for i in:

α^i e^{-α} / i! = 1/m = 1/(n/α) = α/n

Take logs of both sides (using Stirling's approximation log(i!) = i log(i) - i + O(log(i))) and we get:

i log(α) - α - (i log(i) - i + O(log(i))) = log(α) - log(n)
log(n) - α = i log(i) - i - i log(α) + O(log(i))

If we hold α constant then this is:

log(n) = i log(i) + O(i)

Can this work if i has the form k log(n) / log(log(n)) with k = Θ(1)? Let's try it:

log(n) = (k log(n) / log(log(n))) (log(k) + log(log(n)) - log(log(log(n)))) + O(i)
       = k (log(n) + o(log(n))) + o(log(n))
       = k log(n) (1 + o(1))

Since k = Θ(1), matching both sides forces k = 1 + o(1). And then we get the sharper estimate that, for any fixed load factor α, the expected worst time is (1 + o(1)) log(n) / log(log(n)).
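
As a numeric sanity check (my own sketch, not part of the original analysis), you can solve α^i e^{-α} / i! = α/n directly for the smallest qualifying i and compare it against log(n) / log(log(n)). Note that the ratio approaches 1 only very slowly, which is typical of log log bounds:

```python
import math

def solve_tail(n, alpha=1.0):
    """Smallest integer i with alpha^i e^(-alpha) / i! <= alpha / n,
    approximating the expected worst-case bucket size."""
    i = 1
    while alpha**i * math.exp(-alpha) / math.factorial(i) > alpha / n:
        i += 1
    return i

for n in (10**3, 10**6, 10**9, 10**12):
    print(n, solve_tail(n), round(math.log(n) / math.log(math.log(n)), 2))
```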

btilly

After some searching, I came across this research paper that gives a complete analysis of the expected worst-case behavior of a whole bunch of different types of hash tables, including chained hash tables. The author shows that the expected length of the longest chain is approximately Γ^{-1}(m), where m is the number of buckets and Γ is the Gamma function. Assuming that α is a constant, this is approximately ln m / ln ln m.
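
For a rough feel for that bound (a quick numeric sketch of my own, not from the paper), you can invert the Gamma function by bisection on its log and compare against ln m / ln ln m; the two agree only asymptotically:

```python
import math

def gamma_inverse(m):
    """Solve Gamma(x) = m for x > 2 by bisection on lgamma,
    which avoids overflowing Gamma itself."""
    target = math.log(m)
    lo, hi = 2.0, 4.0
    while math.lgamma(hi) < target:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if math.lgamma(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for m in (10**3, 10**6, 10**9):
    print(m, round(gamma_inverse(m), 2),
          round(math.log(m) / math.log(math.log(m)), 2))
```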

Hope this helps!

templatetypedef