1

I ran into an example in Computer Science Course.

suppose we use Hashing with chaining and use table of size m. the hash function map record with key k into k mod m slot. if we know the record keys is subset of {i^2 | 1 <= i <= 100}, for which value of m cost of search is lower in worst case?

a) 11

b) 7

c) 9

d) 12

My TA says (1) is true, but i thin this is false. infact i have no idea how we get this ! any idea?

2 Answers2

1

You can empirically check it with a simple code:

    int[] mVals = {11, 7, 9, 12};
    for (int m : mVals) { 
        int[] cells = new int[m];
        for (int i = 1; i<= 100; i++) {
            int x = i*i % m;
            cells[x]++;
        }
        System.out.println("m=" + m + " cells=" + Arrays.toString(cells));
    }

Will yield:

m=11 cells=[9, 19, 0, 18, 18, 18, 0, 0, 0, 18, 0]
m=7 cells=[14, 29, 28, 0, 29, 0, 0]
m=9 cells=[33, 23, 0, 0, 22, 0, 0, 22, 0]
m=12 cells=[16, 33, 0, 0, 34, 0, 0, 0, 0, 17, 0, 0]

Since your values are in the specified range, you can see that the "worst" cell in the m=11 table has probability of 19/100 for elements to be inserted into it, while for all other values of m - the highest probability is higher.


As for why, there are several factors at hand:

  1. Larger value of m is usually preferred - to understand it, make sure you understand what happens when m=1 (all elements are in one list), or m=2 (half of the elements in each of the two lists)
  2. Prime numbers are preferred and "behave nicer" for hash functions. This topic is discussed thoroughly in the thread Why should hash functions use a prime number modulus?. The idea is prime numbers are less vulnurable for bias from a specific domain of elements, and your set of squared numbers is one example of such.
Community
  • 1
  • 1
amit
  • 175,853
  • 27
  • 231
  • 333
  • Wow, @amit, you are so clever, nice idea. but how i can prove it in theoretical way? –  Feb 24 '15 at 13:35
  • @AliMovagher I agree that the top answer for the linked question is not rigorous. Here the considerations boil down to number theory. Mod an odd prime m = p, the asymptotic distribution is 2/m for all quadratic residues except 0, 1/m for 0, and nothing in the non-residue buckets. A modulus m = p^2 is bad because all multiples of p fall in bucket 0, for a fraction of 1/sqrt(m). Ditto for higher powers and their multiples. Squarefree composites are OK, but slightly worse than primes for reasons that this comment box is too small to describe. – David Eisenstat Feb 24 '15 at 15:39
  • @DavidEisenstat, i love to theoretical learning :) would you please submit as an answer? infact i want to learn more formal :) –  Feb 24 '15 at 15:42
  • @AliMovagher I got it slightly wrong. I don't have time to do a proper answer, so why don't you ask math.SE how to characterize the set of moduli m such that x^2 = c mod m has at most two solutions, citing this question as context? – David Eisenstat Feb 24 '15 at 16:12
0

As a general rule of thumb, you usually want a large prime number to do modulo by in hashing, because it will generate more distinct values and those lead to more uniform distribution across the slots of your hash table. Since 11 is the largest prime in your list, intuitively that's going to be the best.

For your particular problem, we have the records:

1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, ..., 10000

We have to find, for each of your options, how many distinct values numbers from this set generate modulo your variants.

For 11:

if n mod 11 = 0 => n*n mod 11 = 0
if n mod 11 = 1 => n*n mod 11 = 1 (1)
if n mod 11 = 2 => n*n mod 11 = 4 (2)
            = 3               = 9 (3)
            = 4               = 5 (4)
            = 5               = 3 (5)
            = 6               = 3
            = 7               = 5
            = 8               = 9
            = 9               = 4
            = 10              = 1

For 7:

if n mod 7 = 0 => n*n mod 7 = 0 (1)
           = 1              = 1 (2)
           = 2              = 4 (3)
           = 3              = 2 (4)
           = 4              = 2
           = 5              = 4
           = 6              = 1

For 9:

if n mod 9 = 0 => n*n mod 9 = 0 (1)
           = 1              = 1 (2)
           = 2              = 4 (3)
           = 3              = 0
           = 4              = 7 (4)
           = 5              = 7
           = 6              = 0
           = 7              = 4
           = 8              = 1

Similarly for 12. As you can see, 11 is the one generating more distinct values for squares, so it will spread your values more uniformly across your hash table's bins. More uniform spread results in lower cost of search in the worst case.

IVlad
  • 43,099
  • 13
  • 111
  • 179