1

I was going through the Stanford algorithms course in which they had mentioned prime no. as a 'quick and dirty' method of hashing. So I was attempting to implement my own hash table class, but I am stuck at finding the best way to get the closest prime number to n (number of 'buckets'). Sieve of Eratosthenes is valid, but will take O(nloglogn) time complexity.

Is there any better way to go about this?

Antibody
  • 21
  • 3
  • The fastest non-deterministic test you can muster first, then a deterministic test when the former pans out as probable. Ex would be using a Miller-Rabin non-deterministic test first, and if it pans out, take that to a deterministic test to verify if you're paranoid. – WhozCraig Jun 05 '22 at 06:11
  • 1
    @WhozCraig Miller-Rabin is deterministic and perfect for small numbers by testing just a few bases. "if n < 4,759,123,141, it is enough to test a = 2, 7, and 61" and "if n < 18,446,744,073,709,551,616 = 2^64, it is enough to test a = 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, and 37." – Goswin von Brederlow Jun 05 '22 at 06:14
  • Posting as a comment because this doesn't answer your precise question: assuming that the prime number hash function H is of reasonable quality (e.g., 2-universal), you can just fix a convenient large prime p and use x |-> H(x) % n for the hash function. If you choose n to be a power of two, you can strength-reduce to x |-> H(x) & (n-1), which adds negligible cost compared to H. – David Eisenstat Jun 05 '22 at 21:49

4 Answers4

1

Your question says that you are trying to pick a prime number that is closest to the number of buckets. Presumably you are doing this to avoid wasting buckets...

That's not how you do it.

When you apply a prime number modulus to your hash, then you pick a prime number of buckets. The modulus is exactly the number of buckets.

The number of buckets you choose typically depends on the number of items in the table. Usually when this choice is made, the target table size is somewhere around 2n, where n is the number of items.

There is no need to be precise, however. If you have a hard-coded list of 256 primes, logarithmically distributed so that the ith one is about 2^(i/4), then you can just find the one that is closest to 2n and use that. It will be within 10% of the target value, and that is close enough.

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87
  • I see, But in that case, what is a good way to find out the closest prime no. to 2n? @Homer512 mentioned a good method, but I had hoped there would be a good algorithm to be used for this which I hadn't come across. – Antibody Jun 05 '22 at 15:11
0

You can use a prime number test like Miller Rabin primality test with a bit of randomness to pick the number of buckets. While the test runs in constant time for a single std::size_t you might have to try a few numbers. But resizing a hashtable is rare.

Alternatively you can just pre-compute a list of suitable primes for use as hashtable size, e.g. the closest prime to 2^n. You can still add randomness to the hash by using (key ^ seed) % prime.

Goswin von Brederlow
  • 11,875
  • 2
  • 24
  • 42
0

GCC just uses a precomputed LUT. No reason to recompute primes when you can just store an array and use std::lower_bound

In fact, they use two LUTs. The primary one is an array of prime numbers up to 2^64. The other is a small inline one that contains the last prime number <= index i which is useful in constructors and common cases of small dictionaries.

You don't need all prime numbers. Storing fewer just limits the number of buckets you may use, so your load factor may be approximated less precisely.

Homer512
  • 9,144
  • 2
  • 8
  • 25
  • 1
    the LUT doesn't include *all* primes up to 2^64. There are over 2^58 of those. – Matt Timmermans Jun 05 '22 at 12:23
  • @MattTimmermans You're right, I never even considered that. :-D Anyway, hash tables don't need all possible primes, just a reasonable selection to work with – Homer512 Jun 05 '22 at 12:28
0

Here's a very small amendment to Will Ness' answer's test code. It's in JavaScript, which has C-like syntax:

// stackoverflow.com/a/69345662/849891
//   by stackoverflow.com/users/849891/will-ness

function *primesFrom(start) {
    //console.log("Starting....");
    if (start <= 2) { yield 2; }
    if (start <= 3) { yield 3; }
    if (start <= 5) { yield 5; }
    if (start <= 7) { yield 7; }
    const sieve = new Map();
    const ps = primesFrom(2);
    ps.next();                 // skip the 2
    let p = ps.next().value;   // p==3
    let psqr = p * p;          // 9
    let c = psqr;              // first candidate, 9
    let s = 6;                 // step
    let m = 9;                 // multiple
    while( psqr < start )
    {
        // must adjust initial state
         s = 2 * p;
         m = p + s * Math. ceil( (start-p)/s );
         while (sieve.has(m)) m += s;
         sieve.set(m, s);
         p = ps.next().value;
         psqr = p * p;
    }
    if ( start > c) { c = start; }
    if ( c%2 === 0 ) { c += 1; }
    
    // main loop
    for ( ;  true ;  c += 2 ) 
    {
        s = sieve.get(c);
        if (s !== undefined) {
            sieve.delete(c);
        } else if (c < psqr) {
            yield c;
            continue;
        } else {          // c == psqr
            s = 2 * p;
            p = ps.next().value;
            psqr = p * p;
        }
        m = c + s;
        while (sieve.has(m)) m += s;
        sieve.set(m, s);
    }
}


(function test() {
    maxItems = 50;
    target = Math.floor(Math.random() * Math.pow(2, 40));
    console.log("Target: " + target);
    from = Math.floor(target - 5*Math.log(target));
  
    const ps = primesFrom( from );
    let left;
    let right;
    for(i = 0; i < maxItems; i=i+1)
    {
      p = ps.next().value;
      if (p <= target) {
        left = p;
      } else {
        right = p;
        break;
      }
    }
    console.log(left + ", " + right);
})();
גלעד ברקן
  • 23,602
  • 3
  • 25
  • 61