3

Does anyone know of a perfect hashing function for URLs with 64-bit integers that would perform well for most URLs?

Kristopher Ives
  • 5,838
  • 7
  • 42
  • 67
  • 7
    If it is a perfect hash, by definition it performs well. – Ira Baxter Jun 08 '10 at 05:31
  • 1
    Why won't any basic string hashing function do? URLs are just strings and look pretty much like other strings to me. Any good string hashing function should perform extremely well if the number of buckets is decent compared to the load factor. – Ira Baxter Jun 08 '10 at 05:33
  • 1
    @Ira Baxter Sorry, I mean performs well in terms of hash size compared to acceptable URL patterns. A "Perfect Hashing Function", from my understanding, performs a mapping without collisions for certain inputs. – Kristopher Ives Jun 08 '10 at 06:22
  • ... for certain inputs. Yeah, but URLs have a vast number of combinations; is there is small, finite sized subset that you want perfectly hashed? If not, then you aren't going to get a perfect hash function. The real question is, does it matter? You didn't answer the question about why a basic string hash won't work for you. – Ira Baxter Jun 08 '10 at 08:11
  • 2
    @Ira Baxter You might want to see http://en.wikipedia.org/wiki/Perfect_hash_function – Kristopher Ives Jun 08 '10 at 23:40

2 Answers2

3

It is impossible to create a perfect hash function if you don't know the set of keys that you will query in advance. If you know that, then you could use a something like gperf or cmph to generate the perfect hash function for you.

http://cmph.sourceforge.net/

I assume that a perfect hash function is not really what you want, so it suffices for you to use any reasonable hash function out there, like murmur hash or bob jenkins hash, with a hash table implementation, like __gnu_cxx::hash_map or dense_hash_map and sparse_hash_map from google.

http://code.google.com/p/google-sparsehash/ http://sites.google.com/site/murmurhash/ http://burtleburtle.net/bob/hash/doobs.html

Davi
  • 507
  • 5
  • 15
2

Found this marked as a "Base52 url shortener perfect hash function in C" from http://lambdajones.com/b52

  const char *b52idx[52] = {
  "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
  "B", "C", "D", "F", "G", "H", "J", "K", "L", "M",
  "N", "P", "Q", "R", "S", "T", "V", "W", "X", "Y",
  "Z", "b", "c", "d", "f", "g", "h", "j", "k", "l",
  "m", "n", "p", "q", "r", "s", "t", "v", "w", "x",
  "y", "z"
};

#define X 0xff
const int b52map[128] = {
   X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X,
   X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X,
   X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X,
// 0  1  2  3  4  5  6  7  8  9
   0, 1, 2, 3, 4, 5, 6, 7, 8, 9, X, X, X, X, X, X,
//       B  C  D     F  G  H     J  K  L  M  N
   X, X,10,11,12, X,13,14,15, X,16,17,18,19,20, X,
// P  Q  R  S  T     V  W  X  Y  Z
  21,22,23,24,25, X,26,27,28,29,30, X, X, X, X, X,
//       b  c  d     f  g  h     j  k  l  m  n
   X, X,31,32,33, X,34,35,36, X,37,38,39,40,41, X,
// p  q  r  s  t     v  w  x  y  z
  42,43,44,45,46, X,47,48,49,50,51, X, X, X, X, X
};

#ifdef __GNUC__
#define likely(x) __builtin_expect((x),1)
#else
#define likely(x) (x)
#endif

/*
  valid from 00000 -> zzzzz, good for 380204032 urls
  returns the integral short url id
*/
unsigned long long b52(const char *c) {
  unsigned long long x = 0;
  unsigned long long y = 0;
  unsigned long long z = 0;

  x |= b52map[c[0]] << 24 | b52map[c[1]] << 18 | \
       b52map[c[2]] << 12 | b52map[c[3]] << 6  | b52map[c[4]];

  y += (x/64) * 12;
  if (x > 4095) y += 624 * (x/4096);
  if (x > 262143) y += 32448 * (x/262144);
  if (x > 16777215) y += 1687296 * (x/16777215);

  if (likely((z = x - y) < 380204033)) return z;
  else return 380204033;
}

void b52inc(char *id) {
  int x[5] = {
    b52map[id[0]], b52map[id[1]], b52map[id[2]],b52map[id[3]], b52map[id[4]]
  };
  int y = 5;

  // search for the first character we can increment (51 == 'z')
  // increment from the b52idx table and update id
  while (y--) if (x[y] < 51) break;
  id[y] = *b52idx[++x[y]];

  // if we passed over id's 'z's above, roll them over
  while (y++ < 5) if (x[y] == 51) id[y] = '0';
}
Kristopher Ives
  • 5,838
  • 7
  • 42
  • 67