1

I am building a hash table, where the key is a phone number (here are some of them):

6948060987
6960780800
6963208768
6944870406
6947279288
6953691771
6956094283
6947092062
6960086297
6947719197
6951516975
6957531584
6969211184
6963238579
6957054322
6952077216
6956907738

The number of entries will be 200, 2000, 20000 and 2000000 and the entries will be unique.

About the size of the table, I am following this answer.

I store the phone number as an array of char's. I noticed that all the numbers begin with 69, so I can skip them in the hash function.

My attempt is to take the sum of the digits and do a modulo with the number of cells in the hash table, but it seems (on paper) that this is a bad function, since there are many collisions.

How should I modify my hash function to get better results (less collisions)?

Community
  • 1
  • 1
gsamaras
  • 71,951
  • 46
  • 188
  • 305
  • 3
    You could use integer representation of a phone number as its hash. – AlexD Oct 14 '14 at 20:02
  • @AlexD This would avoid collisions... unless two people happen to share a phone number, which can happen – Jon Kiparsky Oct 14 '14 at 20:04
  • So you mean @AlexD that the function will find the integer representation? Jon the phones are unique, I will update. – gsamaras Oct 14 '14 at 20:05
  • @G.Samaras Yes, just `atoi` or alike. – AlexD Oct 14 '14 at 20:07
  • @JonKiparsky two people having the same phone number is irrelevant when the phone number has already been chosen as the hash key. The problem is to choose a good hash function for the chosen key. – John Bollinger Oct 14 '14 at 20:08
  • @AlexD, so the body of the function you suggest would be something like this: `int h(char* phone) { return atoi_or_alike(phone) % hast_table.size;}` ? – gsamaras Oct 14 '14 at 20:09
  • Whether `atoi()` or `atol()` (or something else) is better depends on the type of the desired hash result. – John Bollinger Oct 14 '14 at 20:09
  • @G.Samaras I would skip `% hast_table.size`. – AlexD Oct 14 '14 at 20:10
  • @G.Samaras: that should work well, but you might prefer to use a hash table whose size is prime. – rici Oct 14 '14 at 20:10
  • @AlexD yes, skip the modulo in the hash function computation. It instead goes in the code that chooses a hash bucket based on the hash result. – John Bollinger Oct 14 '14 at 20:11
  • @rici, in the linked answer, I had two answers where the debate if the prime will make any difference or not. I think it will. Alex, how will then the function map every phone to a cell of the hash table? – gsamaras Oct 14 '14 at 20:12
  • @G.Samaras the argument over whether the number of buckets should be prime is essentially about distributing entries evenly among the buckets. If the keys' hashes are uncorrelated and evenly distributed over the value space of their type, then it doesn't matter for that purpose whether the number of buckets is prime, but otherwise they are somewhat more likely to be spread evenly among a prime number of buckets than among a composite number of buckets. – John Bollinger Oct 14 '14 at 20:19
  • @G.Samaras Ah, it is hash _table_. Yes, you could try `% hast_table.size`. – AlexD Oct 14 '14 at 20:21
  • @JohnBollinger: Telephone numbers (in the North American Numbering Plan) are of the form AAACCCNNNN where both AAA and CCC are of the form NXX where X is any digit except that XX cannot be 11, and N is a digit other than 0 or 1. For historical reasons, some values of CCC and many values of AAA are much less common than others because originally they came from disjoint sets. In short, the numbers are not evenly distributed over the value space of 10-digit numbers (although the deviation is less marked than it was when I was younger). – rici Oct 14 '14 at 20:30
  • @G.Samaras: In addition to my comment addressed to JohnBollinger, the distribution of area codes and possibly even central office codes will be biased if the geographical distribution of the phone numbers is biased, which is pretty common. I don't know anything about your application, so use your own judgement. – rici Oct 14 '14 at 20:43
  • @rici, the phones all have the format I specified in my answer. Don't worry about these statistics. The people reading this question should be focused on the numbers I provided. – gsamaras Oct 14 '14 at 20:45
  • @G.Samaras: The numbers you provided do not indicate a uniform distribution over the number space :) But it is not clear whether that is a random sample of your keys or just a slice. Anyway, using a prime number costs nothing and is safer. – rici Oct 14 '14 at 20:47
  • I will agree on the prime case @rici. It's just a slice. However, the question is about the hashing function, but thanks for the information. :) – gsamaras Oct 14 '14 at 20:49
  • @G.Samaras: with respect to the question, as I said, `atol(key)%sz` is fine (provided that longs are 64 bits; `atoll` would be safer), but it would be better if sz were prime. (Or at least didn't have a power of 10 as a factor.) – rici Oct 15 '14 at 04:27

1 Answers1

3

Why do you need to a non-standard hash function at all?

There are plenty of hash functions which are well tested and have known properties which will work fine for any input, thus will also work well for phone numbers, which are after all a subset of ASCII strings. Is your application so time critical that you need to design your own hash function and risk something with more collisions? If not, why not use one of the well known hash functions?

For instance, if you need something with cryptographically demonstrable collision resistance, use SHA-256 (truncated if you want). If you are not worried about an adversary, use something like universal hashing. Unless your problem is very specialised, you will be better off using someone else's well tested hash algorithm than trying to invent one yourself.

An even easier hash is the original hash perl used, which worked as follows:

# Return the hashed value of a string: $hash = perlhash("key")
# (Defined by the PERL_HASH macro in hv.h)
sub perlhash
{
    $hash = 0;
    foreach (split //, shift) {
          $hash = $hash*33 + ord($_);
    }
    return $hash;
}

In English, it takes the current hash value, multiplies by 33, and adds the ASCII value of the next character on. It's not a great hash, but it worked for perl for a long while.

abligh
  • 24,573
  • 4
  • 47
  • 84
  • I stated why I want to modify my hash function (it causes many collisions). I would like to use a function from the ones you describe. I am not looking for the super ideal hash function, but for a good hash function. SHA-256 seems too much for me. The link seems nice. `p` is the `hash_table.size` I guess. What is `a`? – gsamaras Oct 14 '14 at 21:48
  • @G.Samaras sorry - I meant "why do you need a non-standard hash function at all", i.e. why not start with and stick with a standard one. I have clarified the answer. – abligh Oct 14 '14 at 22:05
  • It's ok. What about the parameters I mentioned. – gsamaras Oct 14 '14 at 22:06
  • See the sentence above the code. `p` is a prime much larger than either 255 (maximum of each value) or the number of bins, and `a` is randomly chosen and less than `p`. You will have to take the whole lot `mod m` at the end where `m` is the number of bins. I've also posted an even easier algorithm perl uses (again you will have to take the result `mod m` at the end. – abligh Oct 14 '14 at 22:15
  • How I choose `p`? Randomly but with restrictions? – gsamaras Oct 14 '14 at 22:22
  • According to the article, `p` must be prime, and much larger than either 255 or the number of bins. The article suggests using a Mersenne prime (e.g. 2^61-1). Note the perl program I provided is effectively doing the same thing using `a=33` and an assumed prime number of bins. Java appears to use `a=31`: http://www.cs.cmu.edu/~adamchik/15-121/lectures/Hashing/hashing.html – abligh Oct 14 '14 at 22:31