0

I have a program which uses a hashtable and chaining to save a maximum of 100.000 strings. Each string has a maximum length of 20 and is made of a mix of (a-z) and (1-9) and is unique.

To save it into the hashtable I convert the strings to a number by adding up the ASCII values of their characters. This means the highest number possible is 2440 (20 times z).

I am having trouble thinking of the best hash function for this (algorithm efficiency wise). I have searched on Google and I haven't found a lot of detailed information about how to find the best hash function.

I have tried using hash % m (in which m is a prime number) as a hashfunction but my program is still quite slow.

user1760791
  • 403
  • 3
  • 12
  • 4
    Why not just use `HashSet` with the built in algorithm for strings? – Scott Chamberlain Jun 08 '15 at 13:31
  • 1
    Its an assignment for university and we are not allowed to use certain built in methods. – user1760791 Jun 08 '15 at 13:32
  • 3
    If it's a school assignment then the ethical thing to do would be to show us what you've tried so far and ask for suggestions on how to fix _specific problems_. – D Stanley Jun 08 '15 at 13:34
  • First: adding up the ASCII Values would not help, because 2 string with different character order would have the same hashcode, example "ABC" and "BCA" – yahya el fakir Jun 08 '15 at 13:37
  • I suggest to close this question for ethical reason: it sounds like a request for code-writing service pertinent to some educational task. Also, it's overly broad and opinion-based. Thanks for understanding. Best regards, – Alexander Bell Jun 08 '15 at 13:39
  • @yahyaelfakir That's not a problem for a hash - equal objects *must* have the same hash, but non-equal objects *can* have the same hash. Doesn't mean it's a reasonable hashing method, of course; but even that depends on the kind of data the OP is working with... – Luaan Jun 08 '15 at 13:40
  • Your current algorithm is likely resulting in a lot of collisions. There are a huge number (well over 10^30) of different strings that match your requirement and as you note you are just creating 2440 different hash codes. – juharr Jun 08 '15 at 13:42
  • What role does that number have in your hashtable, 2440? What do you use it for? Is it the key? – Lasse V. Karlsen Jun 08 '15 at 13:43
  • @Luaan yeah i agree that you can use it as a hashfunction, but it would not be a very efficient one. you would have more collisions – yahya el fakir Jun 08 '15 at 13:43
  • My own package at https://www.nuget.org/packages/SpookilySharp does a good job for speed, especially on 64 bit, if I may say so. It's also great for avoiding collisions (I can claim some credit for the speed, but Bob Jenkins deserves most of it, and all the credit for the collision avoidance). – Jon Hanna Jun 08 '15 at 15:35

1 Answers1

0

Adding up "ASCII values" is a good start for an educational hash function. Your concern around the maximum has value of 2440 is valid. A good and simple way to get more diverse numbers is this:

int hash = 1234;
foreach (var c in chars) {
 hash += c;
 hash *= 37;
}

The multiplication propagates bits to the left and makes all bits interact with many others.

You can incrementally improve this function.

usr
  • 168,620
  • 35
  • 240
  • 369