6

I'm trying to come up with an algorithm to hash a string into a specific number of buckets but haven't had any luck coming up with ideas on how to do this?

I have a list of strings like this:

a.jpg
b.htm
c.gif
d.jpg
e.swf

and i would like to run a function to get a number between 1 and 4 based on the string.

e.g. a.jpg would be 3
b.htm would be 2
c.gif would be 1
etc

it needs to be consistent so if i run the function on a.jpg it always returns 3.

this algorithm would be for splitting resources between servers...

e.g. a.jpg would be accessed from server3.mydomain.com
b.htm would be accessed from server2.mydomain.com
etc

Does anyone know how I would go about doing this?

Any advice would be much appreciated!

Cheers

Tim

  • 2
    All of the solutions below that suggest using `GetHashCode` are fundamentally broken. See http://stackoverflow.com/questions/5154970/how-do-i-create-a-hashcode-in-net-c-for-a-string-that-is-safe-to-store-in-a-d. If your servers are running different versions of .NET, it's likely that they will compute different hash values. If you want a guarantee, implement your own hashing function. – Jim Mischel Mar 23 '11 at 23:47
  • thanks, you're right that is a problem as i need the hashes to be consistent. the hash function from jon skeet's answer in the link you posted looks like it will work just fine. –  Mar 24 '11 at 01:08

4 Answers4

4

You may find the following blog post useful. The proposed algorithm is:

int bucketIndex = (int)((uint)"d.jpg".GetHashCode() % (uint)buckets.Length);
Darin Dimitrov
  • 1,023,142
  • 271
  • 3,287
  • 2,928
  • You need to add 1 to that (since the OP wants a value between 1 and buckets) – Thomas Levesque Mar 23 '11 at 23:11
  • @Thomas Levesque, the proposed algorithm is from the blog post I've linked to in my answer. In it the author uses this as index, so it is 0 based. I didn't want to modify it in order to preserve the original idea. The OP could adapt based on his needs. – Darin Dimitrov Mar 23 '11 at 23:13
  • Is GetHashCode() a consistent implementation across all .NET framework versions past, present and future? Depending on who is participating in this process, you might want to use an explicit known hash algorithm. – Eric Rini Nov 07 '17 at 15:45
1

Standard GetHashCode and % will work: Math.Abs("aaaa".GetHashCode()) % numberOfBuckets.

EDIT thanks Thomas Levesque for reminding of GetHashCode() returning < 0. Added Math.Abs to have correct code, but versions in other answers are likely work better.

Alexei Levenkov
  • 98,904
  • 14
  • 127
  • 179
  • This is exactly what I'm after but I've just run the code on 1000 strings and noticed it does return some negative values. Is there a quick fix for this? –  Mar 23 '11 at 23:14
  • 2
    This is going to fail if `GetHashCode` returns `int.MinValue`, because `Math.Abs(int.MinValue)` will throw `OverflowException`. – Jim Mischel Mar 23 '11 at 23:55
1
int bucket = (int)(unchecked(((uint)s.GetHashCode())) % 4 + 1)

(where s is the string)

Thomas Levesque
  • 286,951
  • 70
  • 623
  • 758
-1

Use a hash algorithm based on a shared machine key. This will create a unique identifier per string. If you require integers then use a dictionary object to map strings to ints. Every time you add a new string set its key to the current dictionary length. Finally store the dictionary in a farm based state object such as a shared session so that each site instance can reference it.

Brian Scott
  • 9,221
  • 6
  • 47
  • 68