1

What's the value range of String.GetHashCode()?

For random strings with different length, are their hash code value range different?

e.g.

There're 2 groups of random strings. Group 1 strings are of length 5. Group 2 strings are of length 10. Do the 2 groups have the same hash code value range?

Update 1

My problem scenario is:

I have a method with input as some fixed length random GUID strings. I need to pick a fixed (but not predefined) set of them at a fixed percentage. I am considering to divide the string hash code value range into 10 segments, and pick the strings whose hash value falls into the first segment. Thus I got a fixed 10% of all the input strings.

Update 2

The input GUID strings are not given in a list. They are given one by one. And there can be duplicated ones. And I will never know how many they are. I just need to make sure the overall percentage. And if a string was picked before, they will always be picked.

Below is my experiment:

static void Main(string[] args)
{
    double min = int.MaxValue / 100.0 * 15.0;
    double max = int.MaxValue / 100.0 * 25.0;
    double total = 0;
    double picked = 0;
    Console.WriteLine("range ratio: {0:f4}%", (max - min) / int.MaxValue * 100);

    for (int i = 0; i < 500000; i++)
    {
        string mcid = Guid.NewGuid().ToString();
        int hash = mcid.GetHashCode();
        total++;
        if (hash >= min && hash <= max)
        {
            picked++;
        }
        Console.Write("\rPicked: {0:f4}, Total {1:f4}, Ratio: {2:f4}%", picked, total, picked / total * 100.0);
    }
}

I run the code several times, the output is a bit strange. The ratio of picked GUID is always half of the range ratio. If this is true. I guess I can just use a double-sized range.

for example:

range ratio: 10.0000%

Picked: 25028.0000, Total 500000.0000, Ratio: 5.0056%

Community
  • 1
  • 1
smwikipedia
  • 61,609
  • 92
  • 309
  • 482
  • 1
    You can inspect the source yourself: http://referencesource.microsoft.com/#mscorlib/system/string.cs – Blorgbeard Aug 27 '14 at 03:27
  • By the way, are you trying to solve a problem with this question? – Blorgbeard Aug 27 '14 at 03:28
  • 1
    So, you want to pick a random set of 10% from a list of strings? I wouldn't use hashing, just shuffle and take the top (list.Length / 10) - see http://stackoverflow.com/questions/48087/select-a-random-n-elements-from-listt-in-c-sharp – Blorgbeard Aug 27 '14 at 03:38
  • Are these v4 (random) GUIDs? You could simply use a portion of the random bits from the guid to bucket them instead of going through String.GetHashCode. – Blorgbeard Aug 27 '14 at 05:03
  • I cannot use the shuffle approach. Because the GUID strings is not given in a list. – smwikipedia Aug 27 '14 at 05:28

1 Answers1

2

This is a definite "XY Problem" style question.

If you want to choose 10% of the GUIDs you are given as they are given, why not generate a random number in the range [0,1) and if the number is less than 0.1 then select the GUID.

Keep the GUIDs you pick in a list, and if it's provided again and is in the list, then it gets picked again (if I understand the "if a string was picked before, they will always be picked" requirement).

Community
  • 1
  • 1
Michael Burr
  • 333,147
  • 50
  • 533
  • 760
  • My scenario is part of a web application solution. And the GUIDs are used to represent clients. My purpose is to route a ***fixed*** set of clients to a test environment (so it's not a traditional traffic manipulation). Due to the massive amount of clients, I cannot keep the GUIDs in a list. I need kind of an *in-place* solution. – smwikipedia Aug 27 '14 at 05:32
  • 1
    I see - I think if you use your own hash function that you can control the distribution of (and verify with testing) you could use that hash to deterministically select whether a GUID is on one set or another (essentially, that's what a hash is supposed to do in a hash table: more or less evenly distribute items into buckets). However, I think I'd use my own hash function rather than one in the framework to make sure the function used will have the characteristics I wanted. – Michael Burr Aug 27 '14 at 06:18
  • +1. The. Net hash function may change at any time. You shouldn't rely on its internals. – Blorgbeard Aug 27 '14 at 07:34