0

I am trying to distribute a set of items across number of buckets. I am looking for following properties:

  1. Bucket assignment needs to be deterministic. In different runs same input should end up in the same bucket.

  2. Distribution of data between buckets should be uniform.

  3. This should work for fairly small number of inputs (e.g. if I want to distribute 50 inputs across 25 buckets ideally each bucket will have 2 items).

First try was to generate md5 from input data and form bucket from first bytes of md5. I am not too satisfied with uniformity. It works well when input is large but not so well for small input. E.g. distributing 100 items across 64 buckets:

List<string> l = new List<string>();

for (int i = 0; i < 100; i++)
{
    l.Add(string.Format("data{0}.txt", i));
}

int[] buckets = new int[64];

var md5 = MD5.Create();
foreach (string str in l)
{
    {
        byte[] hash = md5.ComputeHash(Encoding.Default.GetBytes(str));
        uint bucket = BitConverter.ToUInt32(hash, 0) % 64;
        buckets[bucket % 64]++;
    }
}

Histogram

Any suggestions what could I do to achieve higher uniformity? Thanks.

Klark
  • 8,162
  • 3
  • 37
  • 61
  • 4
    That sure looks like a pretty uniform distribution, given how small the sample size is. Anything that would do better for that one set of data is likely to perform similar to this (if not worse) for similar but slightly different data. It's not like you have any particularly large spikes. Your max value is still quite close to the average. – Servy Feb 14 '19 at 16:54
  • Well, hash is a function of data so its distribution cannot be independent of the data distribution. – Dmytro Mukalov Feb 14 '19 at 17:11
  • Have you compared this to the basic `String.GetHashCode()` ? MD5 was never meant for this, and I think it will be kind of slow. – H H Feb 14 '19 at 17:59
  • You can searching for packing algorithms. Romans tried to solve problem 2000 years ago. The looks for solution to load chariots going to war evenly. Put too much in one chariot is went to slow and easily tipped over. Put too little in chariot and you needed more chariots. – jdweng Feb 14 '19 at 18:03
  • @Henk: String.GetHashCode() is not guaranteed to return the same value for the same string in different runs of the program. The docs warn you about that. Also see https://andrewlock.net/why-is-string-gethashcode-different-each-time-i-run-my-program-in-net-core/ – rici Feb 14 '19 at 18:08
  • I don't think you can do better than that. Maybe it would suffice for each item to be in one of two deterministic places. Then you could use the power of two choices algorithms. – David Eisenstat Feb 14 '19 at 18:23
  • Have you considered first compressing the data using some compression algorithm (not necessarily lossless), interpreting the result as a binary number and using that modulo the number of buckets? Intuitively, encoded representations of your objects should be distributed fairly uniformly over the set of all available strings (to the degree this is not the case, the compression is less than optimal). – Patrick87 Feb 14 '19 at 18:42
  • Your requirements indicate that you're trying to do something that isn't really a hashing problem. What is the larger problem you're trying to solve? – Jim Mischel Feb 14 '19 at 18:43
  • @rici - I know, there is nothing in the question about persistance. If it has to be carried over to another process, you could roll your own. – H H Feb 14 '19 at 19:14
  • 1
    @henk: how do you interpret "In different runs same input should end up in the same bucket."? (Requirement 1 in the OP.) – rici Feb 14 '19 at 19:56
  • OK, I missed that. But still, MD5 is generating 128 bits of which only 6 are used. Crypto safety is also not called for. – H H Feb 14 '19 at 20:00
  • @Henk, yes I agree that MD5 is overkill. But good persistent fingerprint functions are not always easy to find, so I can see the temptation to use it. Here's a relevant previous question and answer: https://stackoverflow.com/questions/36845430/persistent-hashcode-for-strings – rici Feb 14 '19 at 20:27
  • Give us the *problem* that you think means you need to divide things into "buckets", instead, please. – Damien_The_Unbeliever Feb 14 '19 at 21:00

1 Answers1

1

Leaving aside the efficiency of using MD5 for this purpose (see the discussion here and in the marked duplicate of that question), basically the answer is that what you have is what a uniform distribution really looks like.

That might seem counter-intuitive, but it's easily demonstrable either mathematically or by experiment.

As a kind of motivating example, consider the task of choosing exactly 64 numbers in the range 0-63. The odds that you will get one per bucket are very close to 0. There are 6464 possible sequences, of which 64! contain all 64 numbers. The odds of getting one of these sequence is about one in 3.1×1026. In fact, the odds of getting a sequence in which no element appears three times is less than one in a thousand (it's about .000658). So it's almost certain that a random uniform sample of 64 numbers in the range 0-63 will have some triplets, and it's pretty likely that there will be some quadruplet. If the sample is 100 numbers, those probabilities just get even bigger.

But the maths are not so easy to compute in general, so here I chose to illustrate by experiment :-), using random.org, which is a pretty reliable source of random numbers. I asked it for 100 numbers in the range 0-63, and counted them (using bash, so my "graph" is not as pretty as yours). Here are two runs:

First run:

Random numbers: 
44  17  50  11  16   4  24  29  12  36
27  32  12  63   4  30  19  60  28  39
22  40  19  16  23   2  46  31  52  41
13   2  42  17  29  39  43   9  20  50
45  40  38  33  17  45  28   6  48  12
56  26  34  33  35  40  28  44  22  10
50  55  49  43  63  62  22  50  15  52
48  54  53  26   4  53  13  56  42  60
49  30  14  55  29  62  15  13  35  40
22  38  37  36  10  36   5  41  43  53

Counts:                                                                
                      X                 X         X             
    X       XX   X    X     XX      X   X  X      X  X          
  X X     X XX XXX X  X   X XXX  X XX XXXXXXXX  XXX XX XX   X XX
  X XXX  XXXXXXXXX XX XXX XXXXXXXXXXXXXXXXXXXXX XXX XXXXX   X XX
----------------------------------------------------------------
          1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6
0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2

Second run:

Random numbers:
41  31  16  40   1  51  17  41  27  46
24  14  21  33  25  43   4  36   1  14
40  22  11  22  30  19  23  63  39  61
 8  55  40   6  21  13  55  13   3  52
17  52  53  53   7  21  47  13  45  57
25  27  30  48  38  55  55  22  61  11
11  28  45  63  43   0  41  51  15   2
33   2  46  14  35  41   5   2  11  37
28  56  15   7  18  12  57  36  59  51
42   5  46  32  10   8   0  46  12   9

Counts:
           X                             X    X        X        
  X        X XX      XX                 XX    X    X   X        
XXX  X XX  XXXXX X   XX  X XX X  X  X   XX X XX    XXX X X   X X
XXXXXXXXXXXXXXXXXXXX XXXXX XX XXXX XXXXXXXXX XXXX  XXX XXX X X X
----------------------------------------------------------------
          1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6
0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2

You could try this with your favourite random number generator, playing around with the size of the distribution. You'll get the same sort of shape.

Community
  • 1
  • 1
rici
  • 234,347
  • 28
  • 237
  • 341