
I have a huge set of long integer identifiers that need to be distributed into n buckets as uniformly as possible. The identifiers may have pockets of missing values. Given that, is there a difference between using the long integer as-is and taking it modulo n, versus generating a hashCode for the string version of the long integer and taking that modulo n (to improve the distribution)? Is the additional string conversion necessary to get the uniform spread via the hash code?

Since I got feedback that my question does not have enough background information, I am adding some more.

The identifiers are basically auto-incrementing numeric row identifiers, auto-generated in a database, representing an item id. The pockets of missing identifiers are due to deletes.

The identifiers themselves are long integers. The identifiers (items) number in the tens to hundreds of millions in some cases, and in the thousands in others.

Only in the cases where the identifiers number in the millions do I really want to spread them out into buckets (identifier count >> bucket count) for storage in a NoSQL system (partitions).

I was wondering whether, because items get deleted, I should resort to (Long).toString().hashCode() to get the uniform spread instead of using the long numeric directly. I had a feeling that doing toString().hashCode() was not going to fetch me much, and I also did not like that Java's hashCode does not guarantee the same value across Java revisions (though for String the hashCode implementation is documented and seems to have been stable across releases for years).
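For concreteness, here is a minimal sketch (in Java) of the two options I am comparing; the values of id and n are illustrative:

    long id = 123456789L;   // an example item id
    int n = 1024;           // number of buckets/partitions

    // Option 1: modulo on the raw long.
    int bucketDirect = (int) Math.floorMod(id, (long) n);

    // Option 2: modulo on the hashCode of the string form.
    // Math.floorMod is used because String.hashCode() can be negative.
    int bucketHashed = Math.floorMod(Long.toString(id).hashCode(), n);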

Swami PR
  • You didn't specify any constraint on buckets. For what it's worth, as the problem is presented, you can just drop the first total/n identifiers in first bucket regardless of their value, and move on to the next bucket. You'll get a perfect (albeit probably useless) distribution. – spectras Oct 10 '17 at 16:06
  • I just came across this: https://google.github.io/guava/releases/19.0/api/docs/com/google/common/hash/Hashing.html#consistentHash(long, int) (see the sketch after these comments). Would this help my use case better than the modulo approach I was attempting earlier? – Swami PR Oct 11 '17 at 15:12
  • It's hard to know without specifics of your dataset. Probably the easiest is to try it on a sample: start with the modulo approach; if it's not balanced, try the fastest hash, then walk down from fastest to slowest and pick the first one that yields acceptable results. – spectras Oct 12 '17 at 08:31
  • Like I mentioned in my question. I do not have a handle on how the production DB column ids are distributed actually. I can say this, however. They are generated sequentially (sequential long ids). So they start from about 1000 going up to around 240000000. There are going to be pockets of missing sequences owing to items being deleted (but not very frequently). So at the most, I can say that the ids are sequential for the large parts with holes. The dataset is not sparse for sure. So I could assume that the dataset is largely sequential for most parts. – Swami PR Oct 13 '17 at 14:35
  • Thanks for all the insights @spectras – Swami PR Oct 13 '17 at 14:40
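A hedged sketch of the Guava consistentHash approach mentioned in the comments above (assumes Guava is on the classpath; the id and bucket values are illustrative):

    import com.google.common.hash.Hashing;

    // consistentHash expects its input to already behave like a hash;
    // for largely sequential ids it may be safer to hash first.
    long id = 123456789L;
    int buckets = 1024;

    int direct = Hashing.consistentHash(id, buckets);
    int hashedFirst = Hashing.consistentHash(
            Hashing.murmur3_128().hashLong(id), buckets);

    // Unlike plain modulo, consistentHash minimizes remapping when the
    // bucket count changes: going from n to n+1 buckets moves only
    // about 1/(n+1) of the keys.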

2 Answers


There's no need to involve String.

new Integer(i).hashCode()

... gives you a hash - designed for the very purpose of evenly distributing into buckets.

new Integer(i).hashCode() % n

... will give you a number in the range you want.


However, Integer.hashCode() is just:

 return value;

So new Integer(i).hashCode() % n is equivalent to i % n.
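A quick check of that equivalence, and of what Long.hashCode does since the question is about longs, as a minimal sketch:

    int i = 123456;
    int n = 40;

    // Integer.hashCode() returns the value itself, so hashing adds
    // nothing before the modulo.
    System.out.println(Integer.valueOf(i).hashCode() == i);         // true
    System.out.println(Integer.valueOf(i).hashCode() % n == i % n); // true

    // Long.hashCode(v) is (int)(v ^ (v >>> 32)): it folds the high 32
    // bits into the low 32 bits, which for ids below 2^32 is again
    // just the value itself.
    long id = 123456789L;
    System.out.println(Long.hashCode(id) == (int) (id ^ (id >>> 32))); // true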

slim
  • That's pretty logical actually: this `hashCode` implementation assumes input values are already evenly distributed over all values that Integer can represent, and are used as such. This will fail horribly if that assumption doesn't hold. For example: imagine my input values are in range [0, 99] and I have 40 buckets. You'll get on average 50% more items in buckets 0-19 than in buckets 20-39. – spectras Oct 10 '17 at 16:27
  • The lesson here is that Integer's default hashCode is really bad, not that a hash code is unnecessary. – Matt Timmermans Oct 10 '17 at 16:56
  • 1
    No, Integer's hashCode() method is not "really bad". Its purpose is not to produce an evenly distributed set, but to try and produce a unique hashCode for each different value, as far as possible. And Integer.hashCode() achieves that task superbly! For evenly distributed, have a look at what the HashMap implementation does to the hashCode in order to turn it into an evenly distributed value. – DodgyCodeException Oct 10 '17 at 17:21
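For reference, the spreading step that last comment refers to: in Java 8, java.util.HashMap pre-mixes each hashCode like this (reproduced here as a standalone sketch):

    // XORs the high 16 bits into the low 16 bits, so the table-index
    // mask (which only keeps the low bits) still sees the high bits.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }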

Your question as-is cannot be answered; @slim's answer is the best you will get, because crucial information is missing from your question.

  • To distribute a set of items, you have to know something about their initial distribution.

If they are uniformly distributed and the range of the inputs is significantly larger than the number of buckets, then slim's answer is the way to go. If either of those conditions doesn't hold, it won't work.

  • If the range of inputs is not significantly larger than the number of buckets, you need to make sure the range of inputs is an exact multiple of the number of buckets, otherwise the last buckets won't get as many items. For instance, with range [0-999] and 400 buckets, the first 200 buckets get items [0-199], [400-599] and [800-999] while the other 200 buckets get items [200-399] and [600-799].

    That is, half of your buckets end up with 50% more items than the other half (the sketch after this list demonstrates it).

  • If they are not uniformly distributed, then since the modulo operator doesn't change the distribution except by wrapping it, the output distribution is not uniform either.

    This is when you need a hash function.

    But to build a hash function, you must know how to characterize the input distribution. The point of the hash function is to break the recurring, predictable aspects of your input.
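A minimal sketch demonstrating the imbalance from the second bullet (range [0-999], 400 buckets, plain modulo):

    int[] counts = new int[400];
    for (int id = 0; id < 1000; id++) {
        counts[id % 400]++;
    }
    // Buckets 0-199 receive 3 items each, buckets 200-399 only 2.
    System.out.println(counts[0]);   // 3
    System.out.println(counts[399]); // 2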

To be fair, there are some hash functions that work fairly well on most datasets, for instance Knuth's multiplicative method (assuming inputs are not too large). You might, say, compute

hash(input) = input * 2654435761 % 2^32

It is good at breaking clusters of values. However, it fails at divisibility. That is, if most of your inputs are divisible by 2, the outputs will be too. [credit to this answer]
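In Java, that could look like the following sketch; the int cast keeps the low 32 bits, which is exactly the mod 2^32:

    // Knuth's multiplicative hash: multiply by an odd constant close to
    // 2^32 / phi and keep the low 32 bits. It scatters clustered inputs
    // but, as noted above, preserves divisibility by powers of two.
    static int knuthHash(long input) {
        return (int) (input * 2654435761L);
    }

    // Math.floorMod because the 32-bit result can be negative.
    int bucket = Math.floorMod(knuthHash(id), n);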

I found that this gist has an interesting compilation of diverse hashing functions and their characteristics; you might pick the one that best matches the characteristics of your dataset.

spectras