I have two unsorted arrays of 32-bit unsigned integers, size N1 and N2, respectively. Each array may contain duplicates. I would like to map each value (2^32 possible keys) to a spot in a byte-array of size (N1 + N2) to record frequencies of each key. Duplicate key values should map to the same position in this array. Additionally, the frequency of each integer won't go above 100 (which is why I chose a byte-array to record each key's frequency to save space); if the max possible frequency were to go above this, I would simply change the byte-array to an array of shorts or something.

In the end, I need an array of size N1 + N2 -- not necessarily all entries will be used, as duplicates may have been encountered -- with frequencies of each unique key value. Worst case scenario, only one byte entry will be used (e.g. all values in both arrays are the same) leaving ((N1 + N2) - 1) entries unused. Best case scenario, all byte-entries are used.

From what I understand, I need a minimal perfect hash function to map a known number of unknown keys (N1 + N2 of them, each ranging from 0 to 2^32 - 1) to a known number of spots (N1 + N2). I was able to find a couple of other posts, but both answers basically said to use gperf:

Is it possible to make a minimal perfect hash function in this situation?

Minimal perfect hash function

The second one (Minimal perfect hash function) is exactly what I'm attempting to do.

Rather than expecting source code from an answer (I'm using C, by the way), I'd much prefer an explanation of how to go about creating a minimal perfect hash function for N of any possible positive integers to N buckets. I could easily do this with a 4 GB array of direct mappings for every possible integer, with lots of unused space, but I'd rather try to reduce that massive waste of space. I'm also hoping not to use any external libraries, mostly for educational purposes, to learn more about hashing itself.

datboi
  • Why must it be _minimal_? Why must it be _perfect_? Given the requirement tablesize = (n1+n2), an ordinary hash function would need only ~1.5 probes on average. – wildplasser Nov 07 '13 at 00:27
  • I never said it needed to be perfect; my goal's a minimal perfect hash table. And I'm honestly just looking for advice on how to design a hash function that can generate unique indices for any positive integer, using at most (N1 + N2) entries with no collisions. Not sure if this is right, but a worst-case scenario with collisions allowed results in O(N) operation, would it not? – datboi Nov 07 '13 at 00:42
  • A perfect hash function would have exactly 2^32 entries (in this case) for N input values, so no collisions would ever happen. A minimal perfect hash function would only require N unique entries for N unique input values. This is a good explanation of the difference between the two: http://cmph.sourceforge.net/concepts.html – datboi Nov 07 '13 at 01:07

2 Answers


This is clearly impossible. If you have N numbers, there's no way to come up with a function which will hash them all to distinct values in the range [0, N) unless you know what those numbers are going to be beforehand. Otherwise, given any such function (with N < 2^32, of course), there will be at least one pair of integers such that both of those integers hash to the same value, so that function won't be perfect if those integers both show up in the input.

If you relax the conditions to allow the function to be created on the fly, this becomes possible, but only in a really trivial and useless way. Namely, a hash function could build itself up as it goes by recording each number that's fed into it and generating a new unique output for each one (say, counting up from 0). But such a function would need a hash table (or something equivalent) as part of its implementation, so it'd certainly be no use in implementing a hash table!
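That "trivial and useless" on-the-fly scheme can be sketched in a few lines (the names are made up for illustration, and a linear scan stands in for the lookup structure to keep the point visible):

```c
#include <stdint.h>
#include <stddef.h>

/* The trivial on-the-fly "perfect hash": hand out indices 0, 1, 2, ...
 * as new keys arrive. The catch the answer describes: to recognize a key
 * it has seen before, the function must remember every key -- i.e. it
 * already contains the lookup structure it was supposed to replace. */
struct otf_hash {
    uint32_t *seen;   /* keys in order of first appearance */
    size_t    n;      /* number of distinct keys so far */
};

/* Returns a stable index in [0, n) for 'key'. A linear scan is used for
 * clarity; a real implementation would need a hash table here. */
static size_t otf_index(struct otf_hash *h, uint32_t key)
{
    for (size_t i = 0; i < h->n; i++)
        if (h->seen[i] == key)
            return i;             /* key already registered */
    h->seen[h->n] = key;          /* first sighting: assign next free index */
    return h->n++;
}
```

Duplicates map to the same index and distinct keys never collide, but only because the function is a dictionary in disguise.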

  • They shouldn't ALL hash to unique values -- they can, though. If all N1 + N2 numbers are the same actual number, they would all hash to the same spot in the N1 + N2 array. Not sure if this changes whether or not it's possible, but it sounded like you might have misunderstood the question. – datboi Nov 07 '13 at 00:49
  • @user2962642 Math is not fueled by hopes; it thrives on fear. The possibility of helpful duplicates cannot provide for any solution that also works without them. You only need to consider the worst case. – Potatoswatter Nov 07 '13 at 01:04
  • @user2962642: Your comment says that entries do not need to hash to distinct spots because some entries may be equal. That is different from the problem this answer addresses, which is that no possible function can guarantee it will hash different entries to different spots, because the number of entries it can be given (over multiple possible calls) is greater than the number of spots. – Eric Postpischil Nov 07 '13 at 01:08
  • E.g., consider a hash function f(x) that we claim maps any list of N1+N2 entries to N1+N2 spots, with no duplicates unless the entries are duplicated. Consider the values f(0), f(1), f(2),… f(N1+N2). There are N1+N2+1 of these values. Therefore, at least two different values among them must map to the same spot in the N1+N2 spots. All I have to do is find those two entries, A and B, and then present a list containing both A and B. Then the function f fails to hash differently-valued entries in this list to different spots. – Eric Postpischil Nov 07 '13 at 01:10
  • @Eric: Going to the pigeonhole principle in the other answer posted here, would the number of "pigeons" be N1 + N2 since we know this is the maximum number of unique keys we can possibly have or would it be 2^32? – datboi Nov 07 '13 at 01:23

With 2^32 possible keys and only N1 + N2 "hash slots", the Pigeonhole Principle says you will have slots occupied by more than one number. In other words: different numbers will "hash" to the same value.

Now, I wonder if you could benefit from a Bloom Filter. From Wikipedia:

False positive matches are possible, but false negatives are not; i.e. a query returns either "possibly in set" or "definitely not in set".

If something is "definitely" not in the set of keys seen so far, you can record a frequency of one and move on; if it possibly is in the set, then you process it further to accumulate its actual count.
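A minimal Bloom filter along those lines might look like this (the bit-array size, the two multiplicative hash constants, and all names are arbitrary illustrative choices, not tuned values):

```c
#include <stdint.h>
#include <string.h>

/* A tiny Bloom filter over 32-bit keys: one bit array and k = 2 hash
 * functions. False positives are possible; false negatives are not. */
#define BLOOM_BITS 8192u

struct bloom { uint8_t bits[BLOOM_BITS / 8]; };

static void bloom_init(struct bloom *b)
{
    memset(b->bits, 0, sizeof b->bits);
}

static uint32_t bloom_h1(uint32_t x) { return (x * 2654435761u) % BLOOM_BITS; }
static uint32_t bloom_h2(uint32_t x) { return (x * 40503u + 1u) % BLOOM_BITS; }

static void bloom_add(struct bloom *b, uint32_t x)
{
    uint32_t h1 = bloom_h1(x), h2 = bloom_h2(x);
    b->bits[h1 / 8] |= (uint8_t)(1u << (h1 % 8));
    b->bits[h2 / 8] |= (uint8_t)(1u << (h2 % 8));
}

/* Returns 0 if x is definitely not in the set, 1 if it possibly is. */
static int bloom_maybe(const struct bloom *b, uint32_t x)
{
    uint32_t h1 = bloom_h1(x), h2 = bloom_h2(x);
    return ((b->bits[h1 / 8] >> (h1 % 8)) & 1u) &&
           ((b->bits[h2 / 8] >> (h2 % 8)) & 1u);
}
```

On a "definitely not in set" answer you know this is the key's first occurrence; on a "possibly in set" answer you fall back to a slower exact structure to update the real count.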

Escualo
  • I'm somewhat unclear about this, but in the scenario I'm proposing, would the number of "pigeons" be N1 + N2 or 2^32? I understand that the number of "pigeonholes" would be N1 + N2, since that's what I'm trying to map these values to. – datboi Nov 07 '13 at 01:18
  • From how I understood the question (and I may have misunderstood it), there are 2^32 pigeons, because those are your possible data members (otherwise the hash would not be perfect). If you were to create a hash function on a *per problem instance* basis, you could relax this assumption, but you would lose all generality (i.e., you would have to create a new hash function for every pair of data containers). – Escualo Nov 07 '13 at 02:52
  • Yes, I think you're right. I was confused and thought that the _domain_ of the input was 2^32, not the number of "pigeons". I am pretty sure they are one and the same, though. – datboi Nov 07 '13 at 23:56