Perfect minimal hash for mathematical combinations

Question

First, define two integers N and K, where N >= K, both known at compile time. For example: N = 8 and K = 3.

Next, define a set of integers [0, N) (or [1, N] if that makes the answer simpler) and call it S. For example: {0, 1, 2, 3, 4, 5, 6, 7}

The number of subsets of S with K elements is given by the formula C(N, K). Example

My problem is this: Create a perfect minimal hash for those subsets. The size of the example hash table will be C(8, 3) or 56.

I don't care about ordering, only that there be 56 entries in the hash table, and that I can determine the hash quickly from a set of K integers. I also don't care about reversibility.

Example hash: hash({5, 2, 3}) = 42. (The number 42 isn't important, at least not here)

Is there a generic algorithm for this that will work with any values of N and K? I wasn't able to find one by searching Google, or my own naive efforts.

Just to be clear; what are the input and output of your hash function? — Oliver Charlesworth, Jan 04 '13 at 16:30
@OliCharlesworth Edited. Input is one of the possible combinations, output is an integer in [0, C(N, K)). — Kendall Frey, Jan 04 '13 at 16:32

Ivaylo Strandjev · Answer 1 · 2013-01-04T18:41:31.997

3

There is an algorithm to code and decode a combination into its number in the lexicographical order of all combinations with a given fixed K. The algorithm is linear to N for both code and decode of the combination. What language are you interested in?

EDIT: here is example code in c++(it founds the lexicographical number of a combination in the sequence of all combinations of n elements as opposed to the ones with k elements but is really good starting point):

typedef long long ll;

// Returns the number in the lexicographical order of all combinations of n numbers
// of the provided combination. 
ll code(vector<int> a,int n)
{
    sort(a.begin(),a.end());
    int cur = 0;
    int m = a.size();

    ll res =0;
    for(int i=0;i<a.size();i++)
    {
        if(a[i] == cur+1)
        {
            res++;
            cur = a[i];
            continue;
        }
        else
        {
            res++;
            int number_of_greater_nums = n - a[i];
            for(int j = a[i]-1,increment=1;j>cur;j--,increment++)
                res += 1LL << (number_of_greater_nums+increment);
            cur = a[i];
        }
    }
    return res;
}
// Takes the lexicographical code of a combination of n numbers and returns the 
// combination
vector<int> decode(ll kod, int n)
{
    vector<int> res;
    int cur = 0;

    int left = n; // Out of how many numbers are we left to choose.
    while(kod)
    {
        ll all = 1LL << left;// how many are the total combinations
        for(int i=n;i>=0;i--)
        {
            if(all - (1LL << (n-i+1)) +1 <= kod)
            {
                res.push_back(i);
                left = n-i;
                kod -= all - (1LL << (n-i+1)) +1;
                break;
            }
        }
    }
    return res;
}

I am sorry I have an algorithm for the problem you are asking for right now, but I believe it will be a good exercise to try to understand what I do above. Truth is this is one of the algorithms I teach in the course "Design and analysis of algorithms" and that is why I had it pre-written.

edited Jan 04 '13 at 18:41

answered Jan 04 '13 at 16:30

Ivaylo Strandjev

69,226
18
123
176

Is there no constant time algorithm? I'd really prefer that. – Kendall Frey Jan 04 '13 at 16:42
@KendallFrey You can not have a constant time algorithm - your input is of size `N` (for the problem I solve above)! You can not perform better than the algorithm above. I guess you can write a O(K) solution for your problem by slightly modifying the solution above and again you **can't perform better** as this is the input size. – Ivaylo Strandjev Jan 04 '13 at 16:42
Sorry, by constant time I meant O(K). K will be a compile-time constant. – Kendall Frey Jan 04 '13 at 16:48
@KendallFrey I believe you can get as low as `O(K)` assuming you have already calculated the powers of two up to `N`. Still I think that as K will probably be in the order of N there will not be a huge difference(Number of combinations grows really fast so both N and K will be quite small). – Ivaylo Strandjev Jan 04 '13 at 16:57
@KendallFrey is it possible that you got the complexity for my solution wrong? My solution is linear to the number of numbers in S namely `N`, not to the number of possible combinations. – Ivaylo Strandjev Jan 04 '13 at 17:00
I understand the `O(N)` complexity, but it seems excessive to me because my specific implementation will have `N` quite large (< 100 though). – Kendall Frey Jan 04 '13 at 17:05
Why would you cache powers of two though? A bitshift is one of the fastest things you can do – harold Jan 04 '13 at 17:25
@harold true. I can't seem to remember why I did it and it makes no sense to me now :). I will fix that – Ivaylo Strandjev Jan 04 '13 at 18:39
@izomorphius sort(a.begin(),a.end()) is not linear – Dejan Jovanović Jan 04 '13 at 20:24
it is if your values are in small range. Seems there is optimization to use count sort in such cases. Verified on gcc 4.3 looong ago – Ivaylo Strandjev Jan 04 '13 at 21:33

Michael · Answer 2 · 2013-12-16T23:07:52.447

This is what you (and I) need:

hash() maps k-tuples from [1..n] onto the set 1..C(n,k)\subset N. The effort is k subtractions (and O(k) is a lower bound anyway, see Strandjev's remark above):

// bino[n][k] is (n "over" k) = C(n,k) = {n \choose k}
// these are assumed to be precomputed globals

int hash(V a,int n, int k) {// V is assumed to be ordered, a_k<...<a_1
  // hash(a_k,..,a_2,a_1) = (n k) - sum_(i=1)^k (n-a_i   i) 
  // ii is "inverse i", runs from left to right

  int res = bino[n][k];
  int i;

  for(unsigned int ii = 0; ii < a.size(); ++ii) {
    i = a.size() - ii;   
    res = res - bino[n-a[ii]][i];
  }
  return res;
}

Perfect minimal hash for mathematical combinations

2 Answers2

Linked