Hashing the arrangement of an array

Question

I am trying to write a function which takes an array of integers, and outputs a hash value to signify their arrangement order. The hash key should be as small as possible (we're dealing with embedded and space/execution optimization is critical)

template<size_t N>
int CalculateHashOfSortOrder(int* itemsBegin)
{
    // return ???
}

In other words, an array [ 4, 2, 1, 3 ] should produce an ID which reflects their order and would match any of the following arrays:

[ 4, 2, 1, 3 ]
[ 4, 2, 0, 3 ]
[ 400, 200, 100, 300]
et al

I am at a loss for how to do this in the most efficient way possible. What is the algorithm to approach this? The methods I come up with seem extremely slow.

For example, given the above example, there are 24 possibly arrangements (4 * 3 * 2 * 1), which would fit into 5 bits. Those bits can be calculated like this:

1st value (1). There are 4 positions possible, so 2 bits are needed to describe its position. This value is at position 2 (0-based). So push binary 10.
2nd value (2). There are 3 positions possible, so 2 bits are needed. It's at position 1, so push binary 01. Hash key is now 1001b.
3rd value (3). There are 2 positions possible, so 1 bit is needed. It's at position 1 of the remaining values, so push 1.

The resulting key is 10011b, but I don't see a clear way to make that anything but obnoxiously inefficient. I think there's another way to approach this problem I just haven't thought of.

edit: A couple ideas come to mind:

Calculate the rank of each item in the array, and hash the rank array. Then by what method can you pack that rank array into a theoretically-smallest ID? This appears to be the most vexing element of my question.
Find a way to save item rank as they're inserted, optimizing #1
For sufficiently small # of items (<10), it may be possible to just generate a huge tree of if() statements via metaprogramming. Need to see if exe size would be a problem.

"obnoxiously inefficient" -- In space? Or clumsiness of calculation? 5 bits is _nearly_ the theoretical minimum. For 4 items, it is the min (5 bits); for 5 items, you would get 8 bits, but the theoretical min is 7. — Rick James, Oct 11 '20 at 19:47
@RickJames mostly I can't think of a straight-forward implementation. My human-style description doesn't seem to translate to code very time-efficiently. — tenfour, Oct 11 '20 at 21:24
Maybe "hash" is the wrong term; @Dave I can't tolerate any collisions. Each arrangement should produce a unique space-optimized ID. — tenfour, Oct 12 '20 at 07:15
The term you are looking for in that case is probably canonical form. https://en.wikipedia.org/wiki/Canonical_form — Nuclearman, Oct 12 '20 at 07:21
I also found a related question where these are reffered to as a [Lehmer Code](https://stackoverflow.com/questions/1506078/fast-permutation-number-permutation-mapping-algorithms#comment19487170_1506337) — tenfour, Oct 12 '20 at 07:23

score 4 · Accepted Answer · answered Oct 11 '20 at 23:19

4

An equivalent hash goes something like this for [ 40, 20, 10, 30 ]

40 is greater than 3 of the subsequent values
20 is greater than 1 of the subsequent values
10 is greater than 0 of the subsequent values
30 has nothing after it, so ignore it.

That is a pair nested loop of time Order(N^2). (Actually about 4*4/2, where there are 4 items.) Just a few lines of code.

Pack 3,1,0, either the way you did it or with anatolyg's slightly tighter:

3 * 3! +
1 * 2! +
0 * 1!

which equal 20. It needs to be stored in the number of bits needed for 4!, namely 5 bits.

I'm pretty sure this is optimal for space. I have not thought of a faster way to compute it other than O(N^2).

How big is your N? For N=100, you need about 520 bits and 5K operations. 5K operations might take several microseconds (for C++), and probably less than a millisecond (even for an interpreted language).

answered Oct 11 '20 at 23:19

Rick James

135,179
13
127
222

Since the OP is considering a hard-coded solution in #3, one possibility to perhaps speed this up to to create a hard coded 2D lookup table (seems much better than a bunch of if-else logic), for `a * b!` where `0 <= a,b <= X` for some `X`. This allows you to just do something like `table[3][3] + table[1][2] + table[0,1]`, saving the factorial/multiplications. – Nuclearman Oct 12 '20 at 06:47
I think this is the right idea. My `N` is small: < 10. So a LUT as @Nuclearman suggests, should make this the best solution. – tenfour Oct 12 '20 at 07:18
Not sure it's faster than just the `O(N^2)` solution (that lookup table may not be faster either if `N` is only 10), but you can also create a lookup table for determining the value of `b` in my comment above. 1) Note which number is last in the list. 2) Sort the numbers (in-place if possible). 3) Loop over the table with `a` being the number and `b` being it's index in the sorted array calculating sum of `table[a][b]`. Skip the value of `a` that was last in the unsorted order. – Nuclearman Oct 12 '20 at 07:48

score 2 · Answer 2 · answered Oct 11 '20 at 18:58

2

Use something like factorial number system. If you concatenate bits, it's as if you multiply the first number by some power of 2, the second number by another power of 2, etc. In factorial number system, you multiply by a more "fair" factor, which is not a power of 2.

To make the result small, calculate it modulo some small prime number, e.g. 127. Or a moderately big prime number. The "mod" actually should not necessarily be a prime number, it just shouldn't have any prime factors smaller than N (the length of your list).

answered Oct 11 '20 at 18:58

anatolyg

26,506
9
60
134

Re modulo: I don't see how this would do anything other than destroy info and fold different orderings together. Can you show how it would apply to the OP's example? – Rick James Oct 11 '20 at 19:51
Usually, when you do hashing, you want to destroy info, so that the output has fixed size, even though input size may be unlimited. I see now that OP wants to prevent collisions, so the "mod" part is unnecessary. – anatolyg Oct 12 '20 at 16:43

score 1 · Answer 3 · edited Oct 12 '20 at 16:34

To calculate the rank of each item in the array

sort two arrays in parallel (or an array of 2-tuples, if your language makes that easy). The first element in the array is the value, the second is the initial index. Keep in mind that you can use radix sort, if these are really integers.
At the end, drop the sorted values, and only look at the indices. This is the array of ranks (well actually, it's the inverse permutation of the array of ranks, but since you're hashing I assume you don't care).

Then, convert the array of ranks into bits. You can look at how to store a "permutation" or use factorials, if you really want to pack it in. Or, you can just use (log n) bits per number, and pack them all together--as long as you don't need to reverse this whole process, that's what I'd recommend. You need about N log(N) bits regardless of how you do this, but you can get that multiplier down to the minimum in practice if you need.

The metaprogramming technique you describe is called a "sorting network", and is used to make sorts much faster when you get down to the smallest groups at the end of recursive sorts.

Can you say more about your actual problem? This sounds likely to be a XY problem.

I know it sounds like an XY problem (though, doesn't almost everything?). It's just hard to explain why it's needed. It's part of a bigger algorithm that uses a huge LUT based on the "importance" (rank) of items. The function I'm asking about is to generate LUT keys. — tenfour, Oct 12 '20 at 07:21
Yeah, these often come up on stackoverflow out of an effort to make the problem simple enough to understand (to yourself or others). It's not a bad thing, but I thought it was worth flagging here. For example, do you need the importance of all items? Maybe you can throw out 90% of the keys. I would suggest elaborating. — Zachary Vance, Oct 12 '20 at 20:38

score 0 · Answer 4 · answered Oct 12 '20 at 01:59

0

Finding n-th permutation without computing others

This is a solution that lets you get the k'th permutation of n elements in a consistent but non-lexicographic ordering in O(n).

answered Oct 12 '20 at 01:59

Dave

7,460
3
26
39

Hashing the arrangement of an array

4 Answers4