
Let A be the sequence of integers [0, N), for example, for N=5: 0, 1, 2, 3, 4. Now let's randomly shuffle it into S (for N=5, it could be: 4, 1, 3, 2, 0). Is there any smart way to compress S (a randomized array that contains each integer from [0, N) exactly once)?

The best thing I could come up with doesn't sound like an optimal solution: maintain a list of "unused" integers (initialized with [0, N)) and output each sequence item as an index into that list, using the minimum number of bits (since unused keeps shrinking, the number of bits required to output each index decreases as well). Something like this:

unused = [0, N)
for x in S:
   k = index of x in unused
   nbits = ceil(log2(len(unused)))  # 0 bits once only one value remains
   output k as an nbits-bit integer
   remove x from unused
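A minimal runnable sketch of that scheme in Python (the names encode_perm/decode_perm and the bit-string output are purely illustrative), including a decoder to show the process is reversible:

    from math import ceil, log2

    def encode_perm(S, N):
        # Encode a permutation S of 0..N-1 as a bit string.
        unused, bits = list(range(N)), ""
        for x in S:
            k = unused.index(x)
            nbits = ceil(log2(len(unused))) if len(unused) > 1 else 0
            if nbits:
                bits += format(k, "0{}b".format(nbits))
            unused.remove(x)
        return bits

    def decode_perm(bits, N):
        # Inverse of encode_perm: rebuild the permutation from the bit string.
        unused, S, pos = list(range(N)), [], 0
        for _ in range(N):
            nbits = ceil(log2(len(unused))) if len(unused) > 1 else 0
            k = int(bits[pos:pos + nbits], 2) if nbits else 0
            pos += nbits
            S.append(unused.pop(k))
        return S

    S = [4, 1, 3, 2, 0]
    bits = encode_perm(S, 5)
    assert decode_perm(bits, 5) == S
    print(len(bits))  # 3 + 2 + 2 + 1 + 0 = 8 bits for N = 5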
wonder.mice
  • Maybe http://stackoverflow.com/questions/1506078/fast-permutation-number-permutation-mapping-algorithms helps. – ymonad Oct 01 '15 at 02:40
  • @ymonad Thanks, that was helpful. I especially liked the swap trick from Antoine Comeau that allows doing it in O(n). It will not be efficient for larger N though, because the math requires more than 64 bits. However, I later realized that I could use an arithmetic codec with a model that shrinks (once a symbol has been used, its probability goes to 0) to get similar results. – wonder.mice Oct 01 '15 at 17:59

1 Answer


In your example, there are 5! = 120 possible ways to shuffle the sequence. The best you can do, assuming that all shuffles are equally likely, is to code each shuffle as an integer in 0..119.

You can see the answer to the question linked in the comment by @ymonad for ways to do that.
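For example, a straightforward O(n²) version of one such mapping, the Lehmer-code / factorial-base idea, might look like the sketch below (function names are illustrative; Python's arbitrary-precision integers sidestep the 64-bit limit mentioned in the comments, this is not the O(n) swap trick):

    def perm_to_int(S):
        # Map a permutation of 0..N-1 to an integer in 0..N!-1 (Lehmer code).
        unused, code = sorted(S), 0
        for x in S:
            k = unused.index(x)
            code = code * len(unused) + k    # mixed-radix (factorial base) digit
            unused.remove(x)
        return code

    def int_to_perm(code, N):
        # Inverse mapping: rebuild the permutation from its integer rank.
        digits = []
        for radix in range(1, N + 1):        # recover digits, least significant first
            digits.append(code % radix)
            code //= radix
        unused = list(range(N))
        return [unused.pop(k) for k in reversed(digits)]

    S = [4, 1, 3, 2, 0]
    assert int_to_perm(perm_to_int(S), 5) == S   # rank of this shuffle is 107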

Let's take a larger example to see how well you can do. Say we have the integers 0..255 randomly shuffled. The most straightforward way to store that, and the form you are likely to need it in to be able to use it, is a sequence of 256 bytes. Representing an integer in 0..256!-1 instead takes 1684 bits, or 210.5 bytes. So the very best you can do is knock off about 18% of the bits from the straightforward representation. Longer shuffles give you less compression: since log2(n!) ≈ n log2(n) - n log2(e), the ratio of the compressed size to the roughly log2(n) bits per item of the straightforward representation approaches 1 - 1/ln(n), where n is the number of unique items shuffled.
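Those numbers are easy to reproduce, for instance with Python's exact big-integer factorial:

    from math import ceil, factorial, log, log2

    n = 256
    bits = ceil(log2(factorial(n)))   # information content of one shuffle
    print(bits, bits / 8)             # -> 1684 bits, 210.5 bytes
    print(1 - bits / (8 * n))         # -> ~0.18, about 18% saved vs. 256 raw bytes
    print(1 - 1 / log(n))             # -> ~0.82, the limiting compression ratio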

So your mileage will be limited. You might consider not bothering to try to compress it.

Mark Adler
  • Yeah, I already realized that too, but thanks for confirming; I was a bit in doubt whether my calculations were correct. This [plot](http://www.wolframalpha.com/input/?i=plot+2*n+%2F+%28log%282%2C+n%21%29+%2F+8%29%2C+n+from+16+to+128*128%2C+y+from+1+to+3) shows the compression ratio for a sequence of `short`s (2 bytes each) for N from 16 to 16384. So ~1.5 is the best I can expect on average. Definitely not worth the effort. – wonder.mice Oct 01 '15 at 17:52
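For reference, the ratio in that plot (raw 2-byte `short`s over the information-theoretic minimum) can be checked directly with a small Python snippet:

    from math import factorial, log2

    for n in (16, 256, 4096, 16384):
        raw_bytes = 2 * n                       # sequence stored as 2-byte shorts
        min_bytes = log2(factorial(n)) / 8      # information-theoretic minimum
        print(n, round(raw_bytes / min_bytes, 2))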