
I'm trying to compress a sequence of non-negative numbers, where:

  • The value range of each number is from 0 to 2^N-1
  • Each number appears exactly once (which means there are 2^N numbers in total)

    Example for N = 4:

    [14, 1, 8, 2, 12, 6, 0, 10, 4, 13, 5, 7, 15, 9, 3, 11]

So normally each number will cost 4 bits, and for 16 numbers we will have to use 16 x 4 = 64 bits to store them.

Currently I have just thought of compressing them as below:

  • For the first 8 numbers --> use 4 bits to store each of them.
  • For the next 4 numbers --> only 3 bits each.
  • For the next 2 numbers --> only 2 bits each.
  • For the next number --> only 1 bit for it.
  • The last one actually doesn't need to be stored at all (obviously we know what the last number is once we know all the other 15 numbers).

So the compressed data size will be:

Z = 8 * 4 + 4 * 3 + 2 * 2 + 1 * 1 + 1 * 0 = 49 bits 

The compression ratio is about 76%, which is pretty good (I think).
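To make the idea concrete, here is a minimal C++ sketch (not a real implementation; it stores each number as its rank among the not-yet-used values, which is one way to make the shrinking bit widths decodable, and names like `encode_permutation` and `put_bits` are made up for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// ceil(log2(remaining)): bits needed to address one of `remaining` choices.
static int bits_needed(std::size_t remaining) {
    int b = 0;
    while ((std::size_t{1} << b) < remaining) ++b;
    return b;
}

// Append the low `bits` bits of `value` to `out`, most significant bit first.
static void put_bits(std::vector<bool>& out, std::size_t value, int bits) {
    for (int b = bits - 1; b >= 0; --b)
        out.push_back((value >> b) & 1u);
}

// Encode each element as its rank among the values not yet used.
std::vector<bool> encode_permutation(const std::vector<unsigned>& perm) {
    std::vector<unsigned> pool(perm.size());            // values 0 .. 2^N-1
    for (std::size_t i = 0; i < pool.size(); ++i) pool[i] = i;

    std::vector<bool> out;
    for (unsigned v : perm) {
        auto it = std::find(pool.begin(), pool.end(), v);
        put_bits(out, static_cast<std::size_t>(it - pool.begin()),
                 bits_needed(pool.size()));
        pool.erase(it);           // one fewer candidate for the next position
    }
    return out;
}

int main() {
    std::vector<unsigned> perm{14,1,8,2,12,6,0,10,4,13,5,7,15,9,3,11};
    std::cout << encode_permutation(perm).size() << " bits\n";  // prints 49
}
```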

But for larger values of N, the ratio seems to decrease (for 2^N = 2048 numbers, i.e. N = 11, the ratio is only about 91%).

So I would like to hear your suggestions for better compression.

Thank you.

UenX
  • As you've tagged your question [tag:c++], would you mind showing a small sample of your implementation approach? It might be relevant for the worse compression ratios. – πάντα ῥεῖ May 08 '15 at 17:45
  • Does preserving original order matter? – jdphenix May 08 '15 at 17:46
  • @jdphenix: yes, it is very important. – UenX May 08 '15 at 17:47
  • @jdphenix the original order seems to be the only information being compressed here. – Drew Dormann May 08 '15 at 17:47
  • @jdphenix: I think that's exactly what matters. Each number only appears exactly once. – Eric J. May 08 '15 at 17:47
  • Since you use every possible number once, I think the most entropy would be in something **similar to** storing a single number that describes the number of calls to [`std::next_permutation`](http://en.cppreference.com/w/cpp/algorithm/next_permutation) that would transform a sorted list to your result. – Drew Dormann May 08 '15 at 17:49
  • http://math.stackexchange.com/questions/1089078/encode-order-of-playing-cards-data-compression – Eric J. May 08 '15 at 17:50
  • @πάντα ῥεῖ: Thank you. Actually I have not made any implementation yet. But my idea is very simple, so the compressed size can always be calculated, and it will not be impacted by the implementation approach. – UenX May 08 '15 at 17:53
  • You could improve compression using [phased-in codes](http://www.davidsalomon.name/VLCadvertis/phasedin.pdf). – Evgeny Kluev May 08 '15 at 17:57
  • This doesn't make good use of "fractional bits"; you could use the same thing but with range coding to improve it. – harold May 08 '15 at 18:07
  • @harold Very correct – when there are only 3 items remaining, each has a 33% chance of being the next number. 1/3 doesn't translate perfectly to a whole bit count, and therefore range coding will save you some space over a Huffman-like technique. – Michael Dorgan May 08 '15 at 18:10
  • Going from the 49 bits where you are now to 45 bits seems like, well, quite a bit of work. Even at 8 bits, where the below formulas give a best case of 1684 bits, your solution will give 1792 bits – only ~10% off the maximum. Considering that you would need to code an arithmetic encoder to get closer to the maximum, perhaps you should stick with what you have. – Michael Dorgan May 08 '15 at 18:40
  • You just have a permutation, and you need to encode the permutation's number. See the detailed answer here: http://stackoverflow.com/questions/1506078/fast-permutation-number-permutation-mapping-algorithms – olegarch May 08 '15 at 23:37

3 Answers


As is pointed out in the comments, the optimal encoding (if all permutations are equally probable) is to replace the entire permutation with its index in the enumeration of permutations. Since there are n! possible permutations, the index requires log2(n!) bits, and therefore the compression ratio from the naive encoding using log2(n) bits for each element is (log n!)/(n log n).

Using Stirling's approximation, we can rewrite that as (n log n - n + O(log n))/(n log n), which is 1 - 1/(log n) + O(1/n) which evidently asymptotically approaches 1 as n grows. So it is inevitable that the compression ratio will decrease for larger n.

It is not possible to achieve better compression unless not all permutations are equally probable (and you have some information about the probability distribution).
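To see the trend numerically, here is a small C++ sketch (my own illustration, using `lgamma` to compute log(n!)):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    for (int N = 4; N <= 16; N += 4) {
        double n = std::ldexp(1.0, N);                       // n = 2^N
        double bits = std::lgamma(n + 1.0) / std::log(2.0);  // log2(n!)
        // n * N equals n * log2(n), the size of the naive encoding.
        std::printf("N = %2d: %.0f bits, ratio %.3f\n",
                    N, std::ceil(bits), bits / (n * N));
    }
}
```

For N = 4 this prints 45 bits and a ratio of about 0.69, and the ratio climbs toward 1 as N grows, in line with the Stirling estimate above.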

rici

For this specific problem, the most efficient encoding is to view the permutation of [0 .. 2^N-1] as a numeral in the factorial number system and store the Lehmer code for that permutation.

This gives a requirement of ceil(log2((2^N)!)) bits. For N = 4, this uses 45 bits (70.3%); for N = 11 (2^N = 2048), 19581 bits (86.9%).
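A rough C++ sketch of computing the Lehmer code digits (illustrative only; packing the digits into a single ceil(log2((2^N)!))-bit integer additionally needs big-integer arithmetic, which is omitted here):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Lehmer code: digit i counts the later elements smaller than perm[i];
// it is the i-th digit of the permutation in the factorial number system.
std::vector<unsigned> lehmer_code(const std::vector<unsigned>& perm) {
    std::vector<unsigned> code(perm.size());   // zero-initialized digits
    for (std::size_t i = 0; i < perm.size(); ++i)
        for (std::size_t j = i + 1; j < perm.size(); ++j)
            if (perm[j] < perm[i]) ++code[i];
    return code;
}

int main() {
    std::vector<unsigned> perm{14,1,8,2,12,6,0,10,4,13,5,7,15,9,3,11};
    for (unsigned d : lehmer_code(perm))       // digit i is in base n - i
        std::cout << d << ' ';
    std::cout << '\n';
}
```

The O(n^2) inner loop is for clarity; a Fenwick tree brings the digit computation down to O(n log n).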

The compression ratio worsens as N increases: using the simple approximation log(x!) >= x log x - x + 1 (natural logarithms), we obtain a lower bound for log2((2^N)!) / (N * 2^N) of 1 - ((2^N - 1)/2^N) * (1/(N * ln 2)), which approaches 1 as N tends to infinity.

Given this absolute bound on compression ratio, any approach you can find which is reasonably efficient is worth going for; for values as small as N = 15 it's impossible to do better than 90%.

ecatmur

Uncompressed, you are using N * 2^N bits.

Basically what you have is a permutation of the numbers. Each permutation is unique, and for each permutation you can calculate a unique identifier. Since there are (2^N)! permutations, you will only need ceil(log2((2^N)!)) bits. For your example, this is 45 bits.
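A sketch of that identifier in C++ (my illustration: the lexicographic rank of the permutation; a 64-bit integer only holds it while n! fits, i.e. n <= 20, so this is a toy for the N = 4 example):

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Lexicographic index of `perm` among all permutations of 0..n-1.
// Only valid while n! fits in 64 bits (n <= 20).
std::uint64_t permutation_rank(const std::vector<unsigned>& perm) {
    const std::size_t n = perm.size();
    std::vector<std::uint64_t> fact(n, 1);
    for (std::size_t i = 1; i < n; ++i) fact[i] = fact[i - 1] * i;

    std::uint64_t rank = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t smaller = 0;               // Lehmer digit for position i
        for (std::size_t j = i + 1; j < n; ++j)
            if (perm[j] < perm[i]) ++smaller;
        rank += smaller * fact[n - 1 - i];
    }
    return rank;                                 // needs ceil(log2(n!)) bits
}

int main() {
    std::vector<unsigned> perm{14,1,8,2,12,6,0,10,4,13,5,7,15,9,3,11};
    std::cout << permutation_rank(perm) << '\n'; // at most 45 bits for n = 16
}
```

For larger N you would store the rank with a big-integer type instead of `std::uint64_t`; the bit count ceil(log2((2^N)!)) stays the same.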

Karoly Horvath