
I am looking for a compression algorithm for integers of 10 digits (e.g. 5289412587). The aim is to reduce the compressed result to a number with the fewest digits possible (e.g. 125). The algorithm must be reversible, so that from the result I can reconstruct the original number. Is there something that is right for me? Thank you.

Noname
  • Does this answer your question? [An efficient compression algorithm for short text strings](https://stackoverflow.com/questions/1138345/an-efficient-compression-algorithm-for-short-text-strings) – Federico klez Culloca Sep 09 '22 at 18:24
  • By the way, out of curiosity: you've been asking these kinds of questions for close to two weeks. Are you doing some weird exercise, asking out of intellectual curiosity, or are you trying to do something practical? Because if it's the latter, telling us what you're trying to do may result in better answers and suggestions. – Federico klez Culloca Sep 09 '22 at 18:26
  • I am looking for a way, inside a Java program, to compress and decompress a number. I cannot rely on other languages to compress the string. I thought there was some algorithm applicable in Java that, given a number as input, returns a shorter, reversible output to be used in place of the larger number, while still knowing which larger number it represents. – Noname Sep 09 '22 at 19:14
  • Thanks for the curiosity. I'm creating a Java program that analyzes images over time and defines aspects of the past, according to some mathematical functions and other similar images. I'm using a lot of variables that act as image containers. It is something weird but very useful for reconstructing past images. – Noname Sep 09 '22 at 19:16
  • If you have an integer value `n` such that 0 <= n <= m, `n` can be any value between 0 and m, and the values of `n` are evenly distributed, it isn't possible to have a lossless reversible compression algorithm for `n`. – Old Dog Programmer Sep 09 '22 at 19:35
  • Regarding my previous comment, there's another requirement: For what I wrote to be true, the values of `n` should be independent of each other. I'll update my answer. – Old Dog Programmer Sep 10 '22 at 10:42

2 Answers


You can't compress all 10-digit numbers to something smaller, and be able to decompress them back to the originals. This should be obvious. If you compressed them, say, to 9-digit numbers, then when you decompressed all of those 9-digit numbers, of which there are 10^9, you could only get 1/10th of the original numbers, of which there are 10^10. 90% of the original numbers would not be reproduced.
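A toy illustration of the same counting argument, scaled down from 10-digit to 2-digit numbers (the modulo "compressor" below is just a stand-in for any mapping you might pick):

```java
import java.util.*;

// Pigeonhole demo: any function from 2-digit numbers (100 inputs) to
// 1-digit numbers (10 possible outputs) must collide, so no decompressor
// can recover every original. The same counting applies to 10 vs. 9 digits.
public class Pigeonhole {
    public static void main(String[] args) {
        Set<Integer> outputs = new HashSet<>();
        for (int n = 0; n < 100; n++)
            outputs.add(n % 10);   // stand-in for any "compressor"
        System.out.println("100 inputs -> " + outputs.size() + " distinct outputs");
        // At most 10 of the 100 inputs can survive a round trip; 90% are lost.
    }
}
```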

Mark Adler
  • Not even by building libraries that use other libraries to "encrypt" the digits with other, smaller digits? – Noname Sep 10 '22 at 10:17
  • Encryption is not compression. What exactly is a "smaller" digit? – Mark Adler Sep 10 '22 at 22:09
  • Encryption or compression, it doesn't change much for me; I just have to find a way to reduce a number like 5478215981 to a smaller number like 524, knowing that from 524 I can reconstruct 5478215981. I need to use two of these large numbers in the Cantor pairing function, and the result is too big for me to handle; I need to work with smaller numbers, knowing that they stand for the larger ones. 524 -> 5478215981, that's my problem. – Noname Sep 10 '22 at 23:39
  • "smaller digits" means that i need to reduce the number lenght from a max of 10 chars, to a max of 3 chars. But in the same time, i need a link between the smaller number to the bigger. – Noname Sep 10 '22 at 23:41
  • Ok, by "smaller digits", you mean a smaller _number_ of digits. Do you mean 3 **characters** or 3 **digits**? There's a big difference. – Mark Adler Sep 11 '22 at 00:12
  • As I said, it is impossible to map _all_ of the 10-digit numbers uniquely to 9-digit numbers. Let alone 3-digit numbers. – Mark Adler Sep 11 '22 at 00:14
  • One way to compress an integer represented as a `String` or other character sequence is to convert it to binary (see the sketch below). – Old Dog Programmer Sep 13 '22 at 18:40
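A minimal Java sketch of that last comment's idea (class and variable names are mine): converting the decimal string to binary shrinks the storage to about 5 bytes, though nowhere near the 3 digits asked for:

```java
import java.math.BigInteger;

// Pack a 10-digit decimal number into binary and back.
// 10 decimal digits need at most 34 bits (2^34 > 10^10), so roughly
// 5 bytes instead of 10 characters -- smaller, but never 3 digits.
public class PackDemo {
    public static void main(String[] args) {
        String decimal = "5289412587";
        byte[] packed = new BigInteger(decimal).toByteArray(); // 5 bytes here
        String restored = new BigInteger(packed).toString();   // round trip
        System.out.println(packed.length + " bytes -> " + restored);
    }
}
```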

Version 2: Added the possibility of compression based on knowing values are related.

No, and maybe.

Suppose the values you want to compress are represented by the variable n, and 0 <= n <= m. If the values of n are evenly distributed over the range, all values in the range are possible, and each value is independent of the others, then no lossless compression algorithm can shorten them.

However, it might be that the possible values of n in your data are not evenly distributed. That is, some values are common, but others are rare, and maybe some don't occur at all. If that is the case, there is a possibility you could use Huffman coding or something similar.

Huffman coding represents values using a variable number of bits: the more common values are represented by fewer bits, the less common values by more bits.

One way to use Huffman coding is to sample your data and create a "standard" table for use in all your coding and decoding of n; this can reduce processing time. If you don't create a standard table, you build one each time you compress: run through your data, count how many times each value of n occurs, and sort by number of occurrences. This requires more processing, but the table will be fine-tuned to that particular run. It also requires that the resulting table be carried along with the compressed data.
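For illustration, here is a minimal, hypothetical Java sketch of building such a table from a sample of values (not a full codec; bit packing and carrying the table with the data are omitted):

```java
import java.util.*;

// Build a Huffman code table from value frequencies (sketch only).
public class HuffmanSketch {

    // A leaf holds a value; an internal node holds two children.
    record Node(int value, long freq, Node left, Node right) {}

    static Map<Integer, String> buildCodes(int[] data) {
        Map<Integer, Long> freq = new HashMap<>();
        for (int v : data) freq.merge(v, 1L, Long::sum);      // count occurrences

        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingLong(Node::freq));
        freq.forEach((v, f) -> pq.add(new Node(v, f, null, null)));

        while (pq.size() > 1) {                               // merge the two rarest
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(-1, a.freq() + b.freq(), a, b));
        }

        Map<Integer, String> codes = new HashMap<>();
        assign(pq.poll(), "", codes);
        return codes;
    }

    static void assign(Node n, String prefix, Map<Integer, String> codes) {
        if (n.left() == null)                                 // leaf: emit its code
            codes.put(n.value(), prefix.isEmpty() ? "0" : prefix);
        else {
            assign(n.left(), prefix + "0", codes);
            assign(n.right(), prefix + "1", codes);
        }
    }

    public static void main(String[] args) {
        int[] sample = {7, 7, 7, 7, 7, 3, 3, 9};              // skewed distribution
        buildCodes(sample).forEach((v, code) ->
                System.out.println("value " + v + " -> " + code));
        // The common value 7 gets the shortest code.
    }
}
```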

But, if your data doesn't have values that are more common than other values, Huffman coding won't be useful.

Another, and in your case perhaps more likely, possibility is compression based on knowledge of how the values are interrelated. If there are circumstances that allow you to predict one value given information about another value, it might be possible to use that for data compression. I can best explain that with two examples.

About 11 years ago, I was looking at the Topcoder (topcoder.com) website. There was a contest there to compress data from DNA sequencing. The winning algorithms did this: for each block of data, create a new block with the same length (number of elements). The first value in the new block is a copy of the first value in the original block, but each subsequent value `v[n]` in the new block is the difference between two adjacent values in the original block: new `v[n]` = original `v[n]` - original `v[n-1]`. The set of new blocks is then compressed using Huffman coding.

That worked because the range of differences between successive values was much smaller than the range of an individual value, and some differences were a lot more common than others.
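A minimal sketch of that delta transform in Java (method names are mine), showing that it is exactly reversible:

```java
import java.util.Arrays;

// Delta transform: keep the first value, store later values as differences.
public class DeltaSketch {

    static int[] encode(int[] block) {
        int[] out = new int[block.length];
        out[0] = block[0];
        for (int i = 1; i < block.length; i++)
            out[i] = block[i] - block[i - 1];   // small when neighbors are close
        return out;
    }

    static int[] decode(int[] deltas) {
        int[] out = new int[deltas.length];
        out[0] = deltas[0];
        for (int i = 1; i < deltas.length; i++)
            out[i] = out[i - 1] + deltas[i];    // running sum restores the block
        return out;
    }

    public static void main(String[] args) {
        int[] block = {1000, 1002, 1001, 1005, 1004};
        int[] deltas = encode(block);           // {1000, 2, -1, 4, -1}
        System.out.println(Arrays.toString(decode(deltas)));  // original block
    }
}
```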

Another example was from medical image processing. An algorithm looked at images from MRI or CT scans for closed loops. When it found a closed loop, it recorded the position of each pixel in the loop, in order. The data was sent to another lab for analysis.

A block of compressed data represented one loop. A block began with a copy of the location of the first pixel. The location of each subsequent pixel was coded relative to the previous pixel in the sequence. Because a loop is continuous, each pixel in the loop had to be orthogonally or diagonally adjacent to the previous one, which allowed relative locations to be coded using 3 bits. For one loop, the compression/decompression algorithm ran until the calculated position of a pixel was the same as the starting pixel. Then the next block would be processed.
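A hypothetical Java sketch of that relative coding (the direction table and names are mine): each step to an adjacent pixel becomes an index 0..7, which fits in 3 bits:

```java
// Encode a pixel path as a start point plus one 3-bit direction per step.
public class LoopSketch {

    // The 8 orthogonal and diagonal neighbor offsets; each index fits in 3 bits.
    static final int[][] DIRS = {
        {0, 1}, {1, 1}, {1, 0}, {1, -1}, {0, -1}, {-1, -1}, {-1, 0}, {-1, 1}
    };

    static int[] encode(int[][] pixels) {
        int[] codes = new int[pixels.length - 1];
        for (int i = 1; i < pixels.length; i++) {
            int dx = pixels[i][0] - pixels[i - 1][0];
            int dy = pixels[i][1] - pixels[i - 1][1];
            for (int d = 0; d < DIRS.length; d++)
                if (DIRS[d][0] == dx && DIRS[d][1] == dy)
                    codes[i - 1] = d;           // 3 bits instead of two coordinates
        }
        return codes;                           // a real codec would bit-pack these
    }

    public static void main(String[] args) {
        int[][] loop = {{5, 5}, {6, 5}, {6, 6}, {5, 6}, {5, 5}}; // tiny square
        for (int c : encode(loop)) System.out.print(c + " ");    // 2 0 6 4
    }
}
```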

Old Dog Programmer
  • Huffman encoding is (or at least was) common for compressing files that are mostly text. – Old Dog Programmer Sep 10 '22 at 11:21
  • A lecture on Huffman coding had a proof that no compression algorithm could do better. But the proof relied on certain assumptions. One was that the values were independent of each other. In compressing text, some letter combinations occur more often than others. Within a single syllable, for example, a consonant is likely to be followed by a vowel. Knowing that can be used to achieve higher compression than Huffman coding. – Old Dog Programmer Sep 10 '22 at 15:28
  • Very interesting, the idea of the blocks for the DNA, but I did not understand how to apply this logic to my case: how can I reduce a number like 548712548 to a number like 547, while still being able to reconstruct 548712548 from 547? – Noname Sep 10 '22 at 23:44
  • I also saw on the internet that someone treats the number to "shrink" as an array of pairs of numbers: "123456" -> "12" - "34" - "56", then tries all the possible seeds of a "rand" function that produce that result, and uses the seed to reconstruct the original number. Might that work? – Noname Sep 10 '22 at 23:47
  • I have no idea what the numbers you are dealing with are. It may be that lossless compression is impossible for them. Both examples relied on the fact that the values were not independent: knowing the value of `v[n]` narrowed the possible values of `v[n+1]`. – Old Dog Programmer Sep 11 '22 at 02:06
  • In your "123456" -> "12" - "34" - "56" example, if the numbers are stored as `int`, 123456 would be the compressed value of 3 numbers, 12, 34, and 56. – Old Dog Programmer Sep 11 '22 at 14:48
  • That will not work: you would need more bytes to store the seed than to store the number itself. – TTho Einthausend Sep 12 '22 at 21:40
  • I thought of an example similar to the DNA one: Suppose I have a map as a 2D array. Each cell contains an elevation: distance above (or below) sea level. If resolution is high, in most cases, the difference between adjacent cells will likely be a few feet. (A large difference between adjacent pixels indicates a cliff.) – Old Dog Programmer Sep 13 '22 at 20:57