Compress a long string to a shorter one which is unique using Java

Question

The code dynamically creates a file name which sometimes exceed the specified 255 characters length for the file name. So it needs to be cropped into a unique short length string in Java. Below are the restrictions:

The shortened string will be used as a file name
The reverse mechanism of getting back the original file name from the compressed file name should also work
The compression/decompression will be done on-the-fly. ie., they should be done everytime the file is saved or retrieved.

Tried approaches:

Use the UUID class to generate unique identifier. But it fails the 2nd condition above - unable to get the original string from the UUID
Inflater/Deflater, Inflater/Deflater Streams, GZIP Streams - The compressed string contains many junk characters which cannot be used as a file name. Also not sure about the uniqueness of the string

Please suggest the best approach for this problem

Update: The solutions from the question marked as duplicate gives strings which has characters which cannot be used as a file name or bit streams. They do not give solution to the actual problem

I would either compress then convert the string to base-64, although there's not a guarantee of the length, or if you have `a-loo...ooong-filename.png`, hash the filename (or use the UUID as you tried) to `0123456789abcdef.png`, then have a `0123456789abcdef.png.txt` that contains the full name. — Ken Y-N, Feb 13 '18 at 05:59
Suggestions: 1. You can use current date in long format for unique name, 2. You can keep occurrences of each letter in your compressed file, later reading file you can do the reverse to get original string this uses less space — Vighanesh Gursale, Feb 13 '18 at 05:59
Not really a dup because the output symbol set is the same as the input symbol set. Compression works only because the output symbol set is larger than the input, ***and*** the output is binary, i.e. you can use the entire 256 values in a byte without regard to validity in a "filename". Huffman coding might reduce the length but then you'd have to encode the resulting binary string back into characters valid in a filename (i.e. base64), which would probably make the result larger than the original. The requirements are completel unrealistic. — Jim Garrison, Feb 13 '18 at 05:59
My comment above was going to be an answer, since the requirements are unattainable, but the question got put on hold before I could finish typing. — Jim Garrison, Feb 13 '18 at 06:01
The result of Huffman encoding is a bit stream. This must be divided into bytes and then re-encoded in filename-valid characters. I doubt the result will have any savings, and will probably be longer. — Jim Garrison, Feb 13 '18 at 06:10
This is not possible unless you do something like additionally storing "original file name - compressed file name" mapping somehow. — lexicore, Feb 18 '18 at 20:58

score 1 · Answer 1 · answered Feb 18 '18 at 20:47

1

The problem as stated is not solvable. You would need to be able to compress any sequence of symbols from a set to a smaller length sequence of symbols from the same set, and then losslessly decompress it. Since there are more possible longer sequences than shorter sequences, any mapping of all longer sequences to shorter sequences must map at least two longer sequences to a single smaller sequence. Then it is not possible to know which longer sequence it was that made the smaller sequence, so it you cannot losslessly decompress it.

The only way your problem could be solvable is if you somehow constrained the content and therefore the number of possible file names to be less than or equal to n²⁵⁵, where n is the number of different characters allowed in a file name.

answered Feb 18 '18 at 20:47

Mark Adler

101,978
13
118
158

can you explain this part "Since there are more possible longer sequences than shorter sequences, any mapping of all longer sequences to shorter sequences must map at least two longer sequences to a single smaller sequence. " Perhaps with an illustration? It's not obvious why this is the case – dozer Sep 14 '18 at 20:20
1

There are four possible two-bit sequences in what we'll call set A. If I map two-bit sequences to one-bit sequences in set B, for which there are two possible sequences, I will have to map at least two distinct sequences from set A to the same sequence in set B. This remains true if set A is the set of all 1001-bit sequences, and set B is the set of all 1000-bit sequences. There are twice as many sequences in A as in B, so there must be distinct sequences in A that are mapped to the same sequence in B. – Mark Adler Sep 14 '18 at 21:09

Compress a long string to a shorter one which is unique using Java

1 Answers1