I am looking for a way to compress the length of a string so that I can avoid lengthy file names like the one below. The string contains UTF-8 characters as well.

"dt=20200623_isValid=valid_module_name=A&B&C_data_source=internet_part-00001-1234-9d12-1234-123d-1234567890a1.b001.json"

I tried a Huffman coding implementation from GitHub. It reduces the size, but does little for the length of the string.

Size before compression: 944

Size after compression: 569

Compressed string: 01011111001111100011101000111011101011001000111110001101000011011001000110001111010001010111111001010110001111010001010001101101010000101101110001110000000110101011010110100000111111001101011111100111101111110100000010101011011110011000010011001000101110010011101001000001111101001010111110000001001101010000111100001110101001100100111110001011101110111011101001001010011000111110111000101100000101100110000010100110001111101110001010011000111110101001010011000111110111011010111011001101100110110111000011100110100111000111011101110111010011100011101111001100100010101

Please advise how to achieve length compression in Java. (The decompressed file name is needed for further processing.)

Vasanth Subramanian
  • Am I correct in assuming that you have converted your compressed bytes to a string where each digit represents a bit in the compressed data? If so, why does it start with an extra `0` followed by the expected multiple of 8 bits? – VGR Jun 23 '20 at 17:31
  • @VGR, I used the HuffmanCoding.java from the GitHub portal link shared in my question. I tried debugging this class. It iterates through each character in the string. For the first character in the string ("d"), compressed value is 0101 and thereby appends the compressed character at the end for further characters in the string. – Vasanth Subramanian Jun 23 '20 at 17:48

2 Answers

You should try ZLIB/GZ compression. A GZ compression snippet is available in the question "compression and decompression of string data in java".

A ZLIB compression implementation is also fairly easy. You can use the code below as a starting point and improve on it.

For a detailed explanation of the formats, see "How are zlib, gzip and zip related? What do they have in common and how are they different?"

Read about the Deflater strategies before going further: "Java Deflater strategies - DEFAULT_STRATEGY, FILTERED and HUFFMAN_ONLY"

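import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

// Streams originalFileName through a DeflaterOutputStream into compressedFileName.
public void compressFile(String originalFileName, String compressedFileName) {
    try (FileInputStream fileInputStream = new FileInputStream(originalFileName);
         FileOutputStream fileOutputStream = new FileOutputStream(compressedFileName);
         DeflaterOutputStream deflaterOutputStream = new DeflaterOutputStream(fileOutputStream)) {
        // Copy in chunks; a byte-at-a-time loop on unbuffered file streams is slow.
        byte[] buffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = fileInputStream.read(buffer)) != -1) {
            deflaterOutputStream.write(buffer, 0, bytesRead);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}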

You can decompress using Inflater (here through InflaterInputStream).

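import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.InflaterInputStream;

// Reads fileToBeDecompressed through an InflaterInputStream and writes the original bytes to outputFile.
public void decompressFile(String fileToBeDecompressed, String outputFile) {
    try (FileInputStream fileInputStream = new FileInputStream(fileToBeDecompressed);
         FileOutputStream fileOutputStream = new FileOutputStream(outputFile);
         InflaterInputStream inflaterInputStream = new InflaterInputStream(fileInputStream)) {
        // Copy in chunks rather than one byte per call.
        byte[] buffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = inflaterInputStream.read(buffer)) != -1) {
            fileOutputStream.write(buffer, 0, bytesRead);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}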

Refer: http://cr.openjdk.java.net/~iris/se/11/latestSpec/api/java.base/java/util/zip/Deflater.html
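Since the question is about a string rather than a file, the same Deflater/Inflater pair can also be applied in memory. The following is only a minimal sketch: the class name `NameCodec` and its method names are made up for illustration, and the URL-safe Base64 step is an assumption added here so that the result contains only letters, digits, `-` and `_`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of any library.
final class NameCodec {

    // Deflates the UTF-8 bytes of the name and encodes them as URL-safe Base64.
    static String compress(String name) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(name.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return Base64.getUrlEncoder().withoutPadding().encodeToString(out.toByteArray());
    }

    // Reverses compress(): Base64-decodes, inflates, and rebuilds the UTF-8 string.
    static String decompress(String encoded) {
        Inflater inflater = new Inflater();
        inflater.setInput(Base64.getUrlDecoder().decode(encoded));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "dt=20200623_isValid=valid_module_name=A&B&C_data_source=internet";
        String packed = compress(name);
        System.out.println(name.length() + " -> " + packed.length() + " characters");
        System.out.println(decompress(packed).equals(name));
    }
}
```

Measure the result on real names before relying on it. The zlib wrapper adds 6 bytes and Base64 adds roughly a third on top, so for a name of around a hundred characters with little repetition the output can be as long as the input, or longer.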

Govinda Sakhare

Of course using one character per binary digit is going to use up a lot of space. That library is using 16 bits (the size of a char) to represent a single bit, so it is literally making its result 16 times larger than it needs to be.

A far more compact way to represent binary data is by converting it to hexadecimal.

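import java.math.BigInteger;
import java.util.Formatter;

// Parse the string of '0'/'1' digits as a base-2 number and take its bytes.
// Note: toByteArray() returns a two's-complement array, so it can include a leading 0x00 sign byte.
byte[] compressedBytes = new BigInteger(compressedString, 2).toByteArray();

// Write each byte as two uppercase hexadecimal digits.
Formatter formatter = new Formatter();
for (byte b : compressedBytes) {
    formatter.format("%02X", b);
}
String hex = formatter.toString();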

Then the result is 142 characters:

BE7C7477591F1A1B231E8AFCAC7A28DA85B8E0356B41F9AFCF7E8156F30991727483E95F026A1E1D4C9F17777494C7DC582CC14C7DC531F5298FBB5D9B36E1CD38EEEE9C779915

You could even go a step further and Base64-encode it, reducing the result to 96 characters:

String s = Base64.getEncoder().encodeToString(compressedBytes);

Result:

AL58dHdZHxobIx6K/Kx6KNqFuOA1a0H5r89+gVbzCZFydIPpXwJqHh1Mnxd3dJTH3FgswUx9xTH1KY+7XZs24c047u6cd5kV
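Because the decompressed file name is needed later, the encoding has to be reversible bit for bit. Going through `BigInteger` discards leading zero bits (the bit string in the question starts with a `0`), and the standard Base64 alphabet includes `/`, which is not allowed in file names. The sketch below is one way to address both points, under stated assumptions: the class name `BitStringCodec` and the one-byte padding header are invented for illustration and are not part of the Huffman library from the question.

```java
import java.util.Base64;

// Hypothetical helper: packs a string of '0'/'1' characters into URL-safe Base64
// and restores it exactly, including leading zeros.
final class BitStringCodec {

    static String pack(String bits) {
        int byteCount = (bits.length() + 7) / 8;
        byte[] out = new byte[1 + byteCount];
        // Assumed layout: the first byte stores how many padding bits end the last byte (0..7).
        out[0] = (byte) (byteCount * 8 - bits.length());
        for (int i = 0; i < bits.length(); i++) {
            char c = bits.charAt(i);
            if (c != '0' && c != '1') {
                throw new IllegalArgumentException("Not a binary digit at index " + i);
            }
            if (c == '1') {
                out[1 + i / 8] |= 0x80 >> (i % 8); // most significant bit first
            }
        }
        return Base64.getUrlEncoder().withoutPadding().encodeToString(out);
    }

    static String unpack(String encoded) {
        byte[] in = Base64.getUrlDecoder().decode(encoded);
        int bitCount = (in.length - 1) * 8 - in[0];
        StringBuilder bits = new StringBuilder(bitCount);
        for (int i = 0; i < bitCount; i++) {
            boolean set = (in[1 + i / 8] & (0x80 >> (i % 8))) != 0;
            bits.append(set ? '1' : '0');
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        String bits = "0101111100111110";
        String packed = pack(bits);
        System.out.println(packed + " -> " + unpack(packed).equals(bits));
    }
}
```

With this layout a 569-bit string occupies 73 bytes, which is 98 Base64 characters without padding. The unpacked bit string can then be passed back to the Huffman decoder, which still needs the original code table to recover the file name.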

VGR
  • Thanks for your help! I am mainly looking for a reduction in the character count of the string. We are asked to embed the file path as part of the file name. As the file path is too long and there is a restriction on the maximum file name length, I am looking for compression mainly of the character count. Please let me know how to achieve this. – Vasanth Subramanian Jun 24 '20 at 12:22
  • Why can’t you use the string values I’ve provided? – VGR Jun 24 '20 at 13:29