
I am using GZIPOutputStream in my Java program to compress big strings and finally store them in a database.

I can see that while compressing English text, I am achieving a compression ratio of roughly 1/4 to 1/10 (depending on the string value). So, for example, if my original English text is 100 KB, the compressed text will on average be somewhere around 30 KB.

But when I compress Unicode text, the compressed string actually occupies more bytes than the original string. For example, if my original Unicode string is 100 KB, the compressed version comes out to around 200 KB.

Unicode string example: "嗨,这是,短信计数测试持续for.Hi这是短"

Can anyone suggest how I can achieve compression for Unicode text as well, and why the compressed version is actually bigger than the original?

My compression code in Java:

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    GZIPOutputStream zos = new GZIPOutputStream(baos);

    // Encode the text as UTF-8 and compress it
    zos.write(text.getBytes("UTF-8"));
    zos.finish();
    zos.flush();

    byte[] udpBuffer = baos.toByteArray();
Arry
  • Actually the issue is not with Unicode text. The problem is that compression doesn't work as expected when the text is short (in my case the text was around 100 bytes) – Arry Apr 11 '14 at 13:51
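
That matches how GZIP behaves: it always writes a fixed header and trailer (roughly 18 bytes together) plus some Deflate block overhead, so very short inputs come out larger regardless of their content. A small self-contained sketch of the comparison (the test strings are only examples):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    public class GzipOverheadDemo {

        // Compress a string's UTF-8 bytes with GZIP and return the compressed bytes
        static byte[] gzip(String text) throws IOException {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
                zos.write(text.getBytes("UTF-8"));
            }
            return baos.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            // Short mixed-script text from the question: fixed overhead dominates
            String shortText = "嗨,这是,短信计数测试持续for.Hi这是短";

            // Longer, repetitive English text: compression pays off
            StringBuilder longText = new StringBuilder();
            for (int i = 0; i < 1000; i++) {
                longText.append("this sentence repeats over and over again. ");
            }

            System.out.println("short: " + shortText.getBytes("UTF-8").length
                    + " bytes -> " + gzip(shortText).length + " bytes gzipped");
            System.out.println("long:  " + longText.toString().getBytes("UTF-8").length
                    + " bytes -> " + gzip(longText.toString()).length + " bytes gzipped");
        }
    }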

2 Answers


Java's GZIPOutputStream uses the Deflate compression algorithm to compress data. Deflate is a combination of LZ77 and Huffman coding. According to Unicode's Compression FAQ:

Q: What's wrong with using standard compression algorithms such as Huffman coding or patent-free variants of LZW?

A: SCSU bridges the gap between an 8-bit based LZW and a 16-bit encoded Unicode text, by removing the extra redundancy that is part of the encoding (sequences of every other byte being the same) and not a redundancy in the content. The output of SCSU should be sent to LZW for block compression where that is desired.

To get the same effect with one of the popular general purpose algorithms, like Huffman or any of the variants of Lempel-Ziv compression, it would have to be retargeted to 16-bit, losing effectiveness due to the larger alphabet size. It's relatively easy to work out the math for the Huffman case to show how many extra bits the compressed text would need just because the alphabet was larger. Similar effects exist for LZW. For a detailed discussion of general text compression issues see the book Text Compression by Bell, Cleary and Witten (Prentice Hall 1990).

I was able to find this set of Java classes for SCSU compression on the Unicode website, which may be useful to you. However, I couldn't find a .jar library that you could easily import into your project, though you can probably package the classes into one yourself if you like.
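
If you do pull those classes into your project, the idea from the FAQ is to run the text through SCSU first and only then through GZIP. A rough sketch of that pipeline is below; the ScsuEncoder interface is just a stand-in for whatever API the Unicode sample classes actually expose, so treat it as an assumption rather than a real library type.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    public class ScsuThenGzip {

        // Placeholder for the Unicode sample code's SCSU compressor; the real class
        // and method names depend on how you package those sources, so treat this
        // interface as an assumption, not an actual API.
        interface ScsuEncoder {
            byte[] encode(String text) throws IOException;
        }

        // Step 1: SCSU strips the encoding redundancy of 16-bit Unicode text.
        // Step 2: GZIP (Deflate) then compresses whatever redundancy is left in the content.
        public static byte[] compress(String text, ScsuEncoder scsu) throws IOException {
            byte[] scsuBytes = scsu.encode(text);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
                zos.write(scsuBytes);
            }
            return baos.toByteArray();
        }
    }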

JonK

I don't really know Chinese, but as far as I know GZIP compression depends on repeating sequences of text, and those repeating sequences are replaced with "descriptions" (this is a very high-level explanation). This means that if you have the word "library" in 20 places in a string, the algorithm will store the word "library" once on the side and then note that it should appear at positions x, y, z... So, if your original string does not contain a lot of redundancy, you cannot save a lot; instead, you end up with more overhead than savings.

I'm not really a compression expert, and I don't know the details, but this is the basic principle of the compression.
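
To see the effect in isolation, here is a small sketch (the sample strings are just illustrations) that uses java.util.zip.Deflater directly, without the GZIP header and trailer: a string made of one repeated word shrinks dramatically, while the short mixed-script string from the question has almost nothing for the algorithm to reference and stays about the same size or even grows slightly.

    import java.io.UnsupportedEncodingException;
    import java.util.zip.Deflater;

    public class RedundancyDemo {

        // Raw Deflate (no zlib/GZIP wrapper) so header overhead doesn't skew the numbers
        static int deflatedSize(byte[] input) {
            Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[1024];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            deflater.end();
            return total;
        }

        public static void main(String[] args) throws UnsupportedEncodingException {
            // One word repeated many times: plenty of repeating sequences to reference
            StringBuilder repetitive = new StringBuilder();
            for (int i = 0; i < 1000; i++) {
                repetitive.append("library ");
            }
            // The short mixed-script string from the question: almost no repetition
            String distinct = "嗨,这是,短信计数测试持续for.Hi这是短";

            byte[] a = repetitive.toString().getBytes("UTF-8");
            byte[] b = distinct.getBytes("UTF-8");

            System.out.println("repetitive: " + a.length + " bytes -> " + deflatedSize(a));
            System.out.println("distinct:   " + b.length + " bytes -> " + deflatedSize(b));
        }
    }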

P.S. This question might just be a duplicate of: Why gzip compressed buffer size is greater then uncompressed buffer?

Aleksandar Stojadinovic