
I have a text containing Latin, Cyrillic, and Chinese characters. I'm trying to compress a String (via byte[]) with GZIPOutputStream and decompress it with GZIPInputStream, but I can't convert all characters back to the originals. Some appear as ?.

I thought that UTF-16 would do the job.

Any help?

Regards

Here's my code:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressUncompressStrings {

    public static void main(String[] args) throws UnsupportedEncodingException {

        String sTestString="äöüäöü 长安";
        System.out.println(sTestString);
        byte bcompressed[]=compress(sTestString.getBytes("UTF-16"));
        //byte bcompressed[]=compress(sTestString.getBytes());
        String sDecompressed=decompress(bcompressed);
        System.out.println(sDecompressed);
    }
    public static byte[] compress(byte[] content){
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        try{
            GZIPOutputStream gzipOutputStream = new GZIPOutputStream(byteArrayOutputStream);
            gzipOutputStream.write(content);
            // close() flushes the deflater and writes the GZIP trailer.
            gzipOutputStream.close();
        } catch(IOException e){
            throw new RuntimeException(e);
        }
        return byteArrayOutputStream.toByteArray();
    }
    public static String decompress(byte[] contentBytes){

        try{
            GZIPInputStream gzipInputStream = new GZIPInputStream(new ByteArrayInputStream(contentBytes));
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            // Read the decompressed data byte by byte until end of stream.
            int value;
            while ((value = gzipInputStream.read()) != -1) {
                baos.write(value);
            }
            gzipInputStream.close();
            baos.close();
            // Decode with the same charset that was used for getBytes() above.
            return new String(baos.toByteArray(), "UTF-16");
        } catch(IOException e){
            throw new RuntimeException(e);
        }
    }
}
Audrius Meškauskas
mcflysoft
  • What does that `System.out.println(sTestString);` print? If it also displays junk, then you definitely have a problem with stdout encoding. You'd need to tell us what environment you're using (Windows command prompt? Eclipse IDE? etc.) so that we can tell you how to configure it properly. – BalusC Aug 15 '11 at 14:35

2 Answers


I suspect it's just the console that's having a problem. I tried the above code, and although it didn't print out any of the characters properly, when I tested the round-tripping of the string, it was fine:

System.out.println(sDecompressed.equals(sTestString)); // Prints true

What does that do on your machine?
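
For reference, here's a minimal diagnostic sketch (the compareRoundTrip helper is hypothetical, not part of your code) that you could drop into the CompressUncompressStrings class. It compares the two strings directly, so the result doesn't depend on what the console can render:

    public static void compareRoundTrip(String original, String decompressed) {
        // equals() already covers length and content; printing any mismatch
        // as hex UTF-16 code units shows which character (if any) got mangled.
        System.out.println("equals: " + original.equals(decompressed));
        int len = Math.min(original.length(), decompressed.length());
        for (int i = 0; i < len; i++) {
            char a = original.charAt(i);
            char b = decompressed.charAt(i);
            if (a != b) {
                System.out.printf("mismatch at index %d: U+%04X vs U+%04X%n", i, (int) a, (int) b);
            }
        }
        if (original.length() != decompressed.length()) {
            System.out.println("lengths differ: " + original.length() + " vs " + decompressed.length());
        }
    }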

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194

Displaying non-ASCII characters on console output is not easy. Assuming you're using Windows as your operating system (its command prompt doesn't support Unicode by default), you can change the active code page with the chcp command. I don't know how to do it from code, so I suggest running the program from the command line.

Setting the code page to 65001 tells Windows to use UTF-8 on its console (you can view a discussion here).
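
If you'd rather not rely on the platform default encoding at all, here is a minimal sketch (the Utf8ConsoleDemo class name is just an example) that prints through an explicit UTF-8 PrintStream; the console's code page (chcp 65001) and a font with the right glyphs are still needed for the output to display correctly:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8ConsoleDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap System.out so output is encoded as UTF-8 instead of the
        // platform default encoding.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("äöüäöü 长安");
    }
}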

I hope this helps.

Buhake Sindi
  • Then you'd still need a command console font which supports those characters. – BalusC Aug 15 '11 at 14:41
  • @BalusC, true, if your OS doesn't support the code page 65001. I didn't say it's an easy thing. :) – Buhake Sindi Aug 15 '11 at 14:49
  • Windows definitely supports it. It's just the lack of a command console font that can show all Unicode characters. The best you can get is Lucida Console Unicode, but it doesn't have Chinese glyphs, for example. – BalusC Aug 15 '11 at 14:53