Java: Converting UTF 8 to String

Question

When I run the following program:

public static void main(String args[]) throws Exception
{
    byte str[] = {(byte)0xEC, (byte)0x96, (byte)0xB4};
    String s = new String(str, "UTF-8");
}

on Linux and inspect the value of s in jdb, I correctly get:

s = "어"

on Windows, I incorrectly get:

s = "?"

My byte sequence is the valid UTF-8 encoding of a Korean character. Why would it produce two very different results?

The Windows command prompt cannot display UTF-8 characters unless you change the code page using `chcp`, and you also need a font that can display those characters. — , Oct 02 '12 at 21:21
Related http://stackoverflow.com/questions/8616915/java-console-charset-translation — leonbloy, Oct 02 '12 at 21:23

score 3 · Accepted Answer · answered Oct 02 '12 at 21:22

It correctly prints "어" on my computer (Ubuntu Linux), as described in Code Table Korean Hangul. The Windows command prompt is known to have encoding issues, so don't worry about its output.

Your code is fine.

Tomasz Nurkiewicz

My mistake. The Korean characters were displaying properly in my Emacs text buffer, so I naturally assumed that they would also display properly in the Emacs shell buffer, which, as folks pointed out, they do not. – kujawk Oct 02 '12 at 21:34

score 1 · Answer 2 · answered Oct 02 '12 at 21:20

It gives 어 for me. This means your console is probably not configured to display UTF-8, so this is a printing/display problem rather than a problem with how the string is represented.
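
To separate representation from display without relying on any console, here is a minimal sketch (the class and method names are made up for illustration). It decodes the bytes from the question, checks that the string holds a single char, and then re-encodes it as US-ASCII, a charset with no Hangul, to show where a lone "?" can come from. Whether jdb on Windows does exactly this internally is an assumption; the sketch only shows the general mechanism.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Decode the three UTF-8 bytes from the question into a String.
    static String decode() {
        byte[] bytes = {(byte) 0xEC, (byte) 0x96, (byte) 0xB4};
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Re-encode the text in a charset that cannot represent Hangul,
    // as a console with a limited encoding might (assumption).
    static byte[] encodeAsAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String s = decode();
        System.out.println(s.length());                       // 1
        System.out.println(Integer.toHexString(s.charAt(0))); // c5b4
        byte[] ascii = encodeAsAscii(s);
        System.out.println((char) ascii[0]);                  // ?
    }
}
```

The string itself is intact in memory; the "?" only appears once the character is forced through an encoding that cannot represent it.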

Bozho

score 1 · Answer 3 · answered Oct 02 '12 at 21:21

You get the correct string; it is the Windows console that does not display it correctly.

Here is a link to an article that discusses a way to make the Java console produce correct Unicode output using JNI.
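
As a pure-Java alternative (not the JNI approach from the article), you can wrap the output stream in a PrintStream with an explicit UTF-8 encoding. This is only a sketch: the class and helper names are hypothetical, and it only helps if the terminal itself interprets bytes as UTF-8 (on Windows that means something like `chcp 65001`) and uses a font that contains Hangul glyphs.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    // Wrap a stream in a PrintStream that always encodes text as UTF-8,
    // regardless of the platform default encoding.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        out.println("\uC5B4"); // the character from the question, U+C5B4
    }
}
```

With this wrapper the bytes written are exactly EC 96 B4 followed by the line separator; whether they show up as 어 is then entirely up to the terminal.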

Sergey Kalinichenko

score 0 · Answer 4 · answered Oct 02 '12 at 21:35

JDB is displaying the data incorrectly. The code works the same on both Windows and Linux. Try running this more definitive test:

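import java.math.BigInteger;

public class Utf8Test {  // class name is arbitrary
    public static void main(String[] args) throws Exception {
        byte[] str = {(byte) 0xEC, (byte) 0x96, (byte) 0xB4};
        String s = new String(str, "UTF-8");
        // Print each char (UTF-16 code unit) as hex digits.
        for (int i = 0; i < s.length(); i++) {
            System.out.println(BigInteger.valueOf((int) s.charAt(i)).toString(16));
        }
    }
}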

This prints the hex value of every char (UTF-16 code unit) in the string. It prints "c5b4", i.e. U+C5B4 (어), on both Windows and Linux, because hex digits are plain ASCII and come through any console encoding unchanged.
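
If you want to check by hand why those three bytes come out as c5b4, here is a rough sketch of the bit arithmetic for a three-byte UTF-8 sequence (pattern 1110xxxx 10xxxxxx 10xxxxxx). The class and method names are hypothetical, and the method does no validation, so it is only meaningful for well-formed three-byte input.

```java
public class Utf8Manual {
    // Combine the payload bits of a three-byte UTF-8 sequence into a code point:
    // 4 bits from the lead byte, 6 bits from each continuation byte.
    static int decode3(byte b0, byte b1, byte b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        int cp = decode3((byte) 0xEC, (byte) 0x96, (byte) 0xB4);
        // 0xEC -> 0xC, 0x96 -> 0x16, 0xB4 -> 0x34
        // 0xC000 | 0x0580 | 0x0034 = 0xC5B4
        System.out.println(Integer.toHexString(cp)); // c5b4
    }
}
```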
