Java UTF-8 encoding produces incorrect output

Question

In Java, I've been trying to write a String to a file using UTF-8 encoding which will later be read by another program written in a different programming language. While doing so I noticed that the bytes created when encoding a String into a byte array didn't seem to have the correct byte values.

I narrowed the problem down to the symbol "£", which seems to produce incorrect bytes when encoded as UTF-8:

byte[] byteArray = "£".getBytes(Charset.forName("UTF-8"));

// Print each byte of the UTF-8 encoded string
// Mask with 0xFF so that each byte prints as an unsigned value (0-255)
for (byte signedByte : byteArray) {
  System.out.print((signedByte & 0xFF) + " ");
}

This outputs 6 bytes with the decimal values 239 190 130 239 189 163; in hex, that is ef be 82 ef bd a3.

However, http://www.utf8-chartable.de/ says that the UTF-8 encoding of "£" in hex is c2 a3, so the output should be 194 163.

Other strings seem to produce correct bytes when encoded as UTF-8, so I'm wondering why Java is producing these 6 bytes for "£", and how I should go about properly converting my Strings to byte arrays using UTF-8 encoding.

I have also tried

OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8");
out.write("£");
out.close();

but this produced the same 6 bytes.

I copied and pasted your code, and it produces 194 163, as expected. I doubt you're actually encoding a single £ character to get 6 bytes. Which platform and JDK are you using? Are you sure the code executed is the code you posted? — JB Nizet, Mar 01 '14 at 20:58
My full code is at http://pastebin.com/uV3m6Pri. I'm using jdk1.7.0_51 on Windows 7 64-bit. — user3369258, Mar 01 '14 at 21:02
@user3369258: But what encoding is your source file in, and how are you compiling? Try replacing each occurrence of `£` with `\u00a3` in your code, and I'm sure you'll find it works. — Jon Skeet, Mar 01 '14 at 21:04

score 5 · Accepted Answer · answered Mar 01 '14 at 21:00

I suspect the problem is that your Java code contains a string literal written by an editor that saves the file in one encoding, but you're then compiling without specifying that same encoding. In other words, I suspect that your "£" string is not actually a single pound sign at all.

This should be easy to validate. For example:

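// Print the UTF-16 code unit of every char the compiler put in the literal.
// A genuine pound sign gives exactly one value: 163 (U+00A3).
char[] chars = "£".toCharArray();
for (char c : chars) {
    System.out.println((int) c);
}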

To take the source encoding out of the equation, you can write the string in pure ASCII using a Unicode escape sequence:

String pound = "\u00a3";
// Now encode as before

I'm sure you'll then get the right bytes. For example:

import java.nio.charset.Charset;

class Test {
    public static void main(String[] args) throws Exception {
        String pound = "\u00a3";
        byte[] bytes = pound.getBytes(Charset.forName("UTF-8"));
        for (byte b : bytes) {
            System.out.println(b & 0xff); // 194, 163
        }
    }
}
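Since the original goal was a UTF-8 file that another program will read, the same check can be run against a file. The following is only a minimal sketch: the class name `PoundFileCheck`, the helper `writeAndReadBack`, and the temporary file are hypothetical names chosen for illustration. It uses `StandardCharsets.UTF_8` (available from Java 7) in place of `Charset.forName("UTF-8")`, and the Unicode escape so the result does not depend on how the source file is compiled.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class PoundFileCheck {
    // Writes the text to the file as UTF-8, then returns the raw bytes read back from disk.
    static byte[] writeAndReadBack(Path file, String text) throws IOException {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(text);
        }
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical scratch file, used only for this check.
        Path file = Files.createTempFile("pound", ".txt");
        try {
            // The escape keeps the literal independent of the source file's encoding.
            byte[] bytes = writeAndReadBack(file, "\u00a3");
            for (byte b : bytes) {
                System.out.print(Integer.toHexString(b & 0xff) + " "); // c2 a3
            }
            System.out.println();
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

If the file contains c2 a3, the encoding step is behaving correctly and any remaining problem lies in how the literal reached the compiler.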

Jon Skeet

Thank you, writing it out using the Unicode escape sequence worked! Running your first code block outputted: 65410 65379 – user3369258 Mar 01 '14 at 21:10
Great answer! Didn't think about the mismatch between the file and compiler encodings. – JB Nizet Mar 01 '14 at 21:13
To add to this solution: I believe the underlying cause of the problem Jon Skeet described was that my system locale was Japanese, which set my default file encoding to "MS932" and defaultCharset to "windows-31j" instead of "UTF-8". By setting the environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8, I got the JVM to start with UTF-8 as its default encoding rather than the system default, and the program worked without the Unicode escape sequence! See http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding – user3369258 Mar 01 '14 at 23:33
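The numbers reported in this thread fit that explanation. The sketch below assumes, without confirmation from the thread, that the editor saved the source file as UTF-8, so "£" sat on disk as the bytes c2 a3, and that the compiler then read those bytes as windows-31j. The class name `MojibakeDemo` and its helpers are hypothetical, and the windows-31j charset may not be present in every JVM, so the code checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decodes the bytes with the assumed charset, then re-encodes the result as UTF-8.
    // This mimics a compiler reading source bytes with the wrong encoding.
    static byte[] misreadThenEncode(byte[] sourceBytes, Charset assumed) {
        String literal = new String(sourceBytes, assumed);
        return literal.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated, two-digit lowercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: "£" as a UTF-8 editor would have saved it.
        byte[] poundInUtf8 = {(byte) 0xC2, (byte) 0xA3};
        String name = "windows-31j"; // default charset named in the comment above
        if (!Charset.isSupported(name)) {
            System.out.println(name + " is not available in this JVM");
            return;
        }
        System.out.println(toHex(misreadThenEncode(poundInUtf8, Charset.forName(name))));
    }
}
```

Under those assumptions, each of the two bytes is decoded as a separate half-width character (U+FF82 and U+FF63, which are 65410 and 65379 in decimal), and re-encoding them as UTF-8 yields ef be 82 ef bd a3, the six bytes from the question. If that is what happened, telling the compiler the real source encoding with javac's -encoding option is another way to remove the mismatch.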
