It seems like Java fails to correctly encode Strings when ProcessBuilder or Runtime.exec passes them along to the process they spawn, even with -Dfile.encoding set, for reasons that I don't understand. This means high-codepoint characters (Chinese, Japanese, etc.) aren't passed along to the child process.
As a simple example, compile the following two test classes, substituting your own JDK path in Test1 and whatever file path you like in Test2:
import java.io.IOException;
import java.nio.charset.Charset;

public class Test1 {
    public static void main(String[] args) throws IOException {
        String s = "因";
        System.out.println(bytesToHex(s.getBytes(Charset.forName("UTF-8"))));
        Runtime.getRuntime().exec(new String[]{"C:\\Program Files\\Java\\jdk1.6.0_45\\bin\\java.exe", "-cp", ".", "Test2", s});
    }

    public static String bytesToHex(byte[] bytes) {
        char[] hexArray = "0123456789ABCDEF".toCharArray();
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = hexArray[v >>> 4];
            hexChars[j * 2 + 1] = hexArray[v & 0x0F];
        }
        return new String(hexChars);
    }
}
and
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.Charset;

public class Test2 {
    public static void main(String[] args) throws IOException {
        FileWriter w = null;
        try {
            w = new FileWriter("<some directory>\\testoutput.txt");
            w.write(Test1.bytesToHex(args[0].getBytes(Charset.forName("UTF-8"))));
        } finally {
            if (w != null) w.close();
        }
    }
}
Then run Test1:
java -Dfile.encoding=UTF-8 Test1
Observe that Test1 prints out "E59BA0", whilst Test2 writes "3F" ('?') to file.
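My suspicion (which I can't confirm) is that the argument array is being encoded with the platform default charset, windows-1252 on my machine, rather than the one given by -Dfile.encoding. '因' has no mapping in windows-1252, so the encoder would substitute '?' (0x3F), which matches what Test2 receives. That is at least consistent with this check (the class name is just for illustration):

```java
import java.nio.charset.Charset;

public class EncodeCheck {
    public static void main(String[] args) {
        // '因' (U+56E0) is unmappable in windows-1252, so getBytes()
        // substitutes the single replacement byte '?' (0x3F).
        byte[] b = "因".getBytes(Charset.forName("windows-1252"));
        System.out.printf("%02X%n", b[0]);
    }
}
```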
Can anyone explain why this happens, and what the correct way to pass such strings to a child process is?
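For what it's worth, a workaround I'd expect to sidestep the problem (though I'd still like to understand the cause) is to pass the data over the child's stdin with an explicit charset on both ends, instead of via the argument array. A minimal sketch of the round trip, with a ByteArrayOutputStream standing in for the pipe between the two processes:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class StdinRoundTrip {
    public static void main(String[] args) throws IOException {
        Charset utf8 = Charset.forName("UTF-8");

        // Parent side: instead of putting s in the exec() argument array,
        // write it to the child's stdin (p.getOutputStream()) as UTF-8 bytes.
        // A ByteArrayOutputStream stands in for the pipe here.
        ByteArrayOutputStream pipe = new ByteArrayOutputStream();
        Writer toChild = new OutputStreamWriter(pipe, utf8);
        toChild.write("因");
        toChild.close();

        // Child side: read System.in through an explicit UTF-8 decoder
        // instead of taking the value from args[0].
        BufferedReader fromParent = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(pipe.toByteArray()), utf8));
        String received = fromParent.readLine();
        System.out.println(received.equals("因"));
    }
}
```

Since both ends name the charset explicitly, the platform's argument encoding never gets involved.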