Different codepoints for same character in MacOS and Windows

Question

I have a small piece of code in which I am checking the code point of the character Ü.

Locale lc = Locale.getDefault();
System.out.println(lc.toString());
System.out.println(Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));
String inUnicode = "\u00dc";
String glyph = "Ü";
System.out.println("inUnicode " + inUnicode + " code point " + inUnicode.codePointAt(0));
System.out.println("glyph " + glyph + " code point " + glyph.codePointAt(0));

I get a different code point value when I run this code on macOS and on Windows 10; see the output below.

Output on MacOS

en_US
UTF-8
UTF-8
inUnicode Ü code point 220
glyph Ü code point 220

Output on Windows

en_US
windows-1252
Cp1252
inUnicode Ü code point 220
glyph ?? code point 195

I checked the code page for windows-1252 at https://en.wikipedia.org/wiki/Windows-1252#Character_set, where the code point for Ü is 220. Why do I get code point 195 on Windows for String glyph = "Ü";? As I understand it, glyph should have been rendered properly and the code point should have been 220, since Ü is defined in Windows-1252.

If I replace String glyph = "Ü"; with String glyph = new String("Ü".getBytes(), Charset.forName("UTF-8")); then glyph is rendered correctly and codepoint value is 220. Is this the correct and efficient way to standardize behavior of String on any OS irrespective of locale and charset?

Remy Lebeau · Accepted Answer · 2018-11-01T01:18:07.447


195 is 0xC3 in hex.

In UTF-8, Ü is encoded as bytes 0xC3 0x9C.
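To check the byte sequence yourself, here is a minimal sketch. The class and method names (`Utf8BytesDemo`, `utf8Hex`) are made up for illustration, and the literal is written as a Unicode escape so the result does not depend on how the source file is encoded.

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {
    // Returns the UTF-8 encoding of s as space-separated uppercase hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal is an ASCII-only Unicode escape for U+00DC.
        System.out.println(utf8Hex("\u00dc")); // prints C3 9C
    }
}
```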

System.getProperty("file.encoding") says the default file encoding on Windows is not UTF-8, but clearly your Java file is actually encoded in UTF-8. The fact that println() is outputting glyph ?? (note 2 ?, meaning 2 chars are present), and that you are able to decode the raw string bytes using the UTF-8 Charset, proves this.

glyph should contain a single char whose value is 0x00DC, not 2 chars. codePointAt(0) returns 0x00C3 (195) on Windows because your Java file is encoded in UTF-8 but is being read as if it were encoded in Windows-1252, so the 2 bytes 0xC3 0x9C get decoded as the 2 characters 0x00C3 (Ã) and 0x0153 (œ, which is what Windows-1252 assigns to byte 0x9C) instead of as the single character 0x00DC.
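The mis-decoding can be reproduced at runtime without touching the source encoding at all. The sketch below (hypothetical names `MisdecodeDemo` and `misdecode`) encodes Ü as UTF-8 and then decodes those bytes as windows-1252, which is roughly what happens to the literal when the file is read with the wrong charset. Note that windows-1252 maps byte 0x9C to U+0153 (œ), so the second char is 339 rather than 156.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Encodes s as UTF-8, then decodes the bytes as windows-1252,
    // mimicking a UTF-8 source file that is read with the wrong charset.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String wrong = misdecode("\u00dc");
        System.out.println(wrong.length());       // prints 2
        System.out.println(wrong.codePointAt(0)); // prints 195
        System.out.println(wrong.codePointAt(1)); // prints 339
    }
}
```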

You need to specify the actual file encoding when running Java, e.g.:

java -Dfile.encoding=UTF-8 ...
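One thing worth adding, which is my own note rather than part of the answer above: string literals are decoded when the source is compiled, so the encoding `javac` uses matters as well. `javac` has an `-encoding` option (for example `javac -encoding UTF-8 ...`) for exactly that. To see what actually ended up in a literal, a small diagnostic like the following can help. The names `LiteralCheck` and `describe` are hypothetical, and the demo inputs are written as escapes so the sketch itself is encoding-independent.

```java
class LiteralCheck {
    // Formats every code point of s as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // A correctly decoded literal holds one code point.
        System.out.println(describe("\u00dc"));       // prints U+00DC
        // A UTF-8 literal misread as windows-1252 holds two.
        System.out.println(describe("\u00c3\u0153")); // prints U+00C3 U+0153
    }
}
```

In your own program you would pass the suspicious literal (here, `glyph`) to `describe` and compare the result with the expected `U+00DC`.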

edited Nov 01 '18 at 01:18

answered Oct 31 '18 at 21:10

Remy Lebeau


Thanks for explaining why `195` is returned on Windows. I tried `java -Dfile.encoding=UTF-8` on Windows but it didn't work; `glyph` is still `??`, although the flag did change the charset to `UTF-8`. Also, `inUnicode.equals(glyph)` should ideally evaluate to `true` (which is the case on macOS), but without decoding the raw string bytes using UTF-8 it evaluates to `false` on Windows. – pradystar Nov 01 '18 at 00:44
"*I tried `java -Dfile.encoding=UTF-8` on Windows but it didn't work; `glyph` is still `??`*" - it shouldn't be. That means it is still being processed as Windows-1252. Try `=UTF8` instead of `=UTF-8`, though [both should work](https://stackoverflow.com/questions/6031877/). – Remy Lebeau Nov 01 '18 at 01:22
It seems the character encoding had been cached. See https://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding. I was able to get it working after setting the flag in the environment variable `JAVA_TOOL_OPTIONS`. Thanks again for the explanation. However, rather than using the flag, isn't it a better and safer option to use `getBytes`? – pradystar Nov 01 '18 at 02:59
@pradystar "*isn't it a better and safer option to use `getBytes`?*" - no, because the string contents have already been messed up, risking data loss, as soon as the source file is parsed in the wrong encoding, before you ever have a chance to call `getBytes()`. Best to tell Java the correct file encoding up front so the string contents are correct to begin with. `getBytes()` is a hack, not a real solution. If you don't want to fix the encoding, then you have to use `\u` escapes in your literals to avoid the issue altogether. – Remy Lebeau Nov 01 '18 at 03:10
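To make the "hack" point concrete, here is a rough sketch that simulates the misread and the `getBytes()` repair at runtime, so you can test which characters survive on your own JDK. The names `GetBytesHackDemo` and `survivesRepair` are hypothetical. The repair works for Ü because both of its UTF-8 bytes have a windows-1252 mapping. Some bytes (0x9D, for example) are left undefined by windows-1252, and as far as I know Java's decoder replaces them, in which case the original character cannot be recovered. Treat that second case as something to verify, not as a guarantee.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesHackDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates a UTF-8 literal misread as windows-1252, applies the
    // getBytes() repair, and reports whether the original came back.
    static boolean survivesRepair(String original) {
        String misread = new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
        String repaired = new String(misread.getBytes(CP1252), StandardCharsets.UTF_8);
        return repaired.equals(original);
    }

    public static void main(String[] args) {
        // U+00DC: UTF-8 bytes C3 9C, both mapped in windows-1252.
        System.out.println(survivesRepair("\u00dc")); // prints true
        // U+00DD: UTF-8 bytes C3 9D; the outcome depends on how the JDK handles 0x9D.
        System.out.println(survivesRepair("\u00dd"));
    }
}
```

A literal written as an escape, as in the calls above, sidesteps all of this because the source bytes are plain ASCII.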
