Eclipse debugger showing cjk character wrongly - Java

Question

I am working on a bug where a CJK character is displayed wrongly. I am testing with a simple programme in Eclipse, and while debugging, the 'Variables' section of the debugger displays one CJK character wrongly. See the screenshot below. I just assigned the value "𠮷野家xyz" to a variable, and the Eclipse debugger displays it wrongly: the character '𠮷', which is represented as a surrogate pair, is replaced with a square. But when I print it using sysout, it is displayed correctly. The default charset is 'UTF-8', as you can see from the first line printed in the console. Can someone help me understand why Eclipse shows it wrongly?

Eclipse showing CJK character wrongly:

It is not just a square, there is a small '.' as well. It looks like it is showing the two parts of the pair separately for some reason. Note that the Charset setting is not relevant to this. — greg-449, May 12 '22 at 09:42
Even the [Unicode website](https://util.unicode.org/UnicodeJsps/character.jsp?a=20BB7&B1=Show) has no representation for this, so probably not a bug — g00se, May 12 '22 at 09:46
@g00se You mentioned that the Unicode website has no representation for this, but the link you provided shows one. May I know why you said so? — Justin Mathew, May 12 '22 at 15:10
Our browsers must be working differently. The box where the glyph should be is empty on my machine — g00se, May 12 '22 at 15:20
@g00se So we agree that our browsers must be working differently. What if code that does encoding and decoding works well on one web server and not on another? That is a possible issue, isn't it? — Justin Mathew, May 13 '22 at 04:56
Yes, browsers could be a problem, but is that relevant to what your debugger is showing? I would guess that the font in the debugger doesn't have the glyph. You could perhaps somehow change the debugger's font to something that does, e.g. the font 'Code2002' — g00se, May 13 '22 at 08:49

Till Brychcy · Answer 1 · 2022-06-01T08:29:27.297

The character "𠮷" is what Unicode calls a supplementary character, with code point U+20BB7; its UTF-8 encoding is F0 A0 AE B7.
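To make the numbers above concrete, here is a minimal sketch that prints the code point, the surrogate pair, and the UTF-8 bytes for U+20BB7 using only standard Java APIs. The class name `SupplementaryCharDemo` and the helper `toHex` are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryCharDemo {

    /** Formats bytes as space-separated upper-case hex. */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+20BB7 written as its surrogate pair so the source file stays pure ASCII.
        String s = "\uD842\uDFB7";

        // One code point, but two UTF-16 code units (chars).
        System.out.println(s.codePointCount(0, s.length()));       // 1
        System.out.println(s.length());                            // 2

        // charAt sees the two surrogates; codePointAt sees the whole character.
        System.out.println(Integer.toHexString(s.charAt(0)));      // d842
        System.out.println(Integer.toHexString(s.charAt(1)));      // dfb7
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 20bb7

        // Standard UTF-8 uses a four-byte sequence for this code point.
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8))); // F0 A0 AE B7
    }
}
```

The point of the sketch is that a `String` stores this character as two `char` values, so any code that works one `char` or one three-byte sequence at a time has to treat it specially.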

Support for such characters was only added to Java in version 1.5 (by JSR 204), but the code in Eclipse's jdt.debug that reads strings in UTF-8 format is older than that.

If you look at the implementation of org.eclipse.jdi.internal.jdwp.JdwpString.read(DataInputStream), you can see that it was never updated to handle supplementary characters (which have four-byte sequences starting with 0xF*).

It just checks that the upper nibble of the first byte is >= 14 (0xE), effectively interpreting the character's UTF-8 sequence as E0 A0 AE B7, which corresponds to the sequence U+082E U+00B7. U+082E is not an assigned Unicode character, which is why a rectangle is drawn for it; U+00B7 is the middle dot, which accounts for the small '.' shown next to the rectangle.
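The misreading can be reproduced with a small stand-alone decoder. The following is only a sketch of the behaviour described above, not the actual Eclipse source: the class `JdwpUtf8Sketch` and its methods are hypothetical, and input validation is left out. One variant treats every lead byte with an upper nibble >= 0xE as the start of a three-byte sequence; the other adds a branch for four-byte sequences.

```java
import java.nio.charset.StandardCharsets;

public class JdwpUtf8Sketch {

    /** Decodes as described above: any lead byte >= 0xE0 is taken as a 3-byte sequence. */
    public static String decodeBmpOnly(byte[] utf) {
        return decode(utf, false);
    }

    /** Same loop, plus a branch for 4-byte sequences (lead bytes 0xF0..0xF7). */
    public static String decodeWithSupplementary(byte[] utf) {
        return decode(utf, true);
    }

    // Illustration only: no checks for malformed continuation bytes or truncated input.
    private static String decode(byte[] utf, boolean fourByteAware) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < utf.length) {
            int a = utf[i] & 0xFF;
            if ((a >> 4) < 12) {
                // 0x00..0xBF: copied through as one char (stray continuation bytes end up here too)
                sb.append((char) a);
                i += 1;
            } else if ((a >> 4) < 14) {
                // 0xC0..0xDF: two-byte sequence
                int b = utf[i + 1] & 0xFF;
                sb.append((char) (((a & 0x1F) << 6) | (b & 0x3F)));
                i += 2;
            } else if (fourByteAware && (a >> 3) == 0x1E) {
                // 0xF0..0xF7: four-byte sequence, i.e. a supplementary code point
                int b = utf[i + 1] & 0xFF;
                int c = utf[i + 2] & 0xFF;
                int d = utf[i + 3] & 0xFF;
                int cp = ((a & 0x07) << 18) | ((b & 0x3F) << 12) | ((c & 0x3F) << 6) | (d & 0x3F);
                sb.appendCodePoint(cp); // stored as a surrogate pair
                i += 4;
            } else {
                // upper nibble >= 0xE: assumed to be a three-byte sequence
                int b = utf[i + 1] & 0xFF;
                int c = utf[i + 2] & 0xFF;
                sb.append((char) (((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F)));
                i += 3;
            }
        }
        return sb.toString();
    }

    /** Lists the code points of a string as U+XXXX values. */
    public static String codePointsHex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The string from the question, with U+20BB7 written as a surrogate pair.
        byte[] bytes = "\uD842\uDFB7\u91CE\u5BB6xyz".getBytes(StandardCharsets.UTF_8);

        // prints: U+082E U+00B7 U+91CE U+5BB6 U+0078 U+0079 U+007A
        System.out.println(codePointsHex(decodeBmpOnly(bytes)));
        // prints: U+20BB7 U+91CE U+5BB6 U+0078 U+0079 U+007A
        System.out.println(codePointsHex(decodeWithSupplementary(bytes)));
    }
}
```

In the three-byte-only variant, F0 A0 AE is consumed as one sequence (giving U+082E) and the leftover B7 is copied through as U+00B7, which matches the rectangle plus dot seen in the Variables view.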

If you want to report this issue, the bug tracker for this Eclipse component is here.

score 0 · Answer 2 · answered May 31 '22 at 12:38

Looks like this is a bug in Eclipse IDE, Variables window.

I added a detail formatter to get the Unicode escapes for the text "𠮷野家xyz", then decoded the returned escapes back to Unicode text using an online tool. Here are the outputs I got.

Detail Formatter Code

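// 'this' is the String being inspected in the Variables view.
StringBuilder unicodeStr = new StringBuilder();
for (int i = 0; i < this.length(); i++) {
    // OR-ing with 0x10000 and dropping the leading '1' zero-pads the hex value to four digits.
    unicodeStr.append("\\u").append(Integer.toHexString(this.charAt(i) | 0x10000).substring(1));
}
return unicodeStr.toString();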

Detail Formatter Output

\ud842\udfb7\u91ce\u5bb6\u0078\u0079\u007a

Screenshots

I used this online unicode converter to check the result.

Looks like the data in the variable still corresponds to the correct text, but the IDE can't render it. So I think this should be a bug in Eclipse.
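The same check can be done without an online tool. Below is a minimal sketch that decodes the detail formatter output back into a String and compares it with the expected text; the class `EscapeRoundTrip` and its methods are made up for this illustration, and `unescape` assumes the input consists only of well-formed six-character escapes.

```java
public class EscapeRoundTrip {

    /** Mirrors the detail formatter: one escape per UTF-16 code unit. */
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append("\\u").append(Integer.toHexString(s.charAt(i) | 0x10000).substring(1));
        }
        return sb.toString();
    }

    /** Turns a run of six-character escapes back into a String (assumes well-formed input). */
    public static String unescape(String escaped) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < escaped.length(); i += 6) {
            // Skip the backslash and the 'u', then parse the four hex digits.
            sb.append((char) Integer.parseInt(escaped.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The detail formatter output from above, as a Java string literal.
        String formatterOutput = "\\ud842\\udfb7\\u91ce\\u5bb6\\u0078\\u0079\\u007a";
        // The expected text, with U+20BB7 written as a surrogate pair.
        String expected = "\uD842\uDFB7\u91CE\u5BB6xyz";

        String decoded = unescape(formatterOutput);
        System.out.println(decoded.equals(expected));                    // true
        System.out.println(decoded.length());                            // 7 UTF-16 code units
        System.out.println(decoded.codePointCount(0, decoded.length())); // 6 characters
    }
}
```

Since the decoded value equals the expected text, the string held by the debugged JVM is intact and only its display in the debugger is affected.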
