3

I am currently writing a program to read Java class files. At the moment, I am reading the constant pool of the class file (read here) and printing it to the console. But when it gets printed, some of the Unicode seems to mess up my terminal so that it looks like this (in case it matters, the class file I'm reading was compiled from Kotlin, and the terminal I am using is the IntelliJ IDEA terminal, though it doesn't seem to glitch out when using the regular Ubuntu terminal): Messed up terminal on IntelliJ IDEA

The thing I noticed is a weird Unicode sequence which might be some kind of escape sequence, I think.

Here is the entire output without the strange Unicode sequence:

{1=UTF8: (42)'deerangle/decompiler/main/DecompilerMainKt', 2=Class index: 1, 3=UTF8: (16)'java/lang/Object', 4=Class index: 3, 5=UTF8: (4)'main', 6=UTF8: (22)'([Ljava/lang/String;)V', 7=UTF8: (35)'Lorg/jetbrains/annotations/NotNull;', 8=UTF8: (4)'args', 9=String index: 8, 10=UTF8: (30)'kotlin/jvm/internal/Intrinsics', 11=Class index: 10, 12=UTF8: (23)'checkParameterIsNotNull', 13=UTF8: (39)'(Ljava/lang/Object;Ljava/lang/String;)V', 14=Method name index: 12; Type descriptor index: 13, 15=Bootstrap method attribute index: 11; NameType index: 14, 16=UTF8: (12)'java/io/File', 17=Class index: 16, 18=UTF8: (6)'<init>', 19=UTF8: (21)'(Ljava/lang/String;)V', 20=Method name index: 18; Type descriptor index: 19, 21=Bootstrap method attribute index: 17; NameType index: 20, 22=UTF8: (15)'getAbsolutePath', 23=UTF8: (20)'()Ljava/lang/String;', 24=Method name index: 22; Type descriptor index: 23, 25=Bootstrap method attribute index: 17; NameType index: 24, 26=UTF8: (16)'java/lang/System', 27=Class index: 26, 28=UTF8: (3)'out', 29=UTF8: (21)'Ljava/io/PrintStream;', 30=Method name index: 28; Type descriptor index: 29, 31=Bootstrap method attribute index: 27; NameType index: 30, 32=UTF8: (19)'java/io/PrintStream', 33=Class index: 32, 34=UTF8: (5)'print', 35=UTF8: (21)'(Ljava/lang/Object;)V', 36=Method name index: 34; Type descriptor index: 35, 37=Bootstrap method attribute index: 33; NameType index: 36, 38=UTF8: (19)'[Ljava/lang/String;', 39=Class index: 38, 40=UTF8: (17)'Lkotlin/Metadata;', 41=UTF8: (2)'mv', 42=Int: 1, 43=Int: 11, 44=UTF8: (2)'bv', 45=Int: 0, 46=Int: 2, 47=UTF8: (1)'k', 48=UTF8: (2)'d1', 49=UTF8: (58)'WEIRD_UNICODE_SEQUENCE', 50=UTF8: (2)'d2', 51=UTF8: (0)'', 52=UTF8: (10)'Decompiler', 53=UTF8: (17)'DecompilerMain.kt', 54=UTF8: (4)'Code', 55=UTF8: (18)'LocalVariableTable', 56=UTF8: (15)'LineNumberTable', 57=UTF8: (13)'StackMapTable', 58=UTF8: (36)'RuntimeInvisibleParameterAnnotations', 59=UTF8: (10)'SourceFile', 60=UTF8: (20)'SourceDebugExtension', 61=UTF8: (25)'RuntimeVisibleAnnotations'}
AccessFlags: {ACC_PUBLIC, ACC_FINAL, ACC_SUPER}

And here is the Unicode sequence opened in Sublime Text: Strange unicode in sublime text

My questions about this whole thing are: Why is this Unicode breaking the console in IntelliJ IDEA? Is this common in Kotlin class files? And what could one do to remove all such "escape sequences" from a String before printing it?

Ian Rehwinkel
  • The Unicode is not messing up your terminal. The constant pool of a class file is not a string; it is a block of binary data which also happens to contain some strings embedded in it, among various other bytes that either correspond to non-printable characters, or do not even correspond to Unicode characters. On your terminal you are seeing exactly what you should see when printing non-printable characters or sequences of bytes that make no sense in Unicode. I am afraid you are going to need to show us your code if you need any more help. – Mike Nakis Nov 23 '18 at 19:27
  • Actually I am not printing the raw binary data. I have read the binary data as described in the documentation on class files by Oracle. What you're seeing is basically just UTF-8-decoded byte arrays. The other UTF-8 strings print fine too, as you might have seen. Only this one UTF-8 String didn't print properly and seemed to have broken the CLI on IntelliJ IDEA. – Ian Rehwinkel Nov 23 '18 at 19:32
  • Which version of Java are you using? – Basil Bourque Nov 23 '18 at 19:50

3 Answers

5

For some unfathomable reason, when Sun Microsystems were designing Java, they decided to encode strings in the constant pool using an encoding that is not UTF8. It is a custom encoding used only by the Java compiler and by class loaders.

Adding insult to injury, in the JVM documentation they decided to call this UTF8. But it is not UTF8, and their choice of name causes a lot of unnecessary confusion. So, what I am speculating here is that you saw that they call it UTF8, so you are treating it like real UTF8, and you are receiving garbage as a result.

You will need to look for the description of CONSTANT_Utf8_info in the JVM specification and write an algorithm that decodes strings according to their specification.

For your convenience, here is some code that I have written doing just that:

public static char[] charsFromBytes( byte[] bytes )
{
    int t = 0;
    int end = bytes.length;
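    /* First pass: count how many chars the decoded string will contain. */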
    for( int s = 0;  s < end;  )
    {
        int b1 = bytes[s] & 0xff;
        if( b1 >> 4 >= 0 && b1 >> 4 <= 7 ) /* 0b0xxx_xxxx */
            s++;
        else if( b1 >> 4 >= 12 && b1 >> 4 <= 13 ) /* 0b110x_xxxx 0b10xx_xxxx */
            s += 2;
        else if( b1 >> 4 == 14 ) /* 0b1110_xxxx 0b10xx_xxxx 0b10xx_xxxx */
            s += 3;
        else
            s++; /* invalid leading byte; counted as one char here, flagged by the assert in the second pass */
        t++;
    }
    char[] chars = new char[t];
    t = 0;
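    /* Second pass: decode each byte sequence into a char. */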
    for( int s = 0;  s < end;  )
    {
        int b1 = bytes[s++] & 0xff;
        if( b1 >> 4 >= 0 && b1 >> 4 <= 7 ) /* 0b0xxx_xxxx */
            chars[t++] = (char)b1;
        else if( b1 >> 4 >= 12 && b1 >> 4 <= 13 ) /* 0b110x_xxxx 0b10xx_xxxx */
        {
            assert s < end : new IncompleteUtf8Exception( s );
            int b2 = bytes[s++] & 0xff;
            assert (b2 & 0xc0) == 0x80 : new MalformedUtf8Exception( s - 1 );
            chars[t++] = (char)(((b1 & 0x1f) << 6) | (b2 & 0x3f));
        }
        else if( b1 >> 4 == 14 ) /* 0b1110_xxxx 0b10xx_xxxx 0b10xx_xxxx */
        {
            assert s < end : new IncompleteUtf8Exception( s );
            int b2 = bytes[s++] & 0xff;
            assert (b2 & 0xc0) == 0x80 : new MalformedUtf8Exception( s - 1 );
            assert s < end : new IncompleteUtf8Exception( s );
            int b3 = bytes[s++] & 0xff;
            assert (b3 & 0xc0) == 0x80 : new MalformedUtf8Exception( s - 1 );
            chars[t++] = (char)(((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f));
        }
        else
            assert false;
    }
    return chars;
}
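
For context, here is a minimal usage sketch, assuming in is a DataInputStream positioned just after the tag byte of a CONSTANT_Utf8_info entry (the variable names are only illustrative):

int length = in.readUnsignedShort();                  // the u2 length field of CONSTANT_Utf8_info
byte[] bytes = new byte[length];
in.readFully( bytes );                                // the modified-UTF-8 encoded payload
String value = new String( charsFromBytes( bytes ) );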
Mike Nakis
  • Holy cow. Do you know if this is true in the current versions of Java? Is this just a problem with Oracle/OpenJDK specific implementations? In other words, is the implementation ignoring/violating the JVM specification as written? – Basil Bourque Nov 23 '18 at 19:53
  • I can confirm that this is true up to and including version '52' (the version number in the bytecode), which is the last version that I checked (it corresponds to Java 8), and I would be willing to bet that it is still the same. They are probably stuck with it for reasons of backwards compatibility, so it will probably stay like that forever. It is not a problem in anyone's specific implementation, because the JVM specification requires strings to be encoded this way. – Mike Nakis Nov 23 '18 at 19:57
  • I have incorporated your code snippet into my code, but the error persists. There is a difference in the output though: some of the strange Unicode characters disappeared. – Ian Rehwinkel Nov 23 '18 at 20:12
  • Hmmm, well, lots of things could be going wrong with how you are reading the constant pool. There are many pitfalls. I do not know how to help without seeing your code. – Mike Nakis Nov 23 '18 at 20:16
  • Actually, the JVM specification refers to this as "Modified UTF-8", *not* "UTF-8". And the reasons are far from unfathomable - this dates back to when UTF16 was ascendant, and avoiding null bytes is perfectly understandable. – Antimony Nov 23 '18 at 22:04
  • @Antimony thank you for the input. Yes, the word "modified" is used somewhere. Once. Then the data type is referred to everywhere as `CONSTANT_Utf8_info` not as `CONSTANT_MODIFIED_Utf8_info`. – Mike Nakis Nov 24 '18 at 01:29
  • First, misinterpreting modified UTF-8 as UTF-8 rarely has such dramatic consequences. It only differs in how `U+0000` and characters outside the BMP are encoded, neither being very common. And a decoder enforcing strict UTF-8 would simply throw an exception when encountering these mismatches. And you can simply use existing decoding routines like [`DataInputStream.readUTF`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInputStream.html#readUTF(java.io.DataInput)), which are documented to read *Modified UTF-8*. Besides that, the way this code abuses `assert` is horrible. – Holger Nov 26 '18 at 07:43
  • @Holger, this code does not abuse assert. This code makes the most awesome use of assert you have ever seen. But you are right on all other accounts. – Mike Nakis Nov 26 '18 at 17:24
  • The fact that the reader needs a lot of time to understand what these statements will actually do doesn't make them “awesome”. And they are inconsistent. Apparently, they are supposed to check conditions which should have been checked and flagged previously, but they don't say so. Instead, they only provide an exception as their cause, not a hint that this exception should have been thrown at an earlier time. To make matters worse, the `IncompleteUtf8Exception` *is* never thrown in the previous loop. Not to speak of `assert false;`. There is no reason to allow turning off the error here. – Holger Nov 27 '18 at 07:50
  • Generally, assertions should be only used to state invariants of your code, not to check input. Actually, this seems to be the intention of the code, as *some* conditions are checked and answered with an unconditional exception in the first part, so it would be valid to use assertions where they should be impossible to encounter again, but when I pass, e.g. `new byte[] { -61, 'a' }` as input, it will not throw but produce some result when assertions are turned off. So in the end, the resulting behavior is just inconsistent. – Holger Nov 27 '18 at 07:56
  • You are right in that the first part which calculates the length of the output array did in fact behave differently from the second part with respect to exceptions, so I just fixed that. The first part now just does not do any checking or throwing. – Mike Nakis Nov 27 '18 at 13:31
  • The rest is error checking and exception throwing exactly as I intend it to be. – Mike Nakis Nov 27 '18 at 13:32
4

Mike's answer already covered the fact that Java classfiles don't quite use UTF8 encoding, but I figured I would provide more information about it.

The encoding used in Java classfiles is called Modified UTF-8 (or MUTF-8). It differs from regular UTF-8 in two ways:

  • The null character (U+0000) is encoded using a two-byte sequence
  • Code points outside the BMP are represented with a surrogate pair, as in UTF-16; each surrogate in the pair is in turn encoded in three bytes using the regular UTF-8 encoding (both differences are illustrated in the sketch below)
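
For illustration, here is a minimal sketch of the byte-level difference; the MUTF-8 sequences are written out by hand (they are not produced by any library call), for comparison with what the JDK's standard UTF-8 encoder emits:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Mutf8Demo {
    public static void main(String[] args) {
        // U+0000: standard UTF-8 uses a single zero byte, MUTF-8 uses the two-byte form C0 80.
        System.out.println(Arrays.toString("\0".getBytes(StandardCharsets.UTF_8)));           // [0]
        System.out.println(Arrays.toString(new byte[] {(byte) 0xC0, (byte) 0x80}));           // the MUTF-8 form

        // U+1F600 (outside the BMP): standard UTF-8 uses one four-byte sequence, while MUTF-8
        // encodes each UTF-16 surrogate (U+D83D, U+DE00) as its own three-byte sequence.
        System.out.println(Arrays.toString("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8))); // F0 9F 98 80
        System.out.println(Arrays.toString(new byte[] {
                (byte) 0xED, (byte) 0xA0, (byte) 0xBD, (byte) 0xED, (byte) 0xB8, (byte) 0x80})); // the MUTF-8 form
    }
}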

The first change ensures that the encoded data does not contain raw null bytes, which makes things easier to process when writing C code. The second change is a consequence of the fact that back in the 90s, UTF-16 was all the rage and it wasn't clear that UTF-8 would eventually win out. In fact, Java uses 16-bit characters for a similar reason. Encoding astral characters with surrogate pairs makes things much easier to handle in a 16-bit world. Note that JavaScript, designed around the same time, has similar issues with UTF-16 strings.

Anyway, encoding and decoding MUTF-8 is pretty easy. It's just annoying since it isn't builtin anywhere. When decoding, you decode in the same way as UTF-8; you just have to be more tolerant and accept sequences that are technically not valid UTF-8 (despite using the same encoding scheme), and then replace surrogate pairs as applicable. When encoding, you do the reverse.

Note that this applies only to Java bytecode. Programmers in Java will typically not have to deal with MUTF-8 as Java uses a mixture of UTF-16 and true UTF-8 everywhere else.

Antimony
  • “It's just annoying since it isn't builtin anywhere.” Except, it’s builtin since Java 1.0. See [`DataInputStream.readUTF`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInputStream.html#readUTF(java.io.DataInput)) and [`DataOutputStream.writeUTF`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataOutputStream.html#writeUTF(java.lang.String)). The documentation of the `DataInput` interface also [describes the Modified UTF-8 format](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#modified-utf-8). – Holger Nov 26 '18 at 08:06
  • @Holger, Sorry I was referring to Python, which has no native MUTF8 support. Just because you're writing tools that parse classfiles doesn't mean you have to actually use Java. – Antimony Nov 27 '18 at 04:46
  • So it should have been “…isn't builtin anywhere *except Java*”. That's likely to be true, as other languages have no reason to provide such support. Sometimes, you got away with using a UTF-8 decoder which doesn't reject these forms (before Unicode 3.1, it wouldn't even have been wrong), but it ran on a razor's edge, as the next library update could tighten the checks. It even [happened to Java](https://stackoverflow.com/q/25404373/2711488). – Holger Nov 27 '18 at 08:10
3

IntelliJ’s console most likely interprets certain characters of the string as the start of terminal control sequences (compare to Colorize console output in IntelliJ products).

Most likely, it will be an ANSI terminal emulation, which you can verify easily by executing

System.out.println("Hello "
    + "\33[31mc\33[32mo\33[33ml\33[34mo\33[35mr\33[36me\33[37md"
    + " \33[30mtext");

If you see this text printed using different colors, it’s an ANSI terminal compatible interpretation.

But it’s always a good idea to remove control characters when printing strings from an unknown source. The string constants from a class file are not required to have human-readable content.

A simple way to do this is

System.out.println(string.replaceAll("\\p{IsControl}", "."));

which will replace all control characters with a dot before printing.

If you want to get some diagnostic regarding the actual char value, you could use, e.g.

System.out.println(Pattern.compile("\\p{IsControl}").matcher(string)
    .replaceAll(mr -> String.format("{%02X}", (int)string.charAt(mr.start()))));

This requires Java 9, but of course, the same logic can be implemented for earlier Java versions as well. It would only require a bit more verbose code.
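
For example, a rough Java 8 equivalent could use the classic appendReplacement loop (the helper name escapeControlChars is just for illustration):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

static String escapeControlChars(String string) {
    Matcher m = Pattern.compile("\\p{IsControl}").matcher(string);
    StringBuffer sb = new StringBuffer(); // the StringBuilder overload of appendReplacement only exists since Java 9
    while(m.find())
        m.appendReplacement(sb, String.format("{%02X}", (int)string.charAt(m.start())));
    m.appendTail(sb);
    return sb.toString();
}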

The Pattern instance returned by Pattern.compile("\\p{IsControl}") can be stored and reused.

Holger