Java String internal representation

Question

I understand that Java's internal representation of a String is UTF-16. What is the Java String representation?

Also, I know that in a UTF-16 String, each 'character' is encoded with one or two 16-bit code units.
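The one-or-two code unit rule can be checked from Java itself. The following is a minimal sketch, not taken from the thread: the class name CodeUnits and its two helper methods are hypothetical names chosen for illustration. It compares the number of 16-bit code units (String.length()) with the number of Unicode code points.

```java
public class CodeUnits {
    // Number of UTF-16 code units, which is what String.length() reports.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (user-visible "characters" in the simple case).
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "Hello";
        // U+1F600 lies outside the Basic Multilingual Plane,
        // so it needs a surrogate pair (two code units).
        String emoji = "\uD83D\uDE00";

        System.out.println(codeUnits(ascii) + " " + codePoints(ascii)); // prints "5 5"
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji)); // prints "2 1"
        System.out.println(Character.BYTES); // prints "2": a char is always 16 bits wide
    }
}
```

Note that this only shows what the String API exposes; it says nothing about how a particular JVM lays the data out in memory.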

However, when I debug the following Java code

String hello = "Hello";

the variable hello is an array of 5 bytes, 0x48, 0x65, 0x6C, 0x6C, 0x6F (decimal 72, 101, 108, 108, 111), which is ASCII for "Hello".

How can this be?

How are you debugging this? It's just a char array. – Ferrybig, Jan 27 '16 at 08:22
[link](http://postimg.org/image/udpk662y5/) This is a screenshot from my IntelliJ debugger. Yes, Ferrybig, it is an array of characters. – Yoaz Menda, Jan 27 '16 at 08:31
Thanks for your quick responses, guys. However, I still fail to understand: each of these chars seems to be one byte, not 2 or 4 as it should be in UTF-16. – Yoaz Menda, Jan 27 '16 at 08:33
How do you know that? The IntelliJ IDEA debugger does not show how many bytes are used to store a `char` value. – yole, Jan 27 '16 at 08:37

score 3 · Rob Audenaerde · Accepted Answer · 2016-01-27T09:53:50.717

I ran a small Java process with this code:

class Hi {
    public static void main(String[] args) {
        String hello = "Hello";
        try {
            // Keep the process alive long enough to take the memory dump.
            Thread.sleep(60_000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

I then took a memory dump of it with gcore on Ubuntu (using jps to get the pid and passing that to gcore).

Using a hex editor, I found this in the dump: 48 65 6C 6C 6F, so the string is somewhere in memory as ASCII.

But I also found 48 00 65 00 6C 00 6C, which is part of the UTF-16 (little-endian) representation of the String.
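Both byte sequences can be reproduced without a memory dump by encoding the string explicitly. This is only a sketch: the class name HelloBytes and the hex helper are hypothetical, and it shows what the two encodings look like, not where the JVM actually keeps them. UTF-16LE is used on the assumption that the dump came from a little-endian machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HelloBytes {
    // Encodes s with the given charset and renders the bytes as space-separated hex.
    public static String hex(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One byte per character, as found in the first hex editor match.
        System.out.println(hex("Hello", StandardCharsets.US_ASCII));
        // prints "48 65 6C 6C 6F"

        // Two bytes per character, low byte first, as in the second match.
        System.out.println(hex("Hello", StandardCharsets.UTF_16LE));
        // prints "48 00 65 00 6C 00 6C 00 6F 00"
    }
}
```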

edited Jan 27 '16 at 09:53

answered Jan 27 '16 at 08:34

Yes, it's in ASCII (or rather UTF-8) in the constant pool of the compiled .class file. – yole Jan 27 '16 at 08:36
Alright, so this answer, in conjunction with @yole's comment above (the IntelliJ debugger is not necessarily showing the size of each char), answers the question. Thank you! – Yoaz Menda Jan 27 '16 at 08:46

score 2 · Answer 2 · answered Jan 27 '16 at 10:00

The internal representation of String is not specified; it is an implementation detail, so you cannot rely on it. It is very likely that JDK 9 will change it to a dual encoding (Latin-1 for strings that can be encoded in Latin-1, UTF-16 for all other strings). See JEP 254 (Compact Strings) for details. This feature is already integrated in the OpenJDK master codebase, so if you are using Java 9 early access builds, the string will actually occupy 5 bytes.
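The dual-encoding idea can be illustrated with a rough sketch. This is not the JDK's code: the class DualEncodingSketch and its methods are hypothetical, and the only assumption taken from the description above is that a string is stored one byte per char when every char fits in Latin-1 (value at most 0xFF) and two bytes per char otherwise.

```java
public class DualEncodingSketch {
    // Hypothetical helper: true if every char fits in a single Latin-1 byte.
    public static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: character payload size under the dual-encoding scheme,
    // ignoring object headers and any other per-string overhead.
    public static int payloadBytes(String s) {
        return fitsLatin1(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes("Hello"));        // prints "5"
        System.out.println(payloadBytes("Hell\u00F6"));   // prints "5": o-umlaut is Latin-1
        System.out.println(payloadBytes("\u4F60\u597D")); // prints "4": two CJK chars need UTF-16
    }
}
```

Under this scheme "Hello" needs 5 bytes of character data, which matches the figure given above for the early access builds.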
