48

I am actually confused regarding the encoding of strings in Java. I have a couple of questions. Please help me if you know the answer to them:

1) What is the native encoding of Java strings in memory? When I write String a = "Hello" in which format will it be stored? Since Java is machine independent I don't think the system will do the encoding.

2) I read on the net that "UTF-16" is the default encoding, but I got confused because when I write int a = 'c' I get the number of the character in the ASCII table. So are ASCII and UTF-16 the same?

3) Also, I wasn't sure what the storage of a string in memory depends on: the OS or the language?

Matthias Braun
  • 32,039
  • 22
  • 142
  • 171
  • You should consider breaking these out into individual questions, as they are really very different. #2 can probably be answered here: http://stackoverflow.com/questions/1490218/utf-16-to-ascii-conversion-in-java – Ethel Evans Dec 15 '10 at 18:05

4 Answers

43
  1. Java stores strings as UTF-16 internally.

  2. "default encoding" isn't quite right. Java stores strings as UTF-16 internally, but the encoding used externally, the "system default encoding", varies from platform to platform, and can even be altered by things like environment variables on some platforms.

    ASCII is a subset of Latin 1 which is a subset of Unicode. UTF-16 is a way of encoding Unicode. So if you perform your int i = 'x' test for any character that falls in the ASCII range you'll get the ASCII value. UTF-16 can represent a lot more characters than ASCII, however.

  3. From the java.lang.Character docs:

    The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.

    So it's defined as part of the Java 2 platform that UTF-16 is used for these classes.
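
The points above can be checked with a short sketch. The class and method names (`CharValueDemo`, `codeUnit`, `encodedLength`) are made up for this illustration; only `Charset.defaultCharset()`, `StandardCharsets`, and `String.getBytes(Charset)` are standard library calls. It shows that a `char` in the ASCII or Latin-1 range has the same numeric value it has in those tables, and that the number of bytes only gets decided when the string is encoded for the outside world.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharValueDemo {
    // Widening a char to int yields the value of its UTF-16 code unit.
    static int codeUnit(char c) {
        return c;
    }

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        System.out.println(codeUnit('x'));       // 120, same as in ASCII
        System.out.println(codeUnit('\u00e9'));  // 233, same as in Latin-1
        System.out.println(codeUnit('\u20ac'));  // 8364, euro sign, outside ASCII and Latin-1

        // The platform default is an external setting; it varies between systems.
        System.out.println(Charset.defaultCharset());

        // The same one-char string needs a different number of bytes per encoding.
        System.out.println(encodedLength("\u20ac", StandardCharsets.UTF_8));    // 3
        System.out.println(encodedLength("\u20ac", StandardCharsets.UTF_16BE)); // 2
    }
}
```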

Laurence Gonsalves
  • 137,896
  • 35
  • 246
  • 299
  • The usage of char and char arrays is only defined for the public, external API for String and StringBuffer. The internal storage of the characters is implementation specific. – jarnbjo Dec 15 '10 at 20:24
  • @jarnbjo The above is a direct quote from the docs. The `char` datatype in Java represents a UTF-16 code unit (not a character, aka Unicode codepoint) so I think it's pretty safe to say that Java the language's representation of text is UTF-16. Yes, conceivably an implementation could choose to do something different under the covers, but in the end they'd have to make it look just like they were using UTF-16. – Laurence Gonsalves Dec 16 '10 at 00:33
  • Since there is no way to access the internal storage of the String and StringBuffer classes, it makes no sense to assume that the statement you quote applies to it. – jarnbjo Dec 16 '10 at 09:41
  • 2
    UTF-16BE or UTF-16LE ? – Hendy Irawan Oct 01 '17 at 13:10
  • 3
    @HendyIrawan Java doesn't let you access the individual bytes, only the chars (which correspond to UTF-16 code units), so there is no set endianness. The actual byte order used in memory is JVM/platform dependent, just like the byte order used to store an int in memory. – Laurence Gonsalves Oct 01 '17 at 15:34
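
To illustrate the byte-order point from the comments above: a minimal sketch, with hypothetical names (`ByteOrderDemo`, `bigEndian`, `littleEndian`), showing that byte order only appears once a string is explicitly encoded to bytes. The `char` value seen by Java code is the same either way.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteOrderDemo {
    // Encode with the high byte of each code unit first.
    static byte[] bigEndian(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    // Encode with the low byte of each code unit first.
    static byte[] littleEndian(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(bigEndian("c")));     // [0, 99]
        System.out.println(Arrays.toString(littleEndian("c")));  // [99, 0]
        // The code unit itself carries no byte order.
        System.out.println((int) "c".charAt(0));                 // 99
    }
}
```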
20

1) Strings are objects, which typically contain a char array and the string's length. The character array is usually implemented as a contiguous array of 16-bit words, each one containing a Unicode character in native byte order.

2) Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent. Thus 'c', which is U+0063, becomes 0x0063, or 99.

3) Since each String is an object, it contains other information than its class members (e.g., class descriptor word, lock/semaphore word, etc.).
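
Point 2 can be tried directly. This is only a sketch; `CharToIntDemo` and `describe` are names invented for the example. It relies on the implicit widening conversion from `char` to `int` and on the narrowing cast back.

```java
import java.util.Locale;

public class CharToIntDemo {
    // Formats a char as its Unicode notation and decimal value.
    static String describe(char c) {
        int value = c;  // implicit widening conversion, no cast needed
        return String.format(Locale.ROOT, "U+%04X = %d", value, value);
    }

    public static void main(String[] args) {
        System.out.println(describe('c'));  // U+0063 = 99
        System.out.println((char) 99);      // c
    }
}
```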

ADDENDUM
The object contents depend on the JVM implementation (which determines the inherent overhead associated with each object), and how the class is actually coded (i.e., some libraries may be more efficient than others).

EXAMPLE
A typical implementation will allocate an overhead of two words per object instance (for the class descriptor/pointer, and a semaphore/lock control word); a String object also contains an int length and a char[] array reference. The actual character contents of the string are stored in a second object, the char[] array, which in turn is allocated two words, plus an array length word, plus as many 16-bit char elements as needed for the string (plus any extra chars that were left hanging around when the string was created).

ADDENDUM 2
The claim that one char represents one Unicode character holds only in most cases. It would imply UCS-2 encoding, which was accurate before 2005. Unicode has since grown, and Strings are now encoded as UTF-16, where a single Unicode character may occupy two chars in a Java String.
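
A small sketch of the two-chars-per-character case described in this addendum. The class name `SurrogateDemo` and its helper methods are invented for the example; U+1F600 is just one convenient code point outside the 16-bit range.

```java
public class SurrogateDemo {
    // U+1F600 does not fit in 16 bits, so UTF-16 needs a surrogate pair for it.
    static final String EMOJI = new String(Character.toChars(0x1F600));

    // Number of UTF-16 code units (what String.length() reports).
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points (what a reader would call characters).
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charCount(EMOJI));                            // 2
        System.out.println(codePointCount(EMOJI));                       // 1
        System.out.println(Character.isHighSurrogate(EMOJI.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(EMOJI.charAt(1)));   // true
        System.out.println(Integer.toHexString(EMOJI.codePointAt(0)));   // 1f600
    }
}
```

Code that walks a string character by character is therefore safer iterating over code points (for example with `codePointAt` and `Character.charCount`) than over `char` indices.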

Take a look at the actual source code for Apache's implementation, e.g. at:
http://www.docjar.com/html/api/java/lang/String.java.html

towi
  • 21,587
  • 28
  • 106
  • 187
David R Tribble
  • 11,918
  • 5
  • 42
  • 52
  • Actually what do you intend to say in your 3) part. It contains other information so .... ?? –  Dec 15 '10 at 19:00
  • "Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent." What's a little confusing here is that the Unicode encoding coincides with ASCII for the first 256 characters. Unicode correlates with Extended ASCII (8-bit) for the first 256 characters; Extended ASCII, in turn, corresponds directly with 7-bit ASCII for the first 128. So 'c' is encoded as 0x63 in Unicode, Extended ASCII, and ASCII. This is why you'd see the int for 'c' and think it's ASCII (which it sort of is :). – Hawkeye Parker Nov 11 '14 at 23:53
  • @HawkeyeParker: Yes, 7-bit ASCII (ISO 646) and 8-bit ISO 8859-1 (Latin-1) are proper subsets of Unicode. That being said, Java encodes all character values as 16-bit Unicode. – David R Tribble Sep 16 '15 at 15:56
  • absolutely. I was just clarifying for those who might be confused by the overlap. – Hawkeye Parker Oct 09 '15 at 21:37
7

While this doesn't answer your question, it is worth noting that in the Java byte code (class file), strings are stored in UTF-8 (more precisely, in a modified form of UTF-8). http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html
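
The class-file format uses a modified UTF-8 rather than standard UTF-8. As a rough illustration only, `DataOutputStream.writeUTF` is documented to produce modified UTF-8 as well, so it can stand in for what a class file would hold; using it as a stand-in is an assumption of this sketch, not how the JVM reads class files. `ModifiedUtf8Demo` and `modifiedUtf8` are invented names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Encodes s as modified UTF-8 and drops the 2-byte length prefix writeUTF adds.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        // Plain ASCII text: one byte per char in both forms.
        System.out.println(modifiedUtf8("Hello").length);                      // 5
        System.out.println("Hello".getBytes(StandardCharsets.UTF_8).length);   // 5

        // The NUL character is where the two forms visibly differ.
        System.out.println(modifiedUtf8("\u0000").length);                     // 2
        System.out.println("\u0000".getBytes(StandardCharsets.UTF_8).length);  // 1
    }
}
```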

Ralph
  • 118,862
  • 56
  • 287
  • 383
  • 1
    @Loadmaster I believe it is useful information, and I explicitly mentioned that it is the class file, so what's your problem? – Ralph Dec 15 '10 at 18:22
  • 2
    But it doesn't answer the question. You could post it as a comment and begin with something like "While this doesn't answer your question, it is worth noting that..." This is indeed a useful piece of information, though, I had no idea they used UTF-8. What's the point? It means that JVM has to convert every string to UTF-16 on startup. – Sergei Tachenov Dec 15 '10 at 19:22
  • @Sergey Tachenov: Strings are stored as UTF-8 so that .class files are smaller (on average). – David R Tribble Dec 15 '10 at 20:50
  • This doesn't matter at all when you put them in a JAR file which you usually do. UTF-16 will be compressed almost twice as efficiently. – Sergei Tachenov Dec 16 '10 at 04:59
  • @Sergey Tachenov: It probably doesn't matter most of the time, but not everyone stores the contents of their JAR files in compressed form. Anyway, the (historical) reason I gave is what I gathered from reading about the `.class` file format. – David R Tribble Dec 17 '10 at 19:18
  • What if the `.class` file was created using `javac -encoding ISO-8859-1` option? Wouldn't all source file's content be stored in ISO-8859-1 rather than UTF-8? – parsecer Mar 01 '17 at 05:53
  • 1
    @parsecer: Oracle's documentation is quite strict about this: "encoding : Set the source file encoding name, such as EUC-JP and UTF-8". So this is only the source file (*.java) encoding; the encoding of Strings in *.class files remains UTF-8. – Ralph Mar 01 '17 at 11:25
1

Edit: thanks to LoadMaster for helping me correct my answer :)

1) All internal String processing is done in UTF-16.

2) ASCII is a subset of Unicode, so ASCII characters keep the same numeric values when stored as UTF-16 code units.

3) Internally, Java uses UTF-16. For everything else, yes, it depends on the platform.

LaGrandMere
  • 10,265
  • 1
  • 33
  • 41
  • 3
    Strings are stored internally (in memory) as `char[]`, each element containing a 16-bit UTF-16 Unicode character. UTF-8 is not used to store strings internally, but is used for converting I/O streams to/from strings. – David R Tribble Dec 15 '10 at 18:12
  • @LoadMaster: Has it changed over time? Was Java always UTF-16 internally? – LaGrandMere Dec 15 '10 at 18:27
  • 1
    Yes, `String` has always used an internal `char[]` to store its character values. – David R Tribble Dec 15 '10 at 20:48