-2

What is the meaning of the following?

String s = "some text here";
byte[] b = s.getBytes("UTF-8");

Does it mean that the content of b is now encoded in UTF-8, or that we just got the raw bytes of a string that was already encoded in UTF-8? Aren't all strings in Java encoded in UTF-16? What is Java's internal representation for String? Modified UTF-8? UTF-16?

Sometimes I see the following too:

byte ptext[] = myString.getBytes("ISO-8859-1"); 
String value = new String(ptext, "UTF-8"); 
Community
Gero

2 Answers

1

b is the sequence of bytes that represents the string "some text here" in the UTF-8 encoding. String uses UTF-16 internally. A charset is, in general, a way to convert between sequences of bytes and strings.
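To make the distinction between chars and encoded bytes concrete, here is a minimal sketch. The class name `Utf8RoundTrip`, the helper `toHex`, and the sample string are assumptions chosen for illustration (the sample contains a non-ASCII character so that the char count and the byte count differ); they are not from the question.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Hypothetical helper: formats each byte as two hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo"; // "hello" with an e-acute (U+00E9), which is outside ASCII

        // Encode the string's characters into UTF-8 bytes.
        byte[] b = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(s.length()); // 5 chars (UTF-16 code units)
        System.out.println(b.length);   // 6 bytes: U+00E9 takes two bytes in UTF-8
        System.out.println(toHex(b));   // 68 c3 a9 6c 6c 6f

        // Decoding with the same charset gives back an equal string.
        System.out.println(new String(b, StandardCharsets.UTF_8).equals(s)); // true
    }
}
```

Using the `StandardCharsets` constants instead of the string name "UTF-8" also avoids the checked `UnsupportedEncodingException` that `getBytes(String)` declares.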

byte ptext[] = myString.getBytes("ISO-8859-1"); 
String value = new String(ptext, "UTF-8"); 

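This looks like a hack to repair text that was decoded with the wrong charset. ISO-8859-1 maps every byte value to exactly one char, so re-encoding with it recovers the original bytes, which are then decoded as UTF-8. It only works when that is exactly the mistake that happened, and it is generally inadvisable; it is better to decode with the correct charset in the first place.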
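A small sketch of the situation in which this two-line pattern does something useful, assuming the string was produced by decoding UTF-8 bytes as ISO-8859-1. The class and method names (`MojibakeRepair`, `misdecode`, `repair`) and the sample text are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Simulates the assumed bug: UTF-8 bytes decoded with the wrong charset.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // The pattern from the question: recover the raw bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";        // "cafe" with an e-acute
        String garbled = misdecode(original); // the two UTF-8 bytes of U+00E9 become two chars
        String repaired = repair(garbled);

        System.out.println(garbled.length());          // 5
        System.out.println(repaired.equals(original)); // true
    }
}
```

If the input was never mis-decoded in this way, `repair` can corrupt it instead, since chars above U+00FF cannot be encoded in ISO-8859-1 and are replaced.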

Louis Wasserman
1

A Java String is, conceptually, a sequence of char values (historically stored as a char[]). Each char is 16 bits and holds one UTF-16 code unit; most Unicode characters fit in a single char, but characters outside the Basic Multilingual Plane need two (a surrogate pair). When you need a byte array for that String, you have to tell the JVM which encoding to use. The getBytes(Charset) method lets you do that. The no-argument getBytes() method simply uses Charset.defaultCharset(). Depending on the encoding you choose (and you should choose the correct one), you may get a different number of bytes.

You can read more here: Byte Encodings and Strings.
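As an illustration of how the byte count depends on the chosen charset, here is a minimal sketch. The class `ByteCounts`, the helper `byteCount`, and the euro-sign sample are assumptions picked for the demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCounts {
    // Hypothetical helper: number of bytes the string occupies in the given charset.
    static int byteCount(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // the euro sign, a single char

        System.out.println(byteCount(euro, StandardCharsets.UTF_8));    // 3
        System.out.println(byteCount(euro, StandardCharsets.UTF_16BE)); // 2

        // ISO-8859-1 has no euro sign, so getBytes substitutes the
        // charset's replacement byte ('?') and the original character is lost.
        System.out.println(byteCount(euro, StandardCharsets.ISO_8859_1)); // 1
    }
}
```

The last case is why choosing the correct charset matters: the call succeeds, but the result no longer round-trips to the original string.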

omerkudat
  • Java strings are UTF-16 encoded. Only _some_ Unicode characters can be encoded as a single `char` element in a string. Others must be encoded as a _surrogate pair_. Try this out for fun: `System.out.println("".length());` – Solomon Slow Sep 08 '15 at 21:35
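The surrogate-pair point in the comment above can be sketched as follows. The character used here (U+1F600, written with escapes) is an assumption chosen for illustration, not necessarily the one in the original comment; the class name `SurrogateDemo` and the helper `codePoints` are likewise hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // Hypothetical helper: counts Unicode code points rather than UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane,
        // so UTF-16 needs a surrogate pair (two chars) to represent it.
        String s = "\uD83D\uDE00";

        System.out.println(s.length());                                // 2 chars
        System.out.println(codePoints(s));                             // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0)));    // true
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8
    }
}
```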