
OS default encoding: UTF-8

Converting a UTF-8 str to UTF-16 in Python 2:

utf8_str = "Hélô"  # type(utf8_str) is str; its bytes are encoded in UTF-8
unicode_str = utf8_str.decode("UTF-8")  # type(unicode_str) is unicode
utf16_str = unicode_str.encode("UTF-16")  # type(utf16_str) is str; its bytes are encoded in UTF-16

As you can see, unicode is the bridge for converting a UTF-8 str to a UTF-16 str, and that is easy to understand.

But in Java, I am confused about the conversion:

String utf16Str = "Hélô"; // String encoded in UTF-16 (as I understand it)
byte[] bytes = utf16Str.getBytes("UTF-8"); // byte array encoded in UTF-8; getBytes calls an encode method internally
String newUtf16Str = new String(bytes, "UTF-8"); // String encoded in UTF-16 again

There is no explicit decode step and no unicode type. So what happens in this process?

expoter
  • Java is open source so if you wanted to, you could just look at the code and see what it actually does. It's already on your filesystem! – Jeroen Vannevel Jul 15 '16 at 08:20
  • I know, but it would be easier to understand the source code with some advice from an expert. – expoter Jul 15 '16 at 08:23
  • The fact that strings are stored in UTF-16 by the JVM internally (which by the way is not necessarily the case) is somewhat irrelevant. From the language perspective, a String is a String and does not have an encoding - you can convert between a String and a byte[] using the two methods you have used (which are standard encoding/decoding operations). What is your real question? – assylias Jul 15 '16 at 08:28
  • @assylias, your answer is very helpful. I thought a String in Java was encoded in UTF-16. If it has no encoding, then the conversion using getBytes makes sense. One more question: if a String has no encoding, is it represented as Unicode code points, and what is the relationship between UTF-16 and String? After all, a str in Python has an encoding (UTF-8, ASCII, etc.). – expoter Jul 15 '16 at 08:42
  • Yes, a string is a sequence of Unicode code points, which can be encoded in a number of encodings such as UTF-16 or UTF-8. This may be a helpful post: http://stackoverflow.com/a/33358306/829571 – assylias Jul 15 '16 at 09:00
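
To make the comments above concrete, here is a minimal Java sketch of the same three steps as the Python snippet: getBytes plays the role of encode, and the String(byte[], Charset) constructor plays the role of decode, with String itself standing in for Python 2's unicode type. The class name EncodingRoundTrip and the helper names toUtf8, fromUtf8 and utf8ToUtf16 are made up for illustration and are not part of any library. The sample text is written with \u escapes so that the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingRoundTrip {

    // "encode": abstract text (String) -> bytes in UTF-8
    public static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // "decode": bytes in UTF-8 -> abstract text (String)
    public static String fromUtf8(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // UTF-8 bytes -> UTF-16 bytes, going through String as the bridge,
    // just as the Python 2 snippet goes through unicode.
    public static byte[] utf8ToUtf16(byte[] utf8Bytes) {
        String text = fromUtf8(utf8Bytes);              // decode
        return text.getBytes(StandardCharsets.UTF_16);  // encode (big-endian, with a byte-order mark)
    }

    public static void main(String[] args) {
        String text = "H\u00e9l\u00f4"; // "Hélô"
        byte[] utf8 = toUtf8(text);
        byte[] utf16 = utf8ToUtf16(utf8);
        System.out.println("UTF-8 length:  " + utf8.length);
        System.out.println("UTF-16 length: " + utf16.length);
        System.out.println("Round trip ok: " + text.equals(fromUtf8(utf8)));
    }
}
```

Seen this way, the Java code in the question does contain both steps: getBytes("UTF-8") is the encode and new String(bytes, "UTF-8") is the decode. How the JVM stores the characters internally does not change what either call returns.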
