
A related question, "Android default character encoding", mentions that the "default character encoding" for Android is UTF-8 and that strings are in UTF-16. A user, Virus721, asked about this in the comments, but there was no proper reply.

The Charset documentation also mentions this. It says that the "native character encoding" for Java is UTF-16.

What is the difference between "default character encoding" and "native character encoding"? In the context of Android and Java, why does the documentation say that UTF-8 is the "default character encoding" and UTF-16 is the "native character encoding"?

1 Answer


Java String objects are always encoded as UTF-16. (*) This is the "native character encoding".
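As a small illustration of what UTF-16 as the native encoding means in practice, the sketch below shows that String.length() counts UTF-16 code units rather than characters. The class name Utf16Demo and its helper methods are made up for this example; only the String and Character calls are standard API.

```java
public class Utf16Demo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts as one).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which UTF-16 stores as a surrogate pair.
        String s = "a\uD83D\uDE00";
        System.out.println(codeUnits(s));   // 3
        System.out.println(codePoints(s));  // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```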

When converting text to a byte stream, some specific encoding must be chosen, and different operating systems and their configurations have different preferences for which one to use.

Java introduces the concept of "default character encoding" which tries to represent "the character encoding that the underlying operating system considers the default".

On Android that "default character encoding" is UTF-8 (luckily this is an increasingly common default).
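The default can be queried at run time. The following is a minimal sketch; the class name DefaultCharsetDemo and the helper defaultCharsetName are made up for illustration. The printed values depend on the platform and JVM configuration, so no output is shown.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    // Canonical name of the charset the runtime treats as the default.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Platform-dependent on desktop JVMs; UTF-8 on Android, as noted above.
        System.out.println(defaultCharsetName());
        // The system property mentioned in the comments; may be null on some runtimes.
        System.out.println(System.getProperty("file.encoding"));
    }
}
```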

Java APIs (and thus Android APIs which are built on top of or using Java APIs) often use the default character encoding whenever a String needs to be converted to a byte stream (such as when writing to a file or a network connection) and no explicit character encoding is provided.
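To make the difference concrete, here is a hedged sketch comparing String.getBytes() with no argument (which uses the default character encoding) against calls that name the encoding explicitly. EncodeDemo and its helper methods are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodeDemo {
    // Explicit encodings: the result is the same on every platform.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] encodeUtf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    // No charset given: falls back to the default character encoding.
    static byte[] encodeDefault(String s) {
        return s.getBytes();
    }

    public static void main(String[] args) {
        String s = "\u00E9"; // U+00E9, LATIN SMALL LETTER E WITH ACUTE
        System.out.println(Arrays.toString(encodeUtf8(s)));  // [-61, -87]
        System.out.println(Arrays.toString(encodeUtf16(s))); // [0, -23]
        // Platform-dependent: matches the UTF-8 line wherever the default is UTF-8.
        System.out.println(Arrays.toString(encodeDefault(s)));
    }
}
```

Passing the charset explicitly, as in encodeUtf8, avoids depending on whatever the default happens to be on a given system.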

(*) Well, there are caveats and exceptions, but those are not usually user-visible. For example, JDK 9 supports compact strings, where String objects that contain only ISO-8859-1-encodable characters actually store only 8 bits per character instead of 16. However, this optimization (as well as a similar one implemented in newer Android versions) doesn't change the return values of any String methods, so it is transparent to developers.

Joachim Sauer
  • Thanks. Accepting this as the answer. Just out of curiosity: is this notion (native vs. default encoding) universal or Java-specific? That is, the native encoding for .NET is UTF-16 and for Rust is UTF-8, but do they also have an equivalent of System.getProperty("file.encoding") to use whenever a String needs to be converted to a byte stream? – Jayadevan Vijayan Dec 13 '19 at 15:39
  • @JayadevanVijayan: I don't know either system well enough to know for sure, but I wouldn't be surprised if they had something similar. In the C/UNIX world the equivalent of the "default character encoding" would be to get the encoding of the currently configured locale, so even if a language/system doesn't have a direct "default character encoding" there would certainly be a way to query the current locale. – Joachim Sauer Dec 13 '19 at 16:37