
From java.lang.StringCoding:

String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;

This is what String.getBytes() ends up using. On Linux with JDK 7, I was always under the impression that UTF-8 is the default charset?

Thanks

Amnon
  • It does not - see answers below – Mr_and_Mrs_D Apr 05 '13 at 20:06
  • The default encoding is hard to predict: it differs between CentOS 6 and CentOS 7, and between Oracle JDK and OpenJDK. You should NEVER rely on the default charset. I don't understand why someone would expect UTF-8 even if it's so popular; I believe Java uses UTF-16 internally. – Boris Treukhov Jul 01 '16 at 11:56

4 Answers


It is a bit complicated ...

Java tries to use the default character encoding when String.getBytes() converts a string to bytes.

  • The default charset is taken from the file.encoding system property.
  • It is cached, so changing the property via System.setProperty(..) after the JVM has started has no effect.
  • If the file.encoding property does not map to a known charset, UTF-8 is used instead.
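The caching described above can be checked with a small sketch. The class and method names are made up for illustration, and the alternative charset it switches to is an arbitrary choice. It reads the default, overwrites file.encoding at runtime, and reads the default again:

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the JDK.
class DefaultCharsetDemo {

    // Returns true if overwriting file.encoding at runtime left the
    // default charset untouched.
    static boolean defaultSurvivesPropertyChange() {
        Charset before = Charset.defaultCharset();
        String original = System.getProperty("file.encoding");
        // Pick a value that differs from the current default (illustrative choice).
        String other = "UTF-8".equalsIgnoreCase(before.name()) ? "ISO-8859-1" : "UTF-8";
        try {
            System.setProperty("file.encoding", other);
            return before.equals(Charset.defaultCharset());
        } finally {
            // Put the property back so the demo has no lasting side effect.
            if (original != null) {
                System.setProperty("file.encoding", original);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("unchanged after setProperty: " + defaultSurvivesPropertyChange());
    }
}
```

To actually change the default, pass the property on the command line when the JVM starts (for example -Dfile.encoding=UTF-8) rather than setting it from code.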

.... Here is the tricky part (which is probably never going to come into play) ....

If the system cannot decode or encode strings using the default charset (UTF-8 or another one), then there will be a fallback to ISO-8859-1. If the fallback does not work ... the system will fail!

.... Really ... (gasp!) ... Could it crash if my specified charset cannot be used, and UTF-8 or ISO-8859-1 are also unusable?

Yes. The Java source comments state in the StringCoding.encode(...) method:

// If we can not find ISO-8859-1 (a required encoding) then things are seriously wrong with the installation.

... and then it calls System.exit(1)


So, why is there an intentional fallback to ISO-8859-1 in the getBytes() method?

It is possible, although not probable, that the user's JVM may not support decoding and encoding in UTF-8 or the charset specified on JVM startup.

Then, is the default charset used properly in the String class during getBytes()?

No. However, the better question is ...


Does String.getBytes() deliver what it promises?

The contract as defined in the Javadoc is correct.

The behavior of this method when this string cannot be encoded in the default charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
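As a rough illustration of what "more control" means, here is a sketch that encodes with a CharsetEncoder configured to report problems instead of hiding them. The class and method names are invented for the example, and ISO-8859-1 is only an illustrative target. It uses getBytes(Charset), whose Javadoc documents silent replacement, for comparison:

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
class StrictEncodeDemo {

    // Returns true if every character of s can be encoded in ISO-8859-1.
    static boolean encodesCleanly(String s) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for unmappable characters or malformed surrogate pairs.
            return false;
        }
    }

    // Lenient path: getBytes(Charset) substitutes the charset's replacement bytes.
    static byte[] lenientBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // EURO SIGN, not representable in ISO-8859-1
        byte[] lenient = lenientBytes(euro);
        System.out.println("lenient: " + lenient.length + " byte(s), first = " + lenient[0]);
        System.out.println("strict accepts euro: " + encodesCleanly(euro));
        System.out.println("strict accepts abc:  " + encodesCleanly("abc"));
    }
}
```

With the strict encoder the caller decides what to do about a bad character; with getBytes the data is quietly altered.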


The good news (and better way of doing things)

It is always advisable to explicitly specify "ISO-8859-1", "US-ASCII", "UTF-8", or whatever character set you want when converting bytes into Strings or vice versa, unless you have previously obtained the default charset and made 100% sure it is the one you need.

Use this method instead:

public byte[] getBytes(String charsetName)

To find the default for your system, just use:

Charset.defaultCharset()
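A minimal sketch of the explicit approach, assuming Java 7 or later so that the StandardCharsets constants are available (the class and method names are illustrative only). Using the getBytes(Charset) overload with these constants also avoids the checked UnsupportedEncodingException that getBytes(String charsetName) declares:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
class ExplicitCharsetDemo {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "café", written with an escape to keep the source ASCII-only
        System.out.println("platform default: " + Charset.defaultCharset());
        // The same string yields different byte counts depending on the charset:
        System.out.println("UTF-8 bytes:      " + utf8Length(s));   // 5 (the accented e takes two bytes)
        System.out.println("ISO-8859-1 bytes: " + latin1Length(s)); // 4
    }
}
```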

Hope that helps.

The Coordinator
  • If you follow the flow of getBytes() (no charset supplied), you will see that it tries to fetch the default charset and, if none is found, falls back to "UTF-8". But as you can see from the code above, there is different logic in StringCoding that defaults to ISO-8859-1 when no charset is supplied. That's a conflict. I know you can pass the charset; the question was why it does not default to UTF-8. – Amnon Oct 01 '12 at 11:48
  • That behaviour is specified in the javadoc. I will amend my answer to post it clearly. – The Coordinator Oct 01 '12 at 18:10
  • It's not :) That's my point. The javadoc states: "Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array." And that's where the problem is: there are scenarios where Java will not use the default charset. – Amnon Oct 02 '12 at 06:22
  • You have a good point. Because if there is a default Charset, then it should use that for converting. You, my friend, have found a bug! – The Coordinator Oct 02 '12 at 08:19
  • Maybe delete this last comment of yours then ? :D – Mr_and_Mrs_D Apr 17 '14 at 15:50
  • Let it be known, for the record, that this is not a bug! – The Coordinator May 04 '14 at 07:03
  • "The behavior of this method when this string cannot be encoded in the default charset is unspecified.". Isn't this a problem? If the data is user-controlled, sending invalid UTF-8 could take down the application. – bcoughlan Aug 12 '14 at 13:35
  • @bcoughlan No, an invalid sequence will only throw an Exception. However, the lack of a default encoding will actually stop the app dead. Very improbable, unless the system was hacked and bugged to start with. – The Coordinator Oct 01 '14 at 08:24

The parameterless String.getBytes() method doesn't use ISO-8859-1 by default. It will use the default platform encoding, if that can be determined. If, however, that's either missing or is an unrecognized encoding, it falls back to ISO-8859-1 as a "default default".

You should very rarely see this in practice. Normally the platform default encoding will be detected correctly.

However, I'd strongly suggest that you specify an explicit character encoding every time you perform an encode or decode operation. Even if you want the platform default, specify that explicitly.
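For the "even if you want the platform default" case, a short sketch (hypothetical class and method names) of how the intent can be spelled out in code rather than left implicit:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class, not part of the JDK.
class ExplicitDefaultDemo {

    // Same charset as the no-argument getBytes(), but the choice is visible to the reader.
    static byte[] encodeWithPlatformDefault(String s) {
        return s.getBytes(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        byte[] explicit = encodeWithPlatformDefault("hello");
        byte[] implicit = "hello".getBytes();
        System.out.println("same bytes: " + Arrays.equals(explicit, implicit));
    }
}
```

The result is the same as calling getBytes() with no argument; the gain is that a later reader can see the platform default was a deliberate choice and not an oversight.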

Jon Skeet

That's for compatibility reasons.

Historically, all Java methods on Windows and Unix that did not specify a charset used the common one at the time, that is, "ISO-8859-1".

As mentioned by Isaac and the javadoc, the default platform encoding is used (see Charset.java):

    public static Charset defaultCharset() {
        if (defaultCharset == null) {
            synchronized (Charset.class) {
                String csn = AccessController.doPrivileged(
                    new GetPropertyAction("file.encoding"));
                Charset cs = lookup(csn);
                if (cs != null)
                    defaultCharset = cs;
                else
                    defaultCharset = forName("UTF-8");
            }
        }
        return defaultCharset;
    }

Always specify the charset when doing string to bytes or bytes to string conversion.

This holds even where, as with String.getBytes(), a non-deprecated method that takes no charset still exists (most of them were deprecated when Java 1.1 appeared). Just as with endianness, the platform format is irrelevant; what matters is the specification of the storage format.
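To see why the storage format is what matters, here is a small sketch (illustrative class and method names) in which bytes written as UTF-8 are read back once with the right charset and once with the wrong one:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
class StorageFormatDemo {

    // Writer and reader agree on the storage format: the text survives.
    static String roundTrip(String s) {
        byte[] stored = s.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.UTF_8);
    }

    // Reader assumes a different charset than the writer used: the text is corrupted.
    static String decodeWrong(String s) {
        byte[] stored = s.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println("round trip intact: " + roundTrip(original).equals(original));
        // The two UTF-8 bytes 0xC3 0xA9 are misread as two separate Latin-1 characters.
        System.out.println("wrong decode length: " + decodeWrong(original).length()); // 2
    }
}
```

The same corruption happens when the writer and the reader each rely on their own platform default and the two defaults differ.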

Denys Séguret
  • Not entirely true. On IBM's OS/390 (later named z/OS), text files are encoded in EBCDIC and not ASCII; therefore the default platform encoding there wasn't ISO-8859-1, but some EBCDIC-based encoding (say EBCDIC 0037). – Isaac Sep 30 '12 at 07:41
  • AFAIK methods not taking a charset are not deprecated; they should just use the default charset, no? I understand it's probably "legacy" code, but shouldn't this be a bug though? – Amnon Oct 01 '12 at 11:53

Elaborating on Skeet's answer (which is of course the correct one):

In java.lang.String's source, getBytes() calls StringCoding.encode(char[] ca, int off, int len), whose first line is:

String csn = Charset.defaultCharset().name();

Then (not immediately, but inevitably) it calls static byte[] StringCoding.encode(String charsetName, char[] ca, int off, int len), which is where the line you quoted comes from, passing csn as the charsetName. So on that line charsetName will be the default charset's name, if one exists.
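The call chain can be modelled with a simplified sketch. This is not the JDK source: the class and method names are invented, it works on String instead of char[], and it throws an exception where the answers above say the real code calls System.exit(1). It only shows how the null check and the default-charset lookup fit together:

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical model of the flow described above, not the real StringCoding.
class FallbackSketch {

    // Models the variant that takes a charset name; null means "no name supplied".
    static byte[] encodeNamed(String charsetName, String s)
            throws UnsupportedEncodingException {
        String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
        return s.getBytes(csn);
    }

    // Models the variant behind the no-argument getBytes().
    static byte[] encodeDefault(String s) {
        String csn = Charset.defaultCharset().name(); // never null
        try {
            return encodeNamed(csn, s);
        } catch (UnsupportedEncodingException e) {
            // Default charset unusable: fall through to ISO-8859-1.
        }
        try {
            return encodeNamed("ISO-8859-1", s);
        } catch (UnsupportedEncodingException e) {
            // Assumption for this sketch: signal failure with an exception.
            throw new IllegalStateException("ISO-8859-1 charset not available", e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(Arrays.toString(encodeDefault("abc")));
        System.out.println(Arrays.toString(encodeNamed(null, "abc")));
    }
}
```

In this model the ISO-8859-1 branch of the null check is only reached when a caller passes null directly; the default path always supplies a name.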

Mr_and_Mrs_D