String.getBytes() in different default charsets

Question

Is it safe to use String.getBytes()? What happens when a program runs on different systems with different default charsets? I suppose I could get a byte[] with different content? Is it possible to define a preferred charset in Java 1.4?

Why are you using such an absurdly ancient version of Java? – Matt Ball Sep 30 '13 at 15:42

score 18 · Answer 1 · edited May 23 '17 at 10:28

Is it safe to use String.getBytes()?

No. You should always use the overload which specifies the charset; ideally using UTF-8 everywhere. If you were using a modern version of Java, your code could use StandardCharsets for Good Clean Living.
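
As a minimal sketch of that advice (the class and method names here are made up for illustration, and `StandardCharsets` needs Java 7 or later, so this does not apply to 1.4 as-is), encoding and decoding with an explicit charset could look like this:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original answer.
class ExplicitCharsetExample {
    // Encode with an explicitly chosen charset, so the result does not
    // depend on the platform default.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same charset that was used for encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeUtf8("caf\u00e9"); // "café"
        System.out.println(Arrays.toString(bytes));
        System.out.println(decodeUtf8(bytes).equals("caf\u00e9"));
    }
}
```

The point is only that both sides name the same charset; which helper shape you use is a matter of taste.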

What happens when a program runs on different systems with different default charsets?

Your code risks interpreting character data with the wrong encoding, resulting in broken/incorrect strings (for example: "ÃƒÂ®", "ÃƒÂ", "ÃƒÂ¼") and/or replacement characters (�).
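
To make the failure mode concrete, here is a small sketch (hypothetical class and method names, Java 7+ for `StandardCharsets`) in which the producer and the consumer of a `byte[]` disagree about the charset. It is an illustration of the mismatch, not a reproduction of any particular system's defaults:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class MojibakeExample {
    // Producer encodes as UTF-8, consumer wrongly decodes as ISO-8859-1:
    // each byte of a multi-byte sequence becomes its own character.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Producer encodes as ISO-8859-1, consumer wrongly decodes as UTF-8:
    // bytes that are not valid UTF-8 are replaced with U+FFFD.
    static String latin1ReadAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00fc"; // "ü"
        System.out.println(utf8ReadAsLatin1(original)); // two garbage characters
        System.out.println(latin1ReadAsUtf8(original)); // contains the replacement character
    }
}
```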

Is it possible to define a preferred charset in Java 1.4?

No. The platform-default is, by definition, dictated by the platform, not your app.

answered Sep 30 '13 at 15:40

Matt Ball

`String.getBytes()` doesn't interpret the data - it *encodes* rather than *decoding*. Your answer looks more appropriate for the `String(byte[])` constructor. – Jon Skeet Sep 30 '13 at 15:45
By "your code," I meant the code which consumes the `byte[]` returned by `getBytes()`, but I can see what you mean given my answer's wording. – Matt Ball Sep 30 '13 at 15:48

Jon Skeet · Answer 2 · 2013-09-30T15:47:05.423

Is it safe to use String.getBytes()?

It depends on what you mean by "safe". It will do exactly what you're trying to do.

What happens when a program runs on different systems with different default charsets? I suppose I could get a byte[] with different content?

Yes. Often you won't spot any difference if your string only contains ASCII, but even then there can be significant differences - e.g. in UTF-16 each character will take two bytes.
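
A quick sketch of that size difference (hypothetical names, Java 7+): the same ASCII-only string is encoded with a few standard charsets and the resulting lengths are compared. `UTF_16BE` is used for the "two bytes per character" case because Java's plain `UTF_16` encoder also writes a two-byte byte-order mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class EncodedLengthExample {
    // Number of bytes the given text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "hello";
        System.out.println(encodedLength(text, StandardCharsets.US_ASCII)); // 5
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));    // 5
        System.out.println(encodedLength(text, StandardCharsets.UTF_16BE)); // 10
        System.out.println(encodedLength(text, StandardCharsets.UTF_16));   // 12 (includes BOM)
    }
}
```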

Is it possible to define a preferred charset in Java 1.4?

Not that I'm aware of. I don't know of a standard system property for this, for example. There may well be one for the specific implementation you're using, of course. It depends on your context. (You could set the file.encoding system property on the command line, for example. Whether or not that will affect the default encoding depends on the VM. It's not listed in System.getProperties.)

I would personally always specify the encoding you want to use, using the overloads which take a charset name or a Charset. On the rare occasions where you actually want to use the system default, just specify that explicitly (e.g. with Charset.defaultCharset).
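
For the "I really do want the system default" case, a sketch along these lines makes the intent visible in the code (hypothetical names; note that `Charset.defaultCharset()` was added in Java 5 and `getBytes(Charset)` in Java 6, so this is not usable on 1.4 itself):

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original answer.
class DefaultCharsetExample {
    // Same bytes as text.getBytes(), but the dependence on the platform
    // default is spelled out instead of being implicit.
    static byte[] encodeWithPlatformDefault(String text) {
        Charset platformDefault = Charset.defaultCharset();
        return text.getBytes(platformDefault);
    }

    public static void main(String[] args) {
        // What the default is depends on the JVM and its environment.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println(encodeWithPlatformDefault("abc").length);
    }
}
```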

score 1 · Answer 3 · edited Sep 30 '13 at 15:43

JavaDoc for getBytes():

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

As Matt Ball said, it's best to specify the charset every time, using getBytes(Charset charset).

answered Sep 30 '13 at 15:43

telkins

score 1 · Answer 4 · answered Sep 30 '13 at 22:01

Answer to question 1: It is safe in the sense that the world will not cease to exist if you use it. If what you want is to get the string's bytes reliably, then it is safe as long as you use the overload that specifies the character encoding.

Answer to question 2: If you proceed correctly and specify your preferred character encoding (for example UTF-8), then nothing special happens.

Answer to question 3: Characters are encoded differently in different character encodings, so their byte representation depends heavily on the encoding used. You can therefore get different byte arrays for the same message if more than one character encoding is in play. This is why it is highly advisable to specify your character encoding; if you do, you will not have such issues.

Answer to question 4: It should be possible, but I am not a user of Java 1.4, so I am not able to test this for you.

Stephen C · Answer 5 · 2013-09-30T15:58:24.450

Is it safe to use String.getBytes()?

Under some circumstances, yes. For instance, it is (probably) safe if you know that the String's encoded form is only going to be used on the current host.

What happens when a program runs on different systems with different default charsets?

It depends:

If the Strings only contain characters whose encoding is the same across the different character sets, then nothing will go wrong. For instance, if you only used simple (roman) letters and digits and "ordinary" punctuation, then it wouldn't matter if the default charset was ASCII, LATIN-1 or UTF-8.
If the encoded string data is created and consumed on the same system, then you should be OK too.
It is only a problem if the data is interchanged between systems. In that case, you could end up using the wrong encoding, which will result in garbling when the encoded characters are decoded.
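
The first case above can be checked with a small sketch (hypothetical names, Java 7+ for `StandardCharsets`): it compares the bytes produced by US-ASCII, ISO-8859-1 (LATIN-1) and UTF-8 for a given string. Only these three charsets are compared; other defaults, such as UTF-16 or EBCDIC variants, would differ even for plain letters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
class SameBytesExample {
    // True if the text encodes to identical bytes in US-ASCII,
    // ISO-8859-1 and UTF-8.
    static boolean sameBytesInCommonCharsets(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, latin1) && Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        System.out.println(sameBytesInCommonCharsets("Hello, world 123!")); // true
        System.out.println(sameBytesInCommonCharsets("caf\u00e9"));         // false
    }
}
```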

I suppose I could get a byte[] with different content? Is it possible to define a preferred charset in Java 1.4?

If you know that the content encoding should be different to the default encoding, then you should use byte[] getBytes(Charset charset) or byte[] getBytes(String charsetName).
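
Of those two overloads, `getBytes(String charsetName)` is the one that already existed in Java 1.4 (the `Charset` overload was added in Java 6). It throws the checked `UnsupportedEncodingException`, so a small wrapper can be convenient. The following is only a sketch with made-up names:

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, not part of the original answer.
class CharsetNameExample {
    // Encode using a charset given by name; turns the checked exception
    // into an unchecked one so callers do not need a try/catch.
    static byte[] encode(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encode("abc", "UTF-8");
        System.out.println(bytes.length); // 3
    }
}
```

"UTF-8" is one of the charset names every Java platform is required to support, so the exception path should not trigger for it in practice.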
