Is specifying String encoding when parsing byte[] really necessary?

Question

Supposedly, it is "best practice" to specify the encoding when creating a String from a byte[]:

byte[] bytes; // assigned elsewhere
String a = new String(bytes, "UTF-8"); // 100% safe
String b = new String(bytes); // safe enough

If I know my installation has a default encoding of UTF-8, is it really necessary to specify the encoding to still be "best practice"?

A very good answer to this question is given as a corollary of: http://stackoverflow.com/questions/12659417/why-does-javas-string-getbytes-uses-iso-8859-1 — Dean Povey, Jan 01 '14 at 08:05
You can set the default charset for the JVM; please refer to http://stackoverflow.com/a/623036/3131537 – Darshan Patel, Jan 01 '14 at 08:48

score 3 · Answer 1 · answered Jan 01 '14 at 08:01

Different use cases have to be distinguished here: If you get the bytes from an external source via some protocol with a specified encoding then always use the first form (with explicit encoding).

If the source of the bytes is the local machine, for example a local text file, the second form (without explicit encoding) is better.

Always keep in mind that your program may be used on a different machine with a different platform encoding. It should work there without any changes.
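To illustrate why the platform encoding matters, here is a minimal sketch (the class name `DefaultCharsetDemo` and the helper `decode` are made up for this example). It decodes the same two bytes with two explicit charsets and then with the no-argument constructor, whose result depends on whatever `Charset.defaultCharset()` happens to be on the machine running it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Hypothetical helper: decodes bytes with an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00E9 (e with acute accent) is the two bytes 0xC3 0xA9.
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};

        // Decoded as UTF-8: one character (U+00E9).
        System.out.println(decode(bytes, StandardCharsets.UTF_8).length());
        // Decoded as ISO-8859-1: two characters (U+00C3, U+00A9).
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).length());

        // No explicit charset: the outcome depends on the platform default.
        System.out.println(new String(bytes).length()
                + " (default charset: " + Charset.defaultCharset() + ")");
    }
}
```

The first two lines of output are the same everywhere; the third can differ from machine to machine, which is the portability concern described above.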

Stephen C · Accepted Answer · 2014-01-04T01:51:31.180

1

If I know my installation has a default encoding of UTF-8, is it really necessary to specify the encoding to still be "best practice"?

But do you know for sure that your installation will always have a default encoding of UTF-8? (Or at least, for as long as your code is used ...)

And do you know for sure that your code is never going to be used in a different installation that has a different default encoding?

If the answer to either of those is "No" (and unless you are prescient, it probably has to be "No") then I think that you should follow best practice ... and specify the encoding if that is what your application semantics requires:

If the requirement is to always encode (or decode) in UTF-8, then use "UTF-8".
If the requirement is to always encode (or decode) using the platform default, then do that.
If the requirement is to support multiple encodings (or the requirement might change) then make the encoding name a configuration (or command line) parameter, resolve to a Charset object and use that.
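The three cases above might look roughly like this in code. This is only a sketch: the class `DecodingChoices`, its method names, and the `charsetName` parameter are invented for illustration, and it assumes Java 7 or later for `StandardCharsets`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodingChoices {
    // Case 1: the requirement is always UTF-8, so say so in the code.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Case 2: the requirement really is "whatever the platform uses".
    // Passing Charset.defaultCharset() makes that choice visible to readers.
    static String decodePlatformDefault(byte[] bytes) {
        return new String(bytes, Charset.defaultCharset());
    }

    // Case 3: the encoding name comes from configuration or the command line
    // (charsetName is a hypothetical parameter). Charset.forName throws an
    // unchecked exception if the name is illegal or unsupported.
    static String decodeConfigured(byte[] bytes, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);
        // Take the charset name from the first argument, falling back to UTF-8.
        String name = args.length > 0 ? args[0] : "UTF-8";
        System.out.println(decodeUtf8(bytes));
        System.out.println(decodePlatformDefault(bytes));
        System.out.println(decodeConfigured(bytes, name));
    }
}
```

Resolving the name to a `Charset` once, up front, also means a misconfigured encoding name fails early rather than at the first decode deep inside the program.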

The point of this "best practice" recommendation is to avoid a foreseeable problem that will arise if your platform's characteristics change. You don't think that is likely, but you probably can't be completely sure about it. But at the end of the day, it is your decision.

(The fact that you are actually thinking about whether "best practice" is appropriate to your situation is a GOOD THING ... in my opinion.)

edited Jan 04 '14 at 01:51

answered Jan 01 '14 at 11:05

Stephen C


IMO even if the code would never be used on any platform other than the one with a default encoding of UTF-8, it would still be worthwhile to specify it: it is more readable and clearer that way, and one really should consider the readers of the code when writing it. – eis Jan 02 '14 at 22:59
Your readability point is debatable. You are trading off "clarity" versus "redundant code". (I'm inclined to agree with you though, but that's just my personal bias.) My reason for not mentioning this is that the fragility / portability issue that underlies the "best practice" recommendation is much stronger. – Stephen C Jan 02 '14 at 23:36
but if the answer is "yes" to both, then you're OK with leaving it out. – Bohemian Jan 03 '14 at 11:34
If you can prove to me that you are prescient, I'll be willing to trust those "yes" answers. Then I'll be happy with leaving it out. :-) – Stephen C Jan 03 '14 at 12:02
Seriously, this comes down to a trade-off between convenience and guarding against something that *might or might not happen*. It is not really relevant whether >>I<< am OK with it. As a professional, it is really up to you to decide, in context. ("Best practice" dogma is only a guide.) – Stephen C Jan 04 '14 at 01:50
