3

What is the difference between

new String(s.getBytes("UTF-8"), "UTF-8");

and

new String(s.getBytes(), "UTF-8");

With the first code example, some of the special characters are decoded correctly. Why, and what is the difference?

And will it affect anything if I encode and then decode with UTF-8 like this?

jeff porter
  • 7
    1) `s.getBytes()` uses your JVM's default charset. So the difference is potentially nothing, depending on the contents of the string and that default charset. – Andy Turner Jan 15 '19 at 12:08
  • 6
    I would suggest taking a step back: what are you attempting to achieve here? – Jon Skeet Jan 15 '19 at 12:09
  • 2
    Try to avoid `s.getBytes()`, especially when you are dealing with characters outside the ASCII range, such as Arabic, Chinese, or Hindi text. `s.getBytes()` uses the platform's default encoding, whereas I suggest you always use `UTF-8`, a compact Unicode encoding that can represent every character. – Pushpesh Kumar Rajwanshi Jan 15 '19 at 12:12
  • @JonSkeet I am trying to decode a UTF-8 encoded string, and some of the characters fail to decode and show up as ? (question mark) instead. So I used String(s.getBytes("UTF-8"),"UTF-8"); and then it works fine – Jala Sureshreddy Jan 15 '19 at 12:12
  • `new String(s.getBytes(enc1), enc2)` will convert the original String `s` to a byte array using `enc1` (which in case of `getBytes()` would be the default encoding) and create a new string from that byte array which is decoded using `enc2`. That means that if the encodings are different there's a chance that some characters that are different/not supported by both lead to different results. - That said, creating a new string like that would be redundant anyways. If you get a byte array to create `s` then apply the correct encoding there. – Thomas Jan 15 '19 at 12:13
  • 3
    There's no such thing as a "UTF-8 encoding string". A string is just a string. If you started with a byte array, *then* it would make sense to convert that into a string using `new String(bytes, Charsets.UTF_8)`. Please provide more information in the question about what data you've got and how you received it. – Jon Skeet Jan 15 '19 at 12:17
  • And please note: don't forget about accepting an answer at some point; it looks like you rarely do that, though ... – GhostCat Jan 15 '19 at 12:17
  • 1
    @JalaSureshreddy: Adding onto what Jon Skeet said: If `s` is a `String`, that means it is already decoded. If it contains `?` characters due to incorrect decoding, it's too late to fix those: no amount of encoding and decoding the string you have will bring back the string you started with. You need to find where the string is being decoded the *first* time and use the correct charset there. – Daniel Pryden Jan 15 '19 at 12:19
  • I appreciate the quick comeback! And welcome to upvote privileges, which allow you to show your appreciation for other answers, too! – GhostCat Jan 18 '19 at 11:01

3 Answers

5

From the javadoc:

For getBytes():

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

Whereas, getBytes(Charset) says:

Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.

So the second version gives you full control over the encoding, while the first relies on the platform default charset.
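The difference between the two overloads can be sketched as follows. This is a minimal illustration, not part of the javadoc; the class name `DefaultCharsetDemo` and its method names are made up for the example. What the first call prints depends on the JVM it runs on, which is exactly the point.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class DefaultCharsetDemo {

    // Encodes with whatever charset this JVM uses as its default.
    static byte[] encodeWithDefault(String s) {
        return s.getBytes();
    }

    // Encodes with an explicitly chosen charset; same bytes on every platform.
    static byte[] encodeWithUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "café"
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("getBytes():      " + Arrays.toString(encodeWithDefault(s)));
        System.out.println("getBytes(UTF_8): " + Arrays.toString(encodeWithUtf8(s)));
    }
}
```

If the two printed arrays differ, the platform default is not UTF-8, and any code that mixes `getBytes()` with an explicit "UTF-8" decode will misbehave for non-ASCII text.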

That is all there is to this.

For that "platform default", see here for example. Note that people are asking to make UTF-8 the default across the board (see here).

GhostCat
  • My platform default is also UTF-8, but String(s.getBytes(),"UTF-8") is unable to decode some special characters, while String(s.getBytes("UTF-8"),"UTF-8") is able to decode them. Why is that? – Jala Sureshreddy Jan 15 '19 at 12:21
  • 1
    Then you should probably ask a new question, and give an exact [mcve] that shows your problem ;-) – GhostCat Jan 15 '19 at 12:35
3

So, you are asking about these two lines:

String s1 = new String(s.getBytes("UTF-8"), "UTF-8"); // line 1
String s2 = new String(s.getBytes(), "UTF-8"); // line 2

Neither of these lines does anything useful. Line 2 is even worse than line 1: it might be not just useless but wrong, depending on the default character encoding of your system.

Line 1 effectively does nothing. It encodes the string s into bytes using the UTF-8 character encoding, and then immediately decodes the bytes back into a string using UTF-8. For any well-formed string (one without unpaired surrogates), s1 will contain exactly the same characters as the original; the encoding and decoding is useless.

What line 2 does depends on the default character encoding used on your system. If the default character encoding is UTF-8, then it does exactly the same as line 1. If it is anything other than UTF-8, then you can get an incorrectly decoded string.

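Suppose that the default character encoding of your system is ISO-8859-1. Then line 2 encodes the string using ISO-8859-1, and then immediately decodes the result as if it were UTF-8, which is wrong. You get a string with incorrectly decoded characters: the String constructor does not throw on malformed input but silently replaces it with the replacement character U+FFFD. (Characters that ISO-8859-1 cannot represent are already lost at the encoding step, where they become '?'.)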
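The ISO-8859-1 scenario can be reproduced on any machine by naming the charsets explicitly instead of relying on the default. This is a small sketch; the class `MismatchedCharsetDemo` and its method names are invented for the illustration, and `StandardCharsets.ISO_8859_1` stands in for the assumed platform default.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class MismatchedCharsetDemo {

    // Line 1: encode and decode with the same charset.
    static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Line 2 as it behaves when the platform default is ISO-8859-1:
    // encode with one charset, decode with another.
    static String latin1ThenUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "café"

        // The round trip gives back the original string.
        System.out.println(roundTripUtf8(s).equals(s)); // true

        // 'é' is the single byte 0xE9 in ISO-8859-1, which is not valid UTF-8
        // on its own, so the decoder substitutes U+FFFD.
        System.out.println(latin1ThenUtf8(s).equals("caf\uFFFD")); // true
    }
}
```

Pure ASCII input survives both versions unchanged, which is why this kind of bug often goes unnoticed until the first accented or non-Latin character shows up.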

Read the API documentation of the methods you're using (String.getBytes() and the String(byte[], String) constructor) to understand exactly what they do.

Jesper
1

The two examples you included in your question are nonsense.

A Java String is, conceptually, a sequence of UTF-16 code units. Once a byte[] has been converted to a String, it is too late to declare that the byte[] held UTF-8 encoded data.
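To see why it is too late, here is a small sketch under an assumed scenario: UTF-8 bytes are first decoded with the wrong charset (US-ASCII is picked here purely for illustration), and then the question's encode/decode trick is applied to the damaged string. The class `TooLateDemo` and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class TooLateDemo {

    // Wrong: treats UTF-8 bytes as US-ASCII. Every byte >= 0x80 becomes U+FFFD.
    static String decodeWrong(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.US_ASCII);
    }

    // The "repair" from the question: it only round-trips the damaged string.
    static String attemptRepair(String damaged) {
        return new String(damaged.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Right: decode the original bytes with the charset they were written in.
    static String decodeRight(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] received = {99, 97, 102, (byte) 0xC3, (byte) 0xA9}; // "café" in UTF-8

        String damaged = decodeWrong(received);
        System.out.println(damaged.equals("caf\uFFFD\uFFFD"));      // true
        System.out.println(attemptRepair(damaged).equals(damaged)); // true, nothing recovered
        System.out.println(decodeRight(received).equals("caf\u00e9")); // true
    }
}
```

The original bytes 0xC3 0xA9 are gone from the damaged string, so no combination of `getBytes` and `new String` can bring the 'é' back; the fix has to happen where the bytes are first decoded.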

If you receive a byte[] and want to store it as a String, then it makes sense to do this:

//assume input byte[] kapow
String blammy = new String(kapow, StandardCharsets.UTF_8);

If you have a String value and want to write it to something as a byte[] with UTF-8 encoding, then this makes sense:

// assume input String blammy 
byte[] kapow = blammy.getBytes(StandardCharsets.UTF_8);

Notice that in both cases I used the (blah, Charset) version of the method. Do this. The (blah, "UTF-8") versions declare a checked exception (UnsupportedEncodingException). The (blah, Charset) versions declare no checked exception, and the StandardCharsets class guarantees that the charsets it defines exist (from the StandardCharsets JavaDoc page):

Constant definitions for the standard Charsets. These charsets are guaranteed to be available on every implementation of the Java platform.
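The practical difference between the two overloads shows up at the call site. The following is a brief sketch; the class `CharsetOverloadDemo` and its method names are made up for the comparison, and wrapping the exception in `IllegalStateException` is just one possible choice.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class CharsetOverloadDemo {

    // String-name overload: the compiler forces you to handle
    // UnsupportedEncodingException, even though UTF-8 is always available.
    static byte[] encodeByName(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    // Charset overload: no checked exception, and no charset name to mistype.
    static byte[] encodeByConstant(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding counterpart, also using the Charset overload.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String blammy = "caf\u00e9"; // "café"
        byte[] kapow = encodeByConstant(blammy);
        System.out.println(decode(kapow).equals(blammy)); // true
    }
}
```

Both encoding methods produce the same bytes; the constant-based version simply removes the dead `catch` block and turns a misspelled charset name into a compile-time error.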

DwB