Will the particular CharSet affect the binary (or integer) value associated with each byte?

Example:

String original = foo(); // makes string

byte[] utf8Bytes = original.getBytes("UTF8"); // CharSet is UTF8
byte[] defaultBytes = original.getBytes(); // default CharSet 

Will utf8Bytes[1] always equal defaultBytes[1] from a binary/integer point of view?

Kevin Meredith
  • It's good practice to *always* specify the charset. And if for some reason you really want the default charset, use Charset.defaultCharset() to make your intention clear. – dnault Apr 24 '13 at 20:48

1 Answer

It can affect the values, because the bytes produced depend on the charset, and UTF-8 is not the default on all JVMs. It's a good idea to always name the charset explicitly, e.g. getBytes("UTF-8"), so that the encoding is consistent.

For example, call getBytes("UTF-8") and getBytes("UTF-16") on the same string and compare the results: for mostly ASCII text the latter will have about twice as many bytes as the former, plus a 2-byte byte-order mark.
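A minimal sketch of that comparison, assuming a two-character sample string ("hé") chosen here for illustration. The class name `CharsetBytesDemo` and the `hex` helper are hypothetical, and ISO-8859-1 is included only as a stand-in for a non-UTF-8 default charset:

```java
import java.nio.charset.StandardCharsets;

class CharsetBytesDemo {
    // Formats each byte as an unsigned two-digit hex value, separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Illustrative input: "hé" (one ASCII character, one non-ASCII character).
        String original = "h\u00E9";

        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf16 = original.getBytes(StandardCharsets.UTF_16);

        System.out.println("UTF-8:      " + hex(utf8));   // 68 C3 A9
        System.out.println("ISO-8859-1: " + hex(latin1)); // 68 E9
        System.out.println("UTF-16:     " + hex(utf16));  // FE FF 00 68 00 E9
    }
}
```

Here the byte at index 0 is the same in UTF-8 and ISO-8859-1 because 'h' is ASCII, but the byte at index 1 differs in all three encodings. Passing a `StandardCharsets` constant instead of a charset name also avoids the checked `UnsupportedEncodingException`.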

Zim-Zam O'Pootertoot
  • If a method returns a `byte[]`, how do I know its `CharSet`? Example: `public static byte[] sha(byte[] data)` http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/digest/DigestUtils.html#sha%28byte[]%29 – Kevin Meredith Apr 24 '13 at 20:48
  • @Kevin That sha() method operates on an array of bytes. CharSets are irrelevant when operating on byte arrays (until you convert them to a String). – dnault Apr 24 '13 at 20:49
  • According to [this thread](http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream), [this library](http://code.google.com/p/juniversalchardet/) can help - it's not guaranteed to be able to detect the encoding, but it usually can. – Zim-Zam O'Pootertoot Apr 24 '13 at 20:51
  • @Kevin You specified the charset when you converted a String to a byte array. But let's back up... what are you really trying to do? – dnault Apr 24 '13 at 20:52
  • @dnault Compare these two: `byte[] stringBytes = someString.getBytes("UTF-8")` and the result of `byte[] bytes = sha1(someBytes)`. I'm getting different values when I compare each particular byte, but I expect them to match. That's why I'm curious whether I'm using the right encoding for `someBytes` – Kevin Meredith Apr 24 '13 at 20:53
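
The mismatch described in the last comment would be expected regardless of charset: a SHA-1 digest is a fixed-size hash of the input bytes, not a re-encoding of them. A rough sketch of the difference, using the JDK's `java.security.MessageDigest` rather than the commons-codec `DigestUtils` linked above; the class name `DigestVsEncodingDemo`, its `sha1` helper, and the sample string are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class DigestVsEncodingDemo {
    // Hypothetical helper: hashes raw bytes with SHA-1. No charset is involved at this step.
    static byte[] sha1(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String someString = "hello"; // illustrative input

        // The charset matters here, when the String becomes bytes.
        byte[] stringBytes = someString.getBytes(StandardCharsets.UTF_8);

        // The digest is a hash of those bytes: always 20 bytes for SHA-1.
        byte[] digest = sha1(stringBytes);

        System.out.println(stringBytes.length);                 // 5
        System.out.println(digest.length);                      // 20
        System.out.println(Arrays.equals(stringBytes, digest)); // false
    }
}
```

To check whether two inputs match, compare digest with digest (for example with `Arrays.equals` or `MessageDigest.isEqual`), and make sure both strings were converted to bytes with the same charset before hashing.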