1

I have run into a Java String for which the following is false:

body.equals(new String(body.getBytes()));

I suppose this is because the String constructor treats the body byte[] as UTF-8 by default, but I'm not 100% sure. How can I store this string in a byte[] and convert it back later? I suppose I need to be able to determine what encoding the byte[] is in. How would I do this?

Some context: I need the byte[] so I can compress the data, store it in a db, and later uncompress it and turn the uncompressed byte[] back into the original string. The string originally comes from a library that downloaded a webpage, and I'm not sure what processing it does on the string before handing it to me.

  • possible duplicate of [What is character encoding and why should I bother with it](http://stackoverflow.com/questions/10611455/what-is-character-encoding-and-why-should-i-bother-with-it) – Raedwald Apr 10 '15 at 12:30

3 Answers

2

When no charset is specified, the platform default charset is used both to encode (`getBytes()`) and to decode (`new String(byte[])`).

The problem is that this charset might be limited, e.g. US-ASCII. If a char in the string is outside that charset, it is lost when the string is encoded.

Use a charset that covers all Unicode chars, e.g. UTF-8 or UTF-16.
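
To illustrate the point about limited charsets, here is a minimal sketch. The class name `CharsetLoss`, the helper `roundTrip`, and the sample string are made up for this example; US-ASCII stands in for a limited platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original question's code.
class CharsetLoss {
    // Encode the string to bytes and decode it again with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String body = "caf\u00e9"; // contains one non-ASCII char (e with acute accent)

        // US-ASCII cannot represent that char, so the encoder substitutes
        // its replacement byte and the original char is gone.
        System.out.println(body.equals(roundTrip(body, StandardCharsets.US_ASCII))); // false

        // UTF-8 can encode every Unicode code point, so the string survives.
        System.out.println(body.equals(roundTrip(body, StandardCharsets.UTF_8))); // true
    }
}
```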

irreputable
1

Just make sure that you use the same charset both ways - when creating the byte array from the String and when creating the String from the byte array.

So your example would be better as:

body.equals(new String(body.getBytes("utf-8"), "utf-8"));

This will guarantee, no matter what the environment, that the bytes will be understood.

You should also, almost unquestionably, be using a Unicode encoding. If you choose a single-byte encoding (e.g. an ISO code page), you will likely regret it in the future, even if a single-byte encoding satisfies your needs right now.
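
For the compress-and-store scenario in the question, the same idea might look like the sketch below. It assumes GZIP as the compression format (the question does not say which one is used), and the class and method names are hypothetical. `StandardCharsets.UTF_8` (Java 7+) is used instead of the `"utf-8"` string so that no `UnsupportedEncodingException` needs handling.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; GZIP is an assumption, any byte-level compressor works the same way.
class CompressedText {
    // Encode as UTF-8, then gzip the bytes. The result is what would go into the db.
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Gunzip the bytes, then decode with the same charset used in compress().
    static String decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String body = "<p>caf\u00e9 \u2603</p>"; // sample page fragment with non-ASCII chars
        byte[] stored = compress(body);
        System.out.println(body.equals(decompress(stored)));
    }
}
```

Because the charset is fixed in both methods, the result does not depend on which machine wrote the row and which one reads it back.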

joelittlejohn
  • This should already be the case as the documentation for the constructor and `getBytes` both say they will use the default charset which won't change once the VM has started and cached what the default charset is. – Dunes Oct 16 '12 at 22:10
  • @Dunes, true, although I was assuming that the actual example line of code never appears anywhere in a real application - it's simply a short line that shows both the correct constructor to use and the correct `getBytes` method to call. In practice, I expect these two calls are separated by time and a round trip to a persistent store. In which case, it's far safer (in any environment/platform) to supply the charset in both calls and not rely on the platform default. You're absolutely right though that you'd never need to do so if you genuinely were using this exact line in production. – joelittlejohn Oct 17 '12 at 08:54
1

When converting between bytes and characters without specifying an encoding, the behavior is platform-dependent. The default encoding is used, which is JVM-wide and depends on your system. I don't know exactly what will happen if the encoding is ASCII and you have some non-ASCII characters, but I know you will get a different string. You need to specify the encoding every time you convert to avoid this.
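
To see which default a given JVM actually uses, a quick check along these lines may help. The class and method names are invented for this sketch, and the output varies by platform and JVM settings, so none is shown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical diagnostic class for inspecting the JVM-wide default charset.
class DefaultCharsetCheck {
    // Name of the charset that getBytes() and new String(byte[]) fall back to.
    static String defaultName() {
        return Charset.defaultCharset().name();
    }

    // True if encoding with the default gives the same bytes as explicit UTF-8.
    static boolean defaultMatchesUtf8(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultName());
        System.out.println("Same bytes as UTF-8: " + defaultMatchesUtf8("caf\u00e9"));
    }
}
```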

John Watts