5

I need to create a hash from a String containing a user's password. To create the hash, I use a byte array that I get by calling String.getBytes(). But when I call this method with a specified encoding (such as UTF-8) on a platform where this is not the default encoding, the non-ASCII characters get replaced by a default character (if I understand the behaviour of getBytes() correctly). On such a platform I would therefore get a different byte array, and eventually a different hash.

Since Strings are internally stored in UTF-16, will calling String.getBytes("UTF-16") guarantee me that I get the same byte array on every platform, regardless of its default encoding?

Luiggi Mendoza
Jardo
  • Yes, but what happened when you tried it? – Elliott Frisch Sep 16 '14 at 19:11
  • @ElliottFrisch: I don't know whether you meant that comment to be tongue-in-cheek, but "try it on every Java platform in the world" isn't really a viable approach. – Jon Skeet Sep 16 '14 at 19:16
  • @JonSkeet A little. A simple yes wouldn't fit (and isn't 100% accurate, as you noted). But OP did mention getting different results on some platforms with `getBytes()` (so OP could have tested there). Also, "UTF-16 LE" is distinct from "UTF-16" - but would be consistent if consistently used. – Elliott Frisch Sep 16 '14 at 19:21
  • @ElliottFrisch: It would be entirely feasible for UTF-16 to always use UTF-16, but the endianness to depend on the platform. That would be a bad way of specifying it, but feasible. Just because the OP was getting different results with the *default* encoding on some systems doesn't mean that such a difference would show up. Basically, I think it's a reasonable question. – Jon Skeet Sep 16 '14 at 19:22
  • @JonSkeet In the interest of full disclosure, my comment was very nearly: "Yes, but you'll have to wait for Jon Skeet to explain why." – Elliott Frisch Sep 16 '14 at 19:26
  • @ElliottFrisch: LOL. Pre-emptive, but accurate in this case ;) – Jon Skeet Sep 16 '14 at 19:29
  • I am not sure where you get the idea that the bytes returned for UTF-8 would be different depending on the platform, because they wouldn't. You might want to show some code that makes you see those differences. – Mark Rotteveel Sep 16 '14 at 19:42
  • @Mark Rotteveel I wasn't sure whether UTF-8 would work or not, I used it just as an example. But I knew there are problems with other encodings, so I figured UTF-8 probably causes problems too because there has to be some conversion. But again, that was just an example, not a claim. – Jardo Sep 17 '14 at 11:40
  • this whole question is pointless. use UTF-8 and you have no endian issues. – jtahlborn Sep 17 '14 at 14:44

3 Answers

4

Yes. Not only is it guaranteed to be UTF-16, but the byte order is defined too:

When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.

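(Note that the BOM is part of the encoder's output, so String.getBytes("UTF-16") does include it: the array starts with the bytes 0xFE 0xFF. Use "UTF-16BE" or "UTF-16LE" if you want a fixed byte order without a BOM.)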

So long as you have the same string content - i.e. the same sequence of char values - then you'll get the same bytes on every implementation of Java, barring bugs. (Any such bug would be pretty surprising, given that UTF-16 is probably the simplest encoding to implement in Java...)

The fact that UTF-16 is the native representation for char (and usually for String) is only relevant in terms of ease of implementation, however. For example, I'd also expect String.getBytes("UTF-8") to give the same results on every platform.
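To make the point concrete, here is a minimal sketch of hashing with an explicitly named charset. The class name `PasswordBytes`, its helper methods, and the sample password are hypothetical, and SHA-256 is used only as an illustrative digest, not as a recommendation for password storage. Note that the "UTF-16" array begins with the BOM bytes FE FF, which is harmless for hashing as long as the same charset is used everywhere.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class PasswordBytes {

    // Encode with an explicit charset so the platform default is never consulted.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // "UTF-16" encodes big-endian and prepends the BOM bytes FE FF.
    static byte[] utf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // Illustrative digest over the UTF-8 bytes of the password.
    static byte[] sha256(String password) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(utf8(password));
        } catch (NoSuchAlgorithmException e) {
            // Java platforms are required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Lowercase hex rendering of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String password = "p\u00e4ssw\u00f6rd"; // hypothetical input with non-ASCII characters
        System.out.println(hex(utf8("\u00e9")));   // c3a9
        System.out.println(hex(utf16("\u00e9")));  // feff00e9
        System.out.println(hex(sha256(password))); // same value on every platform
    }
}
```

Whichever charset you pick, the important part is that it is named explicitly wherever the hash is computed or verified.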

Jon Skeet
  • Once more implementations don't seem to agree with specifications... :/ https://github.com/facebook/conceal/issues/138 – helios Apr 07 '16 at 16:54
1

It is true: Java uses Unicode internally, so it can combine any script or language. String and char use UTF-16, but .class files store their String constants in UTF-8. In general it is irrelevant what String does internally, because the conversion to bytes specifies the encoding the bytes have to be in.

If the target encoding cannot represent some of the Unicode characters, a placeholder character or question mark is substituted. Fonts might not have all Unicode characters either; 35 MB is a normal size for a full Unicode font. You might then see a square with 2x2 hex codes or so for missing code points, or on Linux another font might substitute the character.
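The substitution described above can be observed with a small sketch. The class `ReplacementDemo` and its helpers are hypothetical names introduced for illustration; the charsets used are the standard constants from `StandardCharsets`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReplacementDemo {

    // Encode s in the given charset; unmappable characters are silently replaced.
    static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    // True if every character of s has a mapping in the given charset.
    static boolean canEncode(String s, Charset charset) {
        return charset.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // U+00E9, not representable in US-ASCII
        String euro = "\u20ac";   // U+20AC, not representable in ISO-8859-1

        // Both lossy conversions yield the single byte 0x3F, i.e. '?'.
        System.out.println(encode(eAcute, StandardCharsets.US_ASCII)[0] == '?');
        System.out.println(encode(euro, StandardCharsets.ISO_8859_1)[0] == '?');

        // Checking up front avoids hashing a silently altered password.
        System.out.println(canEncode(euro, StandardCharsets.ISO_8859_1)); // false
        System.out.println(canEncode(euro, StandardCharsets.UTF_8));      // true
    }
}
```

Because the replacement is silent, two different passwords can collapse to the same bytes in a lossy charset, which is one more reason to prefer a Unicode encoding here.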

Hence UTF-8 is a perfectly fine choice.

String s = ...;
if (!s.startsWith("\uFEFF")) { // Add a Unicode BOM
    s = "\uFEFF" + s;
}
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

Both UTF-16 (in both byte orders) and UTF-8 are always present in the JRE, whereas some other Charsets are not. Hence you can use a constant from StandardCharsets and do not need to handle any UnsupportedEncodingException.

In the code above I added a BOM, especially so that Windows Notepad recognizes the text as UTF-8. It certainly is not good practice, but it is a small help here.
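As a side note on the StandardCharsets constants: the following sketch contrasts looking a charset up by name with using a constant. The class `CharsetLookup`, its method names, and the bogus charset name are hypothetical and only serve the illustration.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class CharsetLookup {

    // Lookup by name: the compiler forces us to handle UnsupportedEncodingException.
    static byte[] byName(String s, String charsetName) {
        try {
            return s.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    // Lookup by constant: no checked exception, and no risk of a misspelled name.
    static byte[] byConstant(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Both calls produce the same bytes for the same input.
        System.out.println(java.util.Arrays.equals(byName("abc", "UTF-8"), byConstant("abc")));
    }
}
```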

There is no disadvantage to UTF-16LE or UTF-16BE. I think UTF-8 is a bit more universally used, as UTF-16 also cannot store all Unicode code points in 16 bits. Text in Asian scripts would be more compact in UTF-16, but HTML pages are already more compact in UTF-8 because of the HTML tags and other Latin script.

For Windows UTF-16LE might be more native.

Problems with placeholders might still occur on non-Unicode platforms, especially Windows.

Joop Eggen
  • Yes, also Windows. The basic problem is that bytes must be interpreted with the correct encoding. Windows uses a country-group-specific single-byte encoding. The problem is that UTF-8 or UTF-16 must be recognized by the reading programs as needing special treatment. Hence the BOM. This will not work in every program. An HTML file works best, as it can specify the charset. **If your program runs locally, you could do `s.getBytes()` using the local default encoding.** – Joop Eggen Sep 17 '14 at 08:12
  • your answer says *esoecuakky*. – Elliott Frisch Sep 17 '14 at 11:00
  • @ElliottFrisch thanks "especially", learning blind typing is definitely one of the best decisions in my life, but sometimes ... (Here the right hand was misplaced after I moved the mouse) – Joop Eggen Sep 17 '14 at 11:06
0

I just found this:

https://github.com/facebook/conceal/issues/138

which seems to answer your question in the negative.

As per Jon Skeet's answer: the specification is clear. But I guess Android/Mac implementations of Dalvik/JVM don't agree.

helios