In Java, a String instance doesn't have an encoding. It just is - it represents the characters as characters, and therefore, there is no encoding.
Encoding just isn't a thing except in transition: When you 'transition' a bunch of characters into a bunch of bytes, or vice versa - that operation cannot be performed unless a charset is provided.
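Those two transitions, each with an explicit charset, look like this (a minimal sketch; the class and variable names are just for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Transitions {
    public static void main(String[] args) {
        // characters -> bytes: a charset must be chosen
        byte[] bytes = "TestData".getBytes(StandardCharsets.UTF_8);

        // bytes -> characters: again, a charset must be chosen
        String back = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(back); // prints: TestData
    }
}
```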
Take, for example, your snippet. It is broken. You write:
"TestData".getBytes()
This compiles. That is unfortunate; this is an API design error in Java; you should never use these methods (that'd be: methods that silently paper over the fact that a charset IS involved). This IS a transition from characters (a String) to bytes. If you read the javadoc on the getBytes() method, it'll tell you that the 'platform default encoding' will be used. This means it's a fine formula for writing code that passes all tests on your machine and will then fail at runtime.
There are valid reasons to want the platform default encoding, but I -strongly- encourage you to never use getBytes() regardless. If you run into one of these rare scenarios, write "TestData".getBytes(Charset.defaultCharset()) so that your code makes explicit that a charset-using conversion is occurring here, and that you intended it to be the platform default.
So, going back to your question: There is no such thing as a UTF-16 string (if 'string' here is to be taken as meaning java.lang.String, and not a slang English term meaning 'sequence of bytes').
There IS such a thing as a sequence of bytes representing unicode characters encoded in UTF-16 format. In other words, 'a UTF-16 string', in Java, would look like byte[]. Not String.
Thus, all you really need is:
byte[] utf16 = "TestData".getBytes(StandardCharsets.UTF_16);
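One detail worth knowing: Java's UTF_16 charset writes a big-endian byte-order mark when encoding, so you get two extra bytes up front. If you don't want the BOM, use UTF_16BE or UTF_16LE. A small sketch (class name is just for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        byte[] utf16 = "TestData".getBytes(StandardCharsets.UTF_16);

        // The UTF_16 encoder prepends a big-endian BOM (0xFE 0xFF),
        // so 8 characters become 2 (BOM) + 16 = 18 bytes.
        System.out.println(utf16.length); // 18

        // UTF_16BE produces the same payload without a BOM:
        byte[] be = "TestData".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(be.length); // 16

        // Decoding back with the matching charset restores the String:
        System.out.println(new String(utf16, StandardCharsets.UTF_16)); // TestData
    }
}
```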
You write:
But that doesn't work as the string literal is interpreted as UTF8.
That's a property of the code then, not of the string. If you have some code you can't change that will turn a string into bytes using the UTF8 charset, and you don't want that to happen, then find the source and fix it. There is no other solution.
In particular, trying to hack things such that you have a string with gobbledygook that has the crazy property that if you take this gobbledygook, turn it into bytes using the UTF-8 charset, and then take those bytes and turn that back into a string using the UTF-16 charset, you get what you actually wanted - cannot work. This is theoretically possible (but a truly bad idea) for charsets that have the property that every sequence of bytes is representable, such as ISO_8859_1, but UTF-8 does not adhere to that property. There are sequences of bytes that are simply malformed UTF-8: decoding them either throws an exception (with a strict CharsetDecoder) or silently substitutes the replacement character U+FFFD (with the String constructor), so the original bytes are lost either way. On the flipside, it is not possible to craft a string such that encoding it with UTF-8 produces an arbitrary desired sequence of bytes.
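You can see both behaviors in a few lines - ISO_8859_1 survives an arbitrary-bytes round trip, UTF-8 does not (a sketch; the byte values are just an example of an invalid UTF-8 sequence):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] arbitrary = { (byte) 0xC3, (byte) 0x28 }; // invalid as UTF-8

        // ISO-8859-1 maps every byte 0x00-0xFF to a character, so any
        // byte sequence survives a bytes -> String -> bytes round trip:
        String latin = new String(arbitrary, StandardCharsets.ISO_8859_1);
        byte[] latinBack = latin.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(arbitrary, latinBack)); // true

        // UTF-8 does not: 0xC3 announces a two-byte sequence, but 0x28
        // is not a valid continuation byte. The String constructor quietly
        // substitutes U+FFFD, so the original bytes are gone:
        String utf8 = new String(arbitrary, StandardCharsets.UTF_8);
        byte[] utf8Back = utf8.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(arbitrary, utf8Back)); // false
        System.out.println(utf8.charAt(0) == '\uFFFD');         // true
    }
}
```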