
Is there a way to create a UTF-16 string from scratch, or from an actual UTF-8 string, that doesn't involve some weird "hack" like looping through each char and appending a 00 byte to make it a UTF-16 char?

Ideally I would like to be able to do something like this:

String s = new String("TestData".getBytes(), StandardCharsets.UTF_16);

But that doesn't work, as the string literal is interpreted as UTF-8.

  • Does this answer your question? [How to convert UTF8 string to UTF16](https://stackoverflow.com/questions/13412174/how-to-convert-utf8-string-to-utf16), especially the second part of the accepted answer – jhamon Aug 10 '20 at 15:48
  • What are you trying to do here? Conceptually a string contains characters independent of any encoding. Encoding only becomes relevant when you convert to and from bytes. – Henry Aug 10 '20 at 15:48
  • @jhamon it is also worth mentioning the upvoted comment on this answer. One really should not do that. – Henry Aug 10 '20 at 15:51
  • @jhamon It's better than my workaround, but it still looks like quite an ugly fix for the problem. – David S Aug 10 '20 at 15:55
  • @Henry It's for an integration test I'm working on; I have a UTF-16 string in my database and I want to compare it to a hard-coded string that I create inside the test. The problem being that the hard-coded string is UTF-8, so I can't compare them; they will always fail. – David S Aug 10 '20 at 15:55
  • If you get the DB value as a string, something has already gone wrong. There is only one kind of string in Java; there is no such thing as a UTF-16 string or a UTF-8 string. – Henry Aug 10 '20 at 15:58
  • What do you mean? How is encoding not a part of Strings? If it wasn't, you would just get raw byte arrays when doing things like logging data or just printing it to a console. – David S Aug 10 '20 at 16:08
  • On a conceptual level, a string does not contain bytes but a sequence of characters. (Of course there is an internal representation, but for almost all practical purposes this is irrelevant.) Characters can be encoded to bytes (for example when written to a file). This is where encodings come in. – Henry Aug 10 '20 at 16:14
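
To make the point from these comments concrete, here is a minimal sketch (the dbBytes array is hypothetical, standing in for the raw column value): once the database bytes are decoded with the charset the database actually used, the comparison is simply String against String.

import java.nio.charset.StandardCharsets;

byte[] dbBytes = "TestData".getBytes(StandardCharsets.UTF_16); // pretend this came from the database
String fromDb = new String(dbBytes, StandardCharsets.UTF_16);  // decode with the charset the DB used
boolean same = "TestData".equals(fromDb);                      // true; no encodings involved in the comparison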

1 Answer


In Java, a String instance doesn't have an encoding. It just is: it represents characters as characters, and therefore there is no encoding.

Encoding just isn't a thing except in transition: when you 'transition' a bunch of characters into a bunch of bytes, or vice versa, that operation cannot be performed unless a charset is provided.

Take, for example, your snippet. It is broken. You write:

"TestData".getBytes().

This compiles. That is unfortunate; it is an API design error in Java, and you should never use these methods (that'd be: methods that silently paper over the fact that a charset IS involved). This IS a transition from characters (a String) to bytes. If you read the javadoc on the getBytes() method, it'll tell you that the 'platform default encoding' will be used. This makes it a fine formula for writing code that passes every test on your machine and then fails at runtime on a machine whose default charset is different.

There are valid reasons to want platform default encoding, but I -strongly- encourage you to never use getBytes() regardless. If you run into one of these rare scenarios, write "TestData".getBytes(Charset.defaultCharset()) so that your code makes explicit that a charset-using conversion is occurring here, and that you intended it to be the platform default.
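
As a rough illustration of the difference (just a sketch; the variable names are made up):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

byte[] implicitDefault = "TestData".getBytes();                          // hidden platform-default conversion; avoid
byte[] explicitDefault = "TestData".getBytes(Charset.defaultCharset());  // same bytes, but the intent is visible
byte[] alwaysUtf8      = "TestData".getBytes(StandardCharsets.UTF_8);    // fixed charset; same result on every machine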

So, going back to your question: there is no such thing as a UTF-16 string (if 'string' here is to be taken to mean java.lang.String, and not a slang English term for 'sequence of bytes').

There IS such a thing as a sequence of bytes representing Unicode characters encoded in UTF-16 format. In other words, 'a UTF-16 string', in Java, would look like byte[]. Not String.

Thus, all you really need is:

byte[] utf16 = "TestData".getBytes(StandardCharsets.UTF_16);
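
And if you later need characters again, the same charset goes into the String constructor (a sketch continuing from the line above):

String roundTripped = new String(utf16, StandardCharsets.UTF_16);
// roundTripped.equals("TestData") is true: the UTF-16-ness lived in the byte[],
// not in either String.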

You write:

But that doesn't work, as the string literal is interpreted as UTF-8.

That's a property of the code then, not of the string. If you have some code you can't change that will turn a string into bytes using the UTF-8 charset, and you don't want that to happen, then find the source and fix it. There is no other solution.

In particular, trying to hack things by crafting a gobbledygook string with the crazy property that encoding it to bytes with the UTF-8 charset, and then decoding those bytes back to a string with the UTF-16 charset, yields what you actually wanted cannot work. That trick is theoretically possible (but a truly bad idea) for charsets in which every sequence of bytes is representable, such as ISO_8859_1, but UTF-8 does not have that property: there are byte sequences that are simply malformed UTF-8, and a strict decoder will reject them with an exception. On the flip side, it is not possible to craft a string whose UTF-8 encoding is an arbitrary desired sequence of bytes.
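
A quick sketch of that failure mode (variable names are just for illustration): feeding bytes that are valid UTF-16 but not valid UTF-8 to a strict UTF-8 decoder throws.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// 0xFE 0xFF is a perfectly good UTF-16 byte order mark, but those byte values never occur in valid UTF-8.
byte[] notUtf8 = { (byte) 0xFE, (byte) 0xFF };

// Charset.newDecoder() defaults to CodingErrorAction.REPORT, so this throws
// MalformedInputException (a CharacterCodingException) instead of silently
// substituting U+FFFD the way new String(bytes, UTF_8) would.
StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(notUtf8));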

  • Thanks for the info, filled me in a bit more. One extra question though, the way you created the byte array above appended an extra four bytes before the actual data. What are those four bytes for? Edit: Ah, it was the encoding itself, when I changed it to LE or BE then it's represented as I expected :) – David S Aug 10 '20 at 16:26
  • @David that's mostly a unique thing to UTF-16, it's a 'byte order mark'. Just about every other encoding doesn't do that. On the plus side, any stream of bytes that starts with 0xFE 0xFF (or the reverse) is probably UTF-16 data. – rzwitserloot Aug 10 '20 at 16:44
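
A small sketch of the byte order mark behaviour mentioned in that last comment (output shown as Java's signed byte values):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16)));   // [-2, -1, 0, 65]  (0xFE 0xFF BOM, then big-endian 'A')
System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE))); // [0, 65]          (no BOM)
System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE))); // [65, 0]          (no BOM)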