Unfortunately, neither snippet actually works, and that's because you misunderstand UTF-16 encoding. UTF-16 CAN encode those emojis; it is NOT fixed width. There is no such thing as 'fixed width UTF-16 encoding'. There is... UCS2, which is not UTF-16. The BE part doesn't make it 'fixed width', it merely locks in the endianness. That is why both of these print the roses. Java unfortunately doesn't ship with a UCS2 encoding, which makes this job harder, and uglier.
Furthermore, both snippets call forbidden methods.
Anytime you convert bytes to characters or vice versa, character conversion IS happening. You can't opt out of that. Nevertheless, a bunch of methods exist that do not take any parameter to indicate which charset you'd like to use for that conversion. These are the forbidden methods: they default to the 'system default' charset, and look as if somebody waved a magic wand and made it possible to convert chars to bytes or vice versa without worrying about character encoding.
The solution is to never use the forbidden methods. Better yet, tell your IDE to flag them as errors. The only exceptions are where you KNOW the API defaults not to 'platform default' but to something sane - the only one I know of is the Files.* API, which defaults to UTF-8 and not platform default. So, using the charset-less variants is acceptable there.
If you truly must have platform default (sensible for command line tools only), make it explicit by passing Charset.defaultCharset().
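To make that concrete, a quick sketch (the file name is just a placeholder; the charset-less Files.readString needs Java 11+):

// Files.* is the exception: its charset-less overloads default to UTF-8, not platform default
String content = Files.readString(Path.of("example.txt"));
// if you genuinely want the platform default, say so out loud
byte[] bytes = "Röses".getBytes(Charset.defaultCharset());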
The list of forbidden methods is quite long, but new String(bytes) and string.getBytes() are both on it. Do not use these methods/constructors. Ever.
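For contrast, the same operations done the forbidden way and the acceptable way:

// forbidden: both silently use whatever the platform default happens to be
byte[] bad = "Röses".getBytes();
String alsoBad = new String(bad);
// fine: the charset is spelled out
byte[] good = "Röses".getBytes(StandardCharsets.UTF_8);
String alsoGood = new String(good, StandardCharsets.UTF_8);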
Furthermore, your first snippet is all sorts of confused. You want to ENCODE a string to UTF-16, not decode it (a string is already characters and has no encoding. It is what it is. So why are you making a decoder when there is nothing to decode?):
String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
CharBuffer input = CharBuffer.wrap(in);
CharsetEncoder utf16Encoder = StandardCharsets.UTF_16BE.newEncoder();
utf16Encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
utf16Encoder.replaceWith(" ");
ByteBuffer encoded = utf16Encoder.encode(input);
System.out.println(new String(encoded.array(), StandardCharsets.UTF16_BE));
Or, your second snippet:
@Test
public void testEncodeProblem() {
    String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
    byte[] bytes = in.getBytes(StandardCharsets.UTF_16BE);
    String res = new String(bytes, StandardCharsets.UTF_16BE);
    System.out.println(res);
}
But, as I said, both just print the roses, because those are representable in UTF_16.
So, how to get the job done? Had java had a UCS2 encoding built in, it'd be as simple as replacing StandardCharsets.UTF_16BE with StandardCharsets.UCS2, but no such luck. So, I guess... probably 'by hand':
String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
ByteArrayOutputStream out = new ByteArrayOutputStream();
in.codePoints()
    .filter(a -> a < 65536)
    .forEach(a -> {
        out.write(a >> 8);
        out.write(a);
    });
// stream is ugly, but, because codePoints() was added in a time
// when oracle had just invented the shiny hammer, they are using it
// here for smearing butter on their sandwich. Silly geese. Oh well.
byte[] result = out.toByteArray();
// java has no way of reading UCS2 either. Decoding these bytes with
// UTF-16BE happens to work, because every codepoint below 65536 encodes
// to the same 2 bytes in UCS2 and UTF-16BE; it only goes wrong if the
// original string contained unpaired surrogates.
// Let's just print the bytes and check em, by hand:
for (byte r : result) System.out.print(" " + (r & 0xFF));
System.out.println();
// For the roses string, printing with UTF-16BE does actually work;
// it would only misbehave for input containing unpaired surrogates.
System.out.println(new String(result, StandardCharsets.UTF_16BE));
yay! Success!
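For completeness, reading those UCS2 bytes back is the same trick in reverse - a minimal sketch, reusing the result array from above:

StringBuilder sb = new StringBuilder();
for (int i = 0; i < result.length; i += 2) {
    // every 'character' is exactly 2 bytes, big endian
    sb.append((char) (((result[i] & 0xFF) << 8) | (result[i + 1] & 0xFF)));
}
System.out.println(sb); // RedRöses - the roses were discarded during encoding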
NB: codePointAt could work and avoid the ugly stream here, but cPA's input isn't in 'codepoint index' but in 'char index' and that makes matters rather complicated; you'd have to increment by 2 for any surrogate pair.
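If you do want that route anyway, a sketch (reusing the in string from above; Character.charCount does the 1-or-2 bookkeeping):

ByteArrayOutputStream out = new ByteArrayOutputStream();
for (int i = 0; i < in.length(); ) {
    int cp = in.codePointAt(i);
    if (cp < 65536) {
        out.write(cp >> 8);
        out.write(cp);
    }
    i += Character.charCount(cp); // advances by 2 over a surrogate pair, 1 otherwise
}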
Some introspection on unicode, UCS2, and UTF-16:
Unicode is a gigantic table that maps any number between 0 and 1,114,111 (0x10FFFF, a bit over 20 bits' worth) to a character, control concept, currency, punctuation, emoji, box drawing, or other characteresque concept.
An encoding like UTF-8 or US_ASCII defines a translation for some, or all, of these numbers into a series of bytes, such that it can also be decoded back to a sequence of codepoints. Codepoints are commonly stored in 32 bits, because they don't fit in 16, and no architecture out there meaningfully deals in e.g. 24-bit values.
In order to accommodate UCS2/UTF-16, there are NO characters in the unicode spec from 0xD800 to 0xDFFF; that is intentional, and there never will be.
This means UCS2 and UTF-16 are more or less the same thing, with one 'trick':
For any unicode number below 65536 (so it fits in 2 bytes), the UTF-16 encoding (which, remember, CAN encode emoji and such) is just... the number. Straight up. As 2 bytes. D800-DFFF can't happen, because those codepoints are intentionally not a thing.
For anything at or above 65536, that free block of D800 to DFFF is used to produce a so-called surrogate pair: the first 'character' (2 bytes in the D800-DBFF range) stores 10 bits of data, and the second 'character' (2 bytes in the DC00-DFFF range) stores another 10, for 20 bits total; add the 0x10000 offset and that is enough to cover everything up to 0x10FFFF.
Thus, UTF-16 will encode any unicode codepoint as either 2 bytes or 4 bytes.
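As a worked example of that math, this is how the rose (codepoint U+1F339) becomes the \uD83C\uDF39 pair used in the test string - nothing library-specific, just bit fiddling:

int cp = 0x1F339;                          // the rose
int v = cp - 0x10000;                      // 0xF339, fits in 20 bits
char high = (char) (0xD800 + (v >> 10));   // top 10 bits -> 0xD83C
char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits -> 0xDF39
System.out.printf("%04X %04X%n", (int) high, (int) low); // D83C DF39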
UCS-2 as a term has mostly lost its meaning. Originally it meant exactly 2 bytes per 'character', no more and no less, and it still means that, but the meaning of 'a character' has been twisted beyond recognition: that rose? It counts as 2 characters. Try it in java - x.length() returns 2, not 1. A somewhat sane definition of UCS-2 would be: 1 char really means 1 char, each char is represented by 2 bytes, and if you try to store a char that doesn't fit (would be a surrogate pair), well, those just cannot be encoded, so crash or apply the on-unrepresentable-character placeholder. Unfortunately, that's not (always) what UCS-2 means, which gets us back to having to write any code that applies this operation (discard / replace-with-placeholder any surrogate pairs so that length-in-bytes is exactly 2 * number of codepoints) ourselves.
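You can see that twisted counting directly:

String rose = "\uD83C\uDF39";
System.out.println(rose.length());                         // 2 - counts java chars
System.out.println(rose.codePointCount(0, rose.length())); // 1 - counts actual codepoints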
Note that this surrogate pair stuff provides you with a different strategy, based on the fact that java's char is very close to the ideals of UCS2 (in that it is a 16-bit number, hardcoded in the java spec): you can just loop through all characters (as in, java's char) and discard anything such that c >= 0xD800 && c < 0xE000 - that range covers both the first and the second half of a surrogate pair, so it gets rid of the roses.
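A minimal sketch of that char-based variant, producing the same big-endian 2-byte output as the codepoint loop above:

String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
ByteArrayOutputStream out = new ByteArrayOutputStream();
for (int i = 0; i < in.length(); i++) {
    char c = in.charAt(i);
    // both halves of a surrogate pair land in this range, so this skips the roses entirely
    if (c >= 0xD800 && c < 0xE000) continue;
    out.write(c >> 8);
    out.write(c);
}
byte[] ucs2 = out.toByteArray();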