Unfortunately, neither snippet actually works, and that's because you misunderstand UTF-16 encoding. UTF-16 CAN encode those emojis; it is NOT fixed width. There is no such thing as 'fixed width UTF-16 encoding'. There is... UCS2, which is not UTF-16. The BE part doesn't make it 'fixed width', it merely locks in the endianness. That is why both of these print the roses. Java unfortunately doesn't ship with a UCS2 encoding, which makes this job harder, and uglier.
Furthermore, both snippets call forbidden methods.
Anytime you convert bytes to characters or vice versa, character conversion IS happening. You can't opt out of that. Nevertheless, a bunch of methods exist that do not take any parameter to indicate which charset you'd like to use for that conversion. These are the forbidden methods: they default to the 'system default' charset, and look as if somebody waved a magic wand and made it possible to convert chars to bytes or vice versa without worrying about character encoding.
The solution is to never use the forbidden methods. Better yet, tell your IDE to flag them as errors. The only exceptions are where you KNOW the API defaults not to 'platform default' but to something sane - the only one I know of is the Files.* API, which defaults to UTF-8 and not platform default. So, using the charset-less variants is acceptable there.
If you truly must have platform default (sensible for command line tools only), make it explicit by passing Charset.defaultCharset().
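To make that concrete, a quick sketch (the file name is just a placeholder; the charset-less Files.readString needs Java 11+):

// Files.* is the exception: its charset-less overloads default to UTF-8, not platform default
String content = Files.readString(Path.of("example.txt"));
// if you genuinely want the platform default, say so out loud
byte[] bytes = "Röses".getBytes(Charset.defaultCharset());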
The list of forbidden methods is quite long, but new String(bytes) and string.getBytes() are both on it. Do not use these methods/constructors. Ever.
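For contrast, the same operations done the forbidden way and the acceptable way:

// forbidden: both silently use whatever the platform default happens to be
byte[] bad = "Röses".getBytes();
String alsoBad = new String(bad);
// fine: the charset is spelled out
byte[] good = "Röses".getBytes(StandardCharsets.UTF_8);
String alsoGood = new String(good, StandardCharsets.UTF_8);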
Furthermore, your first snippet is all sorts of confused. You want to ENCODE a string to UTF-16, not decode it (a string is already characters and has no encoding. It is what it is. So why are you making a decoder when there is nothing to decode?):
String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
CharBuffer input = CharBuffer.wrap(in);
CharsetEncoder utf16Encoder = StandardCharsets.UTF_16BE.newEncoder();
utf16Encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
utf16Encoder.replaceWith(" ");
ByteBuffer encoded = utf16Encoder.encode(input);
System.out.println(new String(encoded.array(), StandardCharsets.UTF16_BE));
Or, your second snippet:
@Test
public void testEncodeProblem() {
    String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
    byte[] bytes = in.getBytes(StandardCharsets.UTF_16BE);
    String res = new String(bytes, StandardCharsets.UTF_16BE);
    System.out.println(res);
}
But, as I said, both just print the roses, because those are representable in UTF_16.
So, how to get the job done? Had java had a UCS2 encoding built in, it'd be as simple as replacing StandardCharsets.UTF_16BE with StandardCharsets.UCS2, but no such luck. So, I guess... probably 'by hand':
String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
ByteArrayOutputStream out = new ByteArrayOutputStream();
in.codePoints()
    .filter(a -> a < 65536)
    .forEach(a -> {
        out.write(a >> 8);
        out.write(a);
    });
// stream is ugly, but, because codePoints() was added in a time
// when oracle had just invented the shiny hammer, they are using it
// here for smearing butter on their sandwich. Silly geese. Oh well.
byte[] result = out.toByteArray();
// java has no way of reading UCS2 either. Decoding these bytes with
// UTF-16BE happens to work, because every codepoint below 65536 encodes
// to the same 2 bytes in UCS2 and UTF-16BE; it only goes wrong if the
// original string contained unpaired surrogates.
// Let's just print the bytes and check em, by hand:
for (byte r : result) System.out.print(" " + (r & 0xFF));
System.out.println();
// For the roses string, printing with UTF-16BE does actually work;
// it would only misbehave for input containing unpaired surrogates.
System.out.println(new String(result, StandardCharsets.UTF_16BE));
yay! Success!
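For completeness, reading those UCS2 bytes back is the same trick in reverse - a minimal sketch, reusing the result array from above:

StringBuilder sb = new StringBuilder();
for (int i = 0; i < result.length; i += 2) {
    // every 'character' is exactly 2 bytes, big endian
    sb.append((char) (((result[i] & 0xFF) << 8) | (result[i + 1] & 0xFF)));
}
System.out.println(sb); // RedRöses - the roses were discarded during encoding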
NB: codePointAt could work and avoid the ugly stream here, but cPA's input isn't in 'codepoint index' but in 'char index' and that makes matters rather complicated; you'd have to increment by 2 for any surrogate pair.
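If you do want that route anyway, a sketch (reusing the in string from above; Character.charCount does the 1-or-2 bookkeeping):

ByteArrayOutputStream out = new ByteArrayOutputStream();
for (int i = 0; i < in.length(); ) {
    int cp = in.codePointAt(i);
    if (cp < 65536) {
        out.write(cp >> 8);
        out.write(cp);
    }
    i += Character.charCount(cp); // advances by 2 over a surrogate pair, 1 otherwise
}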
Some introspection on unicode, UCS2, and UTF-16:
Unicode is a gigantic table that maps any number between 0 and 1,114,111 (0x10FFFF, a bit over 20 bits' worth) to a character, control concept, currency, punctuation, emoji, box drawing, or other characteresque concept.
An encoding like UTF-8 or US_ASCII defines a translation for some, or all, of these numbers into a series of bytes, such that it can also be decoded back to a sequence of codepoints. Codepoints are commonly stored in 32 bits, because they don't fit in 16, and no architecture out there meaningfully deals in e.g. 24-bit values.
In order to accommodate UCS2/UTF-16, there are NO characters in the unicode spec from 0xD800 to 0xDFFF; that is intentional, and there never will be.
This means UCS2 and UTF-16 are more or less the same thing, with one 'trick':
For any unicode number below 65536 (so it fits in 2 bytes), the UTF-16 encoding (which, remember, CAN encode emoji and such) is just... the number. Straight up. As 2 bytes. D800-DFFF can't happen, because those codepoints are intentionally not a thing.
For anything at or above 65536, that free block of D800 to DFFF is used to produce a so-called surrogate pair: the first 'character' (2 bytes in the D800-DBFF range) stores 10 bits of data, and the second 'character' (2 bytes in the DC00-DFFF range) stores another 10, for 20 bits total; add the 0x10000 offset and that is enough to cover everything up to 0x10FFFF.
Thus, UTF-16 will encode any unicode codepoint as either 2 bytes or 4 bytes.
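As a worked example of that math, this is how the rose (codepoint U+1F339) becomes the \uD83C\uDF39 pair used in the test string - nothing library-specific, just bit fiddling:

int cp = 0x1F339;                          // the rose
int v = cp - 0x10000;                      // 0xF339, fits in 20 bits
char high = (char) (0xD800 + (v >> 10));   // top 10 bits -> 0xD83C
char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits -> 0xDF39
System.out.printf("%04X %04X%n", (int) high, (int) low); // D83C DF39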
UCS-2 as a term has mostly lost its meaning. Originally it meant exactly 2 bytes per 'character', no more and no less, and it still means that, but the meaning of 'a character' has been twisted beyond recognition: that rose? It counts as 2 characters. Try it in java - x.length() returns 2, not 1. A somewhat sane definition of UCS-2 would be: 1 char really means 1 char, each char is represented by 2 bytes, and if you try to store a char that doesn't fit (would be a surrogate pair), well, those just cannot be encoded, so crash or apply the on-unrepresentable-character placeholder. Unfortunately, that's not (always) what UCS-2 means, which gets us back to having to write any code that applies this operation (discard / replace-with-placeholder any surrogate pairs so that length-in-bytes is exactly 2 * number of codepoints) ourselves.
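You can see that twisted counting directly:

String rose = "\uD83C\uDF39";
System.out.println(rose.length());                         // 2 - counts java chars
System.out.println(rose.codePointCount(0, rose.length())); // 1 - counts actual codepoints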
Note that this surrogate pair stuff provides you with a different strategy, based on the fact that java's char is very close to the ideals of UCS2 (in that it is a 16-bit number, hardcoded in the java spec): you can just loop through all characters (as in, java's char) and discard anything such that c >= 0xD800 && c < 0xE000 - that range covers both the first and the second half of a surrogate pair, so it gets rid of the roses.
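A minimal sketch of that char-based variant, producing the same big-endian 2-byte output as the codepoint loop above:

String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
ByteArrayOutputStream out = new ByteArrayOutputStream();
for (int i = 0; i < in.length(); i++) {
    char c = in.charAt(i);
    // both halves of a surrogate pair land in this range, so this skips the roses entirely
    if (c >= 0xD800 && c < 0xE000) continue;
    out.write(c >> 8);
    out.write(c);
}
byte[] ucs2 = out.toByteArray();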