I know how to convert Unicode values to characters thanks to this question, but that approach doesn't work so well when I am doing bitwise operations on those values.

String.fromCharCode() is a JavaScript function that converts Unicode code units into characters. I would like to know its equivalent in Java, one capable of taking the results of bitwise operations as parameters.
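
In other words, I am looking for something that behaves roughly like this hypothetical helper (the name fromCharCode is just my placeholder, not an existing JDK method):

// Hypothetical sketch of a fromCharCode-style helper for values that fit
// in a single UTF-16 code unit (0x0000 to 0xFFFF).
static String fromCharCode(int... codeUnits) {
  StringBuilder sb = new StringBuilder(codeUnits.length);
  for (int cu : codeUnits) {
    sb.append((char) cu); // each int is treated as one UTF-16 code unit
  }
  return sb.toString();
}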

This code will not compile:

public String str2rstr_utf8(String input) {
  String output = "";
  int i = -1;
  int x, y;
  while (++i < input.length()) {
    /* Decode utf-16 surrogate pairs */
    x = Character.codePointAt(input, i);
    y = i + 1 < input.length() ? Character.codePointAt(input, i + 1) : 0;
    if (0xD800 <= x && x <= 0xDBFF && 0xDC00 <= y && y <= 0xDFFF) {
      x = 0x10000 + ((x & 0x03FF) << 10) + (y & 0x03FF);
      i++;
    }
    /* Encode output as utf-8 */
    if (x <= 0x7F) output += String.fromCharCode(x);
    else if (x <= 0x7FF) output += String.fromCharCode(0xC0 | ((x >>> 6) & 0x1F), 0x80 | (x & 0x3F));
    else if (x <= 0xFFFF) output += String.fromCharCode(0xE0 | ((x >>> 12) & 0x0F), 0x80 | ((x >>> 6) & 0x3F), 0x80 | (x & 0x3F));
    else if (x <= 0x1FFFFF) output += String.fromCharCode(0xF0 | ((x >>> 18) & 0x07), 0x80 | ((x >>> 12) & 0x3F), 0x80 | ((x >>> 6) & 0x3F), 0x80 | (x & 0x3F));
  }
  return output;
}
syb0rg

2 Answers

If I'm not mistaken, you are trying to encode a Java string in UTF-8. There's direct support for it in Java:

public byte[] str2rstr_utf8(String str)
{
    // Uses java.nio.charset.Charset; encodes the string as UTF-8 and returns the raw bytes.
    return str.getBytes(Charset.forName("UTF-8"));
}
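
For example (a quick check, assuming the method above is in scope), "é" (U+00E9) encodes to the two UTF-8 bytes 0xC3 0xA9:

// Requires java.util.Arrays for the printout.
byte[] bytes = str2rstr_utf8("é");          // U+00E9
System.out.println(Arrays.toString(bytes)); // prints [-61, -87], i.e. 0xC3 0xA9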
Codo
  • @syb0rg: I don't understand how your comment relates to my answer. – Codo Dec 31 '12 at 22:57
  • If you look at my answer, you see that it takes in integer parameters. That is because whatever is passed to `String.fromCharCode()` is an integer, and not a string. – syb0rg Dec 31 '12 at 22:59
  • @syb0rg: If you look at my answer, you'll see that it's an implementation of `str2rstr_utf8` and not of `String.fromCharCode`. – Codo Dec 31 '12 at 23:01
  • That gets rid of all the bitwise operations going on in the function. – syb0rg Dec 31 '12 at 23:03

What you are essentially doing is converting a UTF-16 encoded input string into a UTF-16 encoded output string whose characters hold the values of the UTF-8 encoded bytes. You almost never need to do that in Unicode programming! But on the off chance that you actually do (for example, to interact with a third-party API that requires such an oddly formatted string), you can accomplish the same thing without doing the bitwise operations manually; let Java do the work for you:

public String str2rstr_utf8(String input)
{
    // Encode the string as UTF-8, then copy each byte into a char of the output string.
    byte[] utf8 = input.getBytes(Charset.forName("UTF-8"));
    StringBuilder output = new StringBuilder(utf8.length);
    for (int i = 0; i < utf8.length; ++i)
        output.append((char)(utf8[i] & 0xFF)); // mask to avoid sign extension of bytes >= 0x80
    return output.toString();
}
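
As a quick sanity check of the round trip (just a sketch, assuming the method above): each char of the returned string holds one UTF-8 byte value, so masking with 0xFF recovers the original bytes.

// Recover the UTF-8 byte values from the "binary" string produced above.
String bin = str2rstr_utf8("é");      // U+00E9 encodes as 0xC3 0xA9
for (int i = 0; i < bin.length(); i++) {
    int b = bin.charAt(i) & 0xFF;     // back to an unsigned byte value
    System.out.printf("0x%02X%n", b); // prints 0xC3, then 0xA9
}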
Remy Lebeau
  • Not an API, it's for encryption. – syb0rg Jan 01 '13 at 03:20
  • What kind of encryption requires you to store UTF-8 encoded bytes inside of a UTF-16 encoded string? Unicode is hard enough as it is to use with encryption. Now you are adding an extra unnecessary complication to the mix. – Remy Lebeau Jan 01 '13 at 06:25
  • This can be done with just `return new String(input.getBytes(Charset.forName("UTF-8")), "ISO-8859-1")` (a compilable sketch follows below). And in JavaScript this is normal: many APIs treat strings as binary strings, since the language has no byte array. – Esailija Jan 07 '13 at 12:14
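
For completeness, a compilable sketch of that one-liner using StandardCharsets (available since Java 7), which avoids the checked UnsupportedEncodingException thrown by the String(byte[], String) constructor:

import java.nio.charset.StandardCharsets;

public String str2rstr_utf8(String input)
{
    // Re-interpret the UTF-8 bytes as ISO-8859-1 so each byte maps to the
    // char with the same numeric value (0x00 to 0xFF).
    return new String(input.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
}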