Displaying UTF-8 Emoji in Java

Question

Say I have the 😈 (devil) emoji.

In 4-byte UTF-8, it's represented like so: \u00f0\u009f\u0098\u0088

However, in Java, it will only print correctly like so: \ud83d\ude08

How would I convert from the first to the second?

UPDATE 2

MNEMO's answer is much simpler, and answers my question, so it's probably better to go with his solution.

UPDATE

Thanks Basil Bourque for the write-up. It was very interesting.

I found a good reference here: https://github.com/pRizz/Unicode-Converter/blob/master/conversionfunctions.js (particularly the convertUTF82Char() function).

For anyone wandering by here in the future, here's what that looks like in Java:

public static String fromCharCode(int n) {
    char c = (char)n;
    return Character.toString(c);
}

public static String decToChar(int n) {
    // converts a single code point to a string of one or two UTF-16 chars
    // note that values above 0xFFFF are split into a high and a low surrogate
    // n: int, the code point to be converted
    String result = "";
    if (n <= 0xFFFF) {
        result += fromCharCode(n);
    } else if (n <= 0x10FFFF) {
        n -= 0x10000;
        result += fromCharCode(0xD800 | (n >> 10)) + fromCharCode(0xDC00 | (n & 0x3FF));
    } else {
        result += "dec2char error: Code point out of range: " + decToHex(n);
    }

    return result;
}

public static String decToHex(int n) {
    return Integer.toHexString(n).toUpperCase();
}

public static String convertUTF8_toChar(String str) {
    // converts to characters a sequence of space-separated hex numbers representing bytes in utf8
    // str: string, the sequence to be converted
    String outputString = "";
    int counter = 0; // continuation bytes still expected
    int n = 0;       // code point being assembled

    // remove leading and trailing spaces
    // (Java regexes are plain strings, so the JavaScript /.../ delimiters and the g flag are dropped)
    str = str.replaceAll("^\\s+", "");
    str = str.replaceAll("\\s+$", "");
    if (str.length() == 0) {
        return "";
    }

    // collapse runs of whitespace to a single space
    str = str.replaceAll("\\s+", " ");

    String[] listArray = str.split(" ");
    for (int i = 0; i < listArray.length; i++) {
        int b = Integer.parseInt(listArray[i], 16);
        switch (counter) {
            case 0:
                if (0 <= b && b <= 0x7F) { // 0xxxxxxx
                    outputString += decToChar(b);
                } else if (0xC0 <= b && b <= 0xDF) { // 110xxxxx
                    counter = 1;
                    n = b & 0x1F;
                } else if (0xE0 <= b && b <= 0xEF) { // 1110xxxx
                    counter = 2;
                    n = b & 0xF;
                } else if (0xF0 <= b && b <= 0xF7) { // 11110xxx
                    counter = 3;
                    n = b & 0x7;
                } else {
                    outputString += "convertUTF82Char: error1 " + decToHex(b) + "! ";
                }
                break;
            case 1:
                if (b < 0x80 || b > 0xBF) {
                    outputString += "convertUTF82Char: error2 " + decToHex(b) + "! ";
                }
                counter--;
                outputString += decToChar((n << 6) | (b - 0x80));
                n = 0;
                break;
            case 2:
            case 3:
                if (b < 0x80 || b > 0xBF) {
                    outputString += "convertUTF82Char: error3 " + decToHex(b) + "! ";
                }
                n = (n << 6) | (b - 0x80);
                counter--;
                break;
        }
    }

    return outputString.replaceAll(" $", "");
}

Pretty much a 1-for-1 copy, but it accomplishes my goal.
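
As a cross-check on the surrogate arithmetic in decToChar(), the JDK's Character.toChars(int) performs the same split for code points above U+FFFF. The following is only a sketch; the class name SurrogateCheck and the helper manualSplit are made up for illustration and simply restate the arithmetic from the code above.

```java
import java.util.Arrays;

class SurrogateCheck {
    // Same arithmetic as decToChar() for code points above U+FFFF (hypothetical helper).
    static char[] manualSplit(int codePoint) {
        int n = codePoint - 0x10000;
        return new char[] { (char) (0xD800 | (n >> 10)), (char) (0xDC00 | (n & 0x3FF)) };
    }

    public static void main(String[] args) {
        int codePoint = 0x1F608; // SMILING FACE WITH HORNS
        char[] manual = manualSplit(codePoint);
        // Character.toChars returns one char for BMP code points, two (a surrogate pair) otherwise.
        char[] library = Character.toChars(codePoint);
        System.out.println(Integer.toHexString(manual[0]) + " " + Integer.toHexString(manual[1]));
        System.out.println(Arrays.equals(manual, library));
    }
}
```

For U+1F608 both approaches give the pair d83d de08, which is the \ud83d\ude08 form from the question.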

It is recommended to learn more about character encoding and the Unicode system if you want to solve this problem. 4-byte UTF-8 is a sequence of bytes, not a Unicode code point itself. – MNEMO, Jun 01 '20 at 04:43
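
To illustrate the point in that comment: if the four UTF-8 octets have ended up as four separate chars in a Java String (as in the \u00f0\u009f\u0098\u0088 form from the question), one possible repair is to turn the chars back into bytes with ISO-8859-1, which maps U+0000 to U+00FF one-to-one onto byte values, and then decode those bytes as UTF-8. This is a sketch under that assumption; the class name Latin1Roundtrip and the method redecode are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Latin1Roundtrip {
    // Reinterprets a string whose chars each hold one UTF-8 octet (values 0x00-0xFF).
    static String redecode(String octetsAsChars) {
        byte[] raw = octetsAsChars.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String mangled = "\u00f0\u009f\u0098\u0088"; // four chars, one per UTF-8 octet
        String fixed = redecode(mangled);
        // One code point, stored as two UTF-16 code units.
        System.out.println(Integer.toHexString(fixed.codePointAt(0)) + " length=" + fixed.length());
    }
}
```

This only works when every char in the input is at most U+00FF; anything else is not a byte value and would be replaced during the ISO-8859-1 step.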

Basil Bourque · Answer 1 · 2020-06-01T07:06:39.313

The SMILING FACE WITH HORNS character (😈) is assigned to code point 128,520 decimal (1F608 hexadecimal) in Unicode.

You have a choice in how to represent that number with a series of octets.

UTF-8 is one way to represent that number with a variable length, using 1-4 octets.
- UTF-8 is becoming the dominant encoding in many spheres.
- Java source code files are usually written in UTF-8, in my experience, and as discussed here.
UTF-16 is another way, also variable-length, but using either 2 octets or 4.
- The Java language uses UTF-16 internally.
- UTF-8 is generally preferred over UTF-16, as discussed here.
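
The two representations can be compared in memory. The sketch below encodes the same code point both ways and also shows that a Java String counts it as two chars but one code point. The class name EncodingLengths and the helper hex are illustrative, not part of any library.

```java
import java.nio.charset.StandardCharsets;

class EncodingLengths {
    // Renders bytes as space-separated uppercase hex (hypothetical helper).
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the string from the code point so the example does not depend on source-file encoding.
        String s = new String(Character.toChars(0x1F608));
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("chars: " + s.length() + ", code points: " + s.codePointCount(0, s.length()));
    }
}
```

UTF_16BE is used here instead of UTF_16 so that no byte-order mark is prepended to the output.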

In most text editors, you can simply paste the single character into your source code. When the file is saved as UTF-8, the editor writes the necessary series of octets.

When writing this character to a text file, or otherwise serializing to a stream of octets, you can choose to use either UTF-8 or UTF-16.

The following are a couple of trials. You can examine the resulting files with a hex editor to see the octets.

UTF-8

This code generates a file in UTF-8 encoding. We find four octets, hex values F0 9F 98 88, decimal values 240 159 152 136.

You can find this code discussed at the Oracle Java Tutorial.

Notice how we specify an encoding for our file, StandardCharsets.UTF_8.

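// Requires: java.io.BufferedWriter, java.io.IOException, java.nio.charset.Charset,
// java.nio.charset.StandardCharsets, java.nio.file.Files, java.nio.file.Path, java.nio.file.Paths
Path file = Paths.get( "/Users/basilbourque/devil_utf-8.txt" );
Charset charset = StandardCharsets.UTF_8;
String s = "😈";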
try ( BufferedWriter writer = Files.newBufferedWriter( file , charset ) )
{
    writer.write( s , 0 , s.length() );
}
catch ( IOException e )
{
    e.printStackTrace();
}

UTF-16

This code generates a file in UTF-16 encoding. We find 6 octets, 4 octets for our single character, plus a prefix of 2 octets for a BOM (FE FF). Our four octets in decimal are 216 061 222 008, in hex are D8 3D DE 08.

Same code as above, but we switched the Charset to StandardCharsets.UTF_16.

Path file = Paths.get( "/Users/basilbourque/devil_utf-16.txt" );
Charset charset = StandardCharsets.UTF_16;
String s = "😈";
try ( BufferedWriter writer = Files.newBufferedWriter( file , charset ) )
{
    writer.write( s , 0 , s.length() );
}
catch ( IOException e )
{
    e.printStackTrace();
}
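
If a hex editor is not at hand, the octets can also be read back in Java. The following sketch writes the character to a temporary file (not the paths used in the trials) with each charset and returns what landed on disk. The class name FileOctets and the method writeAndReadBack are made up for this example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileOctets {
    // Writes s to a temporary file in the given charset and returns the octets on disk.
    static byte[] writeAndReadBack(String s, Charset charset) {
        try {
            Path file = Files.createTempFile("devil", ".txt");
            try (BufferedWriter writer = Files.newBufferedWriter(file, charset)) {
                writer.write(s, 0, s.length());
            }
            byte[] octets = Files.readAllBytes(file);
            Files.deleteIfExists(file);
            return octets;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F608));
        for (Charset cs : new Charset[] { StandardCharsets.UTF_8, StandardCharsets.UTF_16 }) {
            StringBuilder sb = new StringBuilder();
            for (byte b : writeAndReadBack(s, cs)) {
                sb.append(String.format("%02X ", b));
            }
            System.out.println(cs.name() + ": " + sb.toString().trim());
        }
    }
}
```

The UTF-16 line should start with the FE FF byte-order mark, followed by the same four octets described above.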

About Unicode and encodings

To learn the basics of Unicode and encodings, read the post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

MNEMO · Accepted Answer · 2020-06-02T01:22:06.097

Well, this is probably unnecessary to add, but once you understand character encodings and the concepts behind Unicode, the following code might work for you.

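import java.nio.charset.StandardCharsets;

// The four UTF-8 octets of U+1F608
byte[] a = { (byte)0xf0, (byte)0x9f, (byte)0x98, (byte)0x88 };
// Decode the octets, then re-encode as big-endian UTF-16 without a BOM.
// Using StandardCharsets constants avoids the checked UnsupportedEncodingException.
String s = new String(a, StandardCharsets.UTF_8);
byte[] b = s.getBytes(StandardCharsets.UTF_16BE);
for ( byte c : b ) { System.out.printf("%02x ",c); } // prints: d8 3d de 08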

It does indeed work, and it's much simpler than what I ended up with. Now all I'd have to do is print it in the format I stated. Thanks. – InfexiousBand Jun 02 '20 at 01:41
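
For the remaining formatting step mentioned in that comment, one possible approach is to walk the decoded String and emit each UTF-16 code unit as a four-digit hex escape. This is a minimal sketch; the class name Utf8ToEscapes and the method toJavaEscapes are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8ToEscapes {
    // Decodes UTF-8 octets and renders each UTF-16 code unit as a backslash-u escape.
    static String toJavaEscapes(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            // charAt returns code units, so a supplementary character yields two escapes.
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xf0, (byte) 0x9f, (byte) 0x98, (byte) 0x88 };
        System.out.println(toJavaEscapes(utf8));
    }
}
```

For the four octets from the question this yields \ud83d\ude08.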
