How to decode emoji (unicode) in HBase using Java API?

Question

In my HBase table there are some encoded emoji, such as \xF0\x9F\x8C\x8F and \xE2\x9A\xBE. I am trying to decode them with Bytes.toString(). However, this method uses UTF-8, and it seems to decode only three-byte sequences such as \xE2\x9A\xBE; a four-byte sequence such as \xF0\x9F\x8C\x8F comes out as a question mark (see below). How can I decode the four-byte sequences to emoji and print them? Does anybody have an idea? Thanks in advance!

Example:

The result should be:

But I got

I am sorry, I forgot to mention that I am using a servlet to query HBase and write the content to the response.

Are you certain that the encoding is UTF-8? If so, then your conversion is correct, but your OS might not know how to render the character behind your emoji. — Japu_D_Cret, Mar 31 '17 at 10:01
Are you asking about the method that I use, Bytes.toString()? The official documentation says that this method uses UTF-8 as its default encoding and there is no way to change it. And when I look into HBase, I find that the emojis are stored as encoded bytes like \xF0\x9F\x8C\x8F. — dibugger, Mar 31 '17 at 10:05

score 1 · Answer 1 · edited May 23 '17 at 11:54

When I read a file that contains the character "🌏" (bytes F0 9F 8C 8F, or U+1F30F), which has a BOM indicating UTF-8 encoding, and I decode it as UTF-8 by using

byte[] encoded = Files.readAllBytes(selectedFile.toPath());
String fileContents = new String(encoded, StandardCharsets.UTF_8);

the resulting String is correctly converted and correctly displayed in my Java Swing application. But if I print the same String to the console, I get a boxed question mark instead of the symbol. So the character is decoded correctly; it is only your output that garbles it.

To recreate this, you can use this:

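import java.nio.charset.StandardCharsets;
import javax.swing.JDialog;
import javax.swing.JLabel;
import javax.swing.WindowConstants;

public class EncodingTest {
  public static void main(String[] args) throws Exception {
    // The four UTF-8 bytes of U+1F30F, as stored in HBase
    byte[] encoded = { (byte) 0xF0, (byte) 0x9F, (byte) 0x8C, (byte) 0x8F };
    String convertedstring = new String(encoded, StandardCharsets.UTF_8);

    // Console output: depends on the console's encoding and font
    System.out.println("convertedstring: " + convertedstring);

    // Swing output: shows the same String in a dialog
    JDialog dialog = new JDialog();
    dialog.setSize(300, 100);
    dialog.setLocationRelativeTo(null);
    dialog.setTitle("encoding-test");
    dialog.setDefaultCloseOperation(WindowConstants.DISPOSE_ON_CLOSE);
    JLabel label = new JLabel("convertedstring: " + convertedstring);
    dialog.add(label);

    dialog.setVisible(true);
  }
}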

Console Output

JDialog Output
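To confirm that the decoding itself is fine without relying on any console or font, one option is to print the code points of the decoded String instead of its glyphs. This is only a minimal sketch: `EmojiDecodeCheck`, `decodeUtf8` and `codePoints` are hypothetical names, and `new String(raw, StandardCharsets.UTF_8)` is used as a stand-in for HBase's `Bytes.toString(byte[])`, which also decodes as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class EmojiDecodeCheck {

    // Stand-in for Bytes.toString(byte[]) (assumption: same UTF-8 decoding).
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Lists the code points of a String as "U+XXXX" tokens, so the result
    // can be inspected without depending on how the output is rendered.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] fourByte = { (byte) 0xF0, (byte) 0x9F, (byte) 0x8C, (byte) 0x8F };
        byte[] threeByte = { (byte) 0xE2, (byte) 0x9A, (byte) 0xBE };

        String globe = decodeUtf8(fourByte);
        String baseball = decodeUtf8(threeByte);

        System.out.println(codePoints(globe));    // U+1F30F
        System.out.println(codePoints(baseball)); // U+26BE
        // A code point above U+FFFF occupies two Java chars (a surrogate pair).
        System.out.println(globe.length());       // 2
    }
}
```

If this prints U+1F30F, the four-byte sequence was decoded correctly and any question mark appears later, when the String is encoded again for output.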

You might also want to see "Default character encoding for java console output" and "Java, UTF-8, and Windows console".
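For the console case, one approach is to wrap `System.out` in a `PrintStream` that encodes as UTF-8 explicitly, instead of relying on the platform default. A rough sketch follows; `Utf8Console` and `utf8` are hypothetical names. It only controls the bytes Java writes; the terminal still has to interpret them as UTF-8 and have a font with the glyph.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {

    // Wraps a stream in an auto-flushing PrintStream that encodes text as UTF-8.
    public static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String globe = new String(Character.toChars(0x1F30F));
        utf8(System.out).println("convertedstring: " + globe);
    }
}
```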

Thanks for your reply. But I am sorry, I forgot to mention that I am using a servlet to query HBase and write the content to the response. — dibugger, Mar 31 '17 at 10:41
@DiLuo Can you see the emoji in my response? If not, then you might want to update your browser. Also check which HTML content type your response has. — Japu_D_Cret, Mar 31 '17 at 10:48
I cannot see the emoji with your code, but I set the content type and the encoding of the header to UTF-16 and now I can see the emoji in the response. Thank you for your inspiration! — dibugger, Mar 31 '17 at 10:59
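The effect of the response encoding can be sketched with the standard library alone. The example below is an illustration under assumptions, not the asker's actual servlet: `ResponseEncodingSketch` and `writeBody` are hypothetical names, and an `OutputStreamWriter` stands in for the response writer. In a real servlet the charset would typically be chosen with `response.setCharacterEncoding("UTF-8")` or `response.setContentType("text/html; charset=UTF-8")` before calling `response.getWriter()`; if none is set, the Servlet specification's default is ISO-8859-1, which cannot represent emoji.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseEncodingSketch {

    // Encodes text into bytes the way a writer bound to the given charset would.
    public static byte[] writeBody(String text, Charset charset) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(body, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toByteArray();
    }

    public static void main(String[] args) {
        String globe = new String(Character.toChars(0x1F30F));

        // ISO-8859-1 has no mapping for the emoji, so it is replaced on output.
        byte[] latin1 = writeBody(globe, StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1).equals(globe)); // false

        // UTF-8 and UTF-16 can both represent it, so the round trip is lossless.
        byte[] utf8 = writeBody(globe, StandardCharsets.UTF_8);
        byte[] utf16 = writeBody(globe, StandardCharsets.UTF_16);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(globe));   // true
        System.out.println(new String(utf16, StandardCharsets.UTF_16).equals(globe)); // true
    }
}
```

Either UTF-8 or UTF-16 works as long as the declared charset matches the bytes actually written; UTF-8 is the more common choice for HTTP responses.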
