In my HBase table there are some encoded emoji, such as \xF0\x9F\x8C\x8F and \xE2\x9A\xBE. I am trying to use Bytes.toString() to decode them. However, this method uses UTF-8, and it seems to decode only three-byte sequences like \xE2\x9A\xBE; four-byte sequences like \xF0\x9F\x8C\x8F show up as a question mark (see below). How can I decode the four-byte sequences to emoji and print them out? Does anybody have an idea? Thanks in advance!

Example:

The result should be: [image: the emoji rendered correctly]

But I got: [image: a question mark in place of the four-byte emoji]

I am sorry, I forgot to mention that I am using a servlet to query HBase and write the content to the response.

dibugger
  • Are you certain that the encoding is UTF-8? If so, then your conversion is correct, but your OS might not know how to render the character behind your emoji. – Japu_D_Cret Mar 31 '17 at 10:01
  • `new String(theBytes, theCharset)`? –  Mar 31 '17 at 10:02
  • Are you asking about the method that I use, Bytes.toString()? The official documentation shows that this method uses UTF-8 as its default encoding, and there is no way to change that. I also looked into HBase and found that the emoji are stored in a Unicode encoding, like \xF0\x9F\x8C\x8F. – dibugger Mar 31 '17 at 10:05

1 Answer

When I read a file that contains the character "🌏" (bytes F09F8C8F, code point U+1F30F) with a BOM that indicates UTF-8 encoding, and I decode it as UTF-8 by using

byte[] encoded = Files.readAllBytes(selectedFile.toPath());
String fileContents = new String(encoded, StandardCharsets.UTF_8);

the resulting String is decoded correctly and displayed correctly in my Java Swing application. But if I print the same String to the console, I get a boxed question mark instead of the symbol. So the character is decoded correctly; it is only the output side that mangles it.

To recreate this, you can use this:

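import java.nio.charset.StandardCharsets;
import javax.swing.JDialog;
import javax.swing.JLabel;
import javax.swing.WindowConstants;

public class EncodingTest {
  public static void main(String[] args) throws Exception {
    // UTF-8 bytes of U+1F30F
    byte[] encoded = { (byte) 0xF0, (byte) 0x9F, (byte) 0x8C, (byte) 0x8F };
    String convertedstring = new String(encoded, StandardCharsets.UTF_8);

    // Console output: depends on the console's encoding and font
    System.out.println("convertedstring: " + convertedstring);

    // Swing output: shows the same String in a small dialog
    JDialog dialog = new JDialog();
    dialog.setSize(300, 100);
    dialog.setLocationRelativeTo(null);
    dialog.setTitle("encoding-test");
    dialog.setDefaultCloseOperation(WindowConstants.DISPOSE_ON_CLOSE);
    JLabel label = new JLabel("convertedstring: " + convertedstring);
    dialog.add(label);

    dialog.setVisible(true);
  }
}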

Console Output

[screenshot: the console prints a boxed question mark instead of the emoji]

JDialog Output

[screenshot: the JDialog displays the emoji correctly]
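If you want to check the decoded String without relying on how anything renders it, you can inspect its code points directly. This is only a minimal sketch; the class name CodePointCheck and the helper decode are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class CodePointCheck {
    // Hypothetical helper: decodes raw bytes as UTF-8, like new String(bytes, UTF_8) above
    static String decode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] encoded = { (byte) 0xF0, (byte) 0x9F, (byte) 0x8C, (byte) 0x8F };
        String s = decode(encoded);

        // length() counts UTF-16 code units; a character outside the BMP takes two
        System.out.println("chars: " + s.length());
        // codePointCount() counts actual Unicode characters
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        // Print the code point in hex so no font or console encoding is involved
        System.out.println("code point: U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase());
    }
}
```

If this reports 2 chars, 1 code point and U+1F30F, the decoding step is fine and the problem is somewhere on the output path.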

You might also want to see "Default character encoding for java console output" and "Java, UTF-8, and Windows console".
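For the console side, one thing you could try is wrapping System.out in a PrintStream with an explicit UTF-8 encoding, so the bytes sent do not depend on the platform default. This is a rough sketch (Utf8Console and utf8Stream are names I made up), and it only helps if the terminal itself interprets its input as UTF-8 and has a font with the glyph.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class Utf8Console {
    // Hypothetical helper: a PrintStream that always encodes as UTF-8
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] encoded = { (byte) 0xF0, (byte) 0x9F, (byte) 0x8C, (byte) 0x8F };
        String convertedstring = new String(encoded, StandardCharsets.UTF_8);

        // The bytes written are UTF-8; whether the glyph shows up is up to the terminal
        PrintStream out = utf8Stream(System.out);
        out.println("convertedstring: " + convertedstring);
    }
}
```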

Japu_D_Cret
  • Thanks for your reply. I am sorry, I forgot to mention that I am using a servlet to query HBase and write the content to the response. – dibugger Mar 31 '17 at 10:41
  • @DiLuo Can you see the emoji in my answer? If not, then you might want to update your browser. Also check which content type your HTTP response has. – Japu_D_Cret Mar 31 '17 at 10:48
  • I cannot see the emoji with your code, but I set the content type and the encoding of the header to UTF-16 and now I can see the emoji in the response. Thank you for the inspiration! – dibugger Mar 31 '17 at 10:59
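To illustrate why the response encoding matters, here is a small stand-alone sketch that writes the same String through a Writer with different charsets and checks whether it survives. It does not use the servlet API; ResponseEncodingSketch and writeBody are hypothetical names standing in for a response writer. In a real servlet the corresponding step would presumably be calling response.setCharacterEncoding(...) or response.setContentType(...) with a charset before response.getWriter().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseEncodingSketch {
    // Hypothetical stand-in for a response body: encodes the text with the given charset
    static byte[] writeBody(String body, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F30F));
        Charset[] candidates = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16, StandardCharsets.ISO_8859_1
        };
        for (Charset cs : candidates) {
            byte[] wire = writeBody(emoji, cs);
            // Decode with the same charset, as a client that honours the declared encoding would
            String received = new String(wire, cs);
            System.out.println(cs + " round-trips: " + received.equals(emoji));
        }
    }
}
```

UTF-8 and UTF-16 can both represent the character, so either should work as long as the declared charset matches the bytes actually written; a charset such as ISO-8859-1 cannot represent it, and the writer substitutes a replacement character instead.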