
I have a UTF-8 byte sequence written out as an escaped literal, like this: "\xE2\x80\x93".

I'm trying to convert this into the Unicode character it encodes, using Java.

But I have not been able to find a way to do the conversion.

Can anyone help me with this?

Regards, Sat

hippietrail
Sat
  • You would have to parse the String into a `char[]` and then convert it into your desired `String`. – Luiggi Mendoza Jun 23 '13 at 16:01
  • the [`byte[]`](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])) constructor of String is the answer to your problem. If necessary, also provide the charset name. – Marko Topolnik Jun 23 '13 at 16:14
  • It's not clear what exactly you have as input. Something like `String input = "\xE2\x80\x93";`? – axtavt Jun 23 '13 at 16:15
  • `"\xE2\x80\x93."` is not a valid string literal in Java. All string literals in Java are UTF-16. Can you be more explicit about where you are sourcing the data? – McDowell Jun 23 '13 at 16:28

2 Answers

System.out.println(new String(new byte[] {
    (byte)0xE2, (byte)0x80, (byte)0x93 }, "UTF-8"));

prints an en dash (U+2013), which is what those three bytes encode in UTF-8. It is not clear from your question whether you have these three bytes or, literally, the string you posted. If you have the string, parse it into bytes first, for example:

// Split on the literal "\x" marker; element 0 is the empty string before the first marker.
final String[] bstrs = "\\xE2\\x80\\x93".split("\\\\x");
final byte[] bytes = new byte[bstrs.length - 1];
for (int i = 1; i < bstrs.length; i++)
  bytes[i - 1] = (byte) Integer.parseInt(bstrs[i], 16); // narrowing cast keeps the low 8 bits
System.out.println(new String(bytes, "UTF-8"));
Marko Topolnik

You can use the Apache Commons Lang StringEscapeUtils.

Or if you do know that the string will always be \xHH\xHH then you can:

// "\x" is not a valid Java escape; the backslash must be doubled to match a literal \x.
String hex = input.replace("\\x", "");
byte[] bytes = hexStringToByteArray(hex);
String result = new String(bytes, "UTF-8");

hexStringToByteArray is here.

Also see this other SO answer.
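The hexStringToByteArray helper is not reproduced in this answer. A minimal sketch of what such a helper might look like follows, assuming the input has an even number of hex digits once the \x markers are stripped. The class name HexDecode and the method body are illustrative, not the linked implementation.

```java
import java.nio.charset.StandardCharsets;

class HexDecode {
    // Hypothetical helper: turns an even-length hex string such as "E28093" into bytes.
    static byte[] hexStringToByteArray(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            // Character.digit returns -1 for anything that is not a hex digit.
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "\\xE2\\x80\\x93";          // the characters \xE2\x80\x93
        String hex = input.replace("\\x", "");      // "E28093"
        String result = new String(hexStringToByteArray(hex), StandardCharsets.UTF_8);
        System.out.println(result.equals("\u2013")); // true: the bytes decode to an en dash
    }
}
```

This only works when the whole input consists of \xHH escapes; ordinary text mixed in would be misread as hex digits or rejected.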

Ayman
  • We are using a Hadoop HBase table to store the data, and when the data is stored in the table it appears in this format: \xE2\x80\x93. We tried to convert this into Unicode using StringEscapeUtils and some other utilities, but nothing helped. – Sat Jun 23 '13 at 16:53
  • Did you try my other suggestion, to manually convert the \xHH to a byte array and then decode? – Ayman Jun 23 '13 at 16:55
  • @Marko I have a string that contains that UTF-8 value inside ordinary text, such as "We celebrate the ideas \xE2\x80\x93". If my string contains "We celebrate the idea\xE2\x80\x93s", I use a regex to replace the \x with 0x, so it looks like "We celebrate the idea0xE20x80x93s". Is there a way to parse this data and get only the hexadecimal values? – Sat Jun 23 '13 at 19:47
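For the mixed case raised in the last comment, one possible approach is to leave the \x markers in place and decode each run of consecutive \xHH escapes as a single UTF-8 byte sequence, copying the surrounding text unchanged. The sketch below assumes the escapes always use exactly two hex digits and that the rest of the text contains no literal \x that should be preserved. The class name EscapedUtf8 and the method decode are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedUtf8 {
    // One or more consecutive \xHH escapes, matched as a single run so that
    // multi-byte UTF-8 sequences are decoded together.
    private static final Pattern RUN = Pattern.compile("(?:\\\\x[0-9A-Fa-f]{2})+");

    static String decode(String input) {
        Matcher m = RUN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String run = m.group();
            // Each escape is four characters long: '\', 'x', and two hex digits.
            byte[] bytes = new byte[run.length() / 4];
            for (int i = 0; i < bytes.length; i++) {
                String hh = run.substring(4 * i + 2, 4 * i + 4);
                bytes[i] = (byte) Integer.parseInt(hh, 16);
            }
            String decoded = new String(bytes, StandardCharsets.UTF_8);
            // quoteReplacement guards against '$' or '\' in the decoded text.
            m.appendReplacement(sb, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "We celebrate the idea\\xE2\\x80\\x93s";
        System.out.println(decode(raw).equals("We celebrate the idea\u2013s")); // true
    }
}
```

Decoding whole runs rather than single escapes matters here: the three bytes E2 80 93 only form a valid character when handed to the decoder together.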