35

I have the following value in a Java String variable, in which a Unicode character appears as an escape sequence:

Dodd\u2013Frank

instead of

Dodd–Frank

(Assume that I don't have control over how this value is assigned to this string variable)

Now how do I convert (encode) it properly and store it back in a String variable?

I found the following code

Charset.forName("UTF-8").encode(str);

This returns a ByteBuffer, but I want a String back.

Edit:

Some additional information.

When I use System.out.println(str); I get

Dodd\u2013Frank

I am not sure what the correct terminology is (UTF-8 or Unicode). Pardon me for that.

unwind
Sudar
  • 1
    the question is unclear to me. When you `System.out.println(yourString);` do you see (1) `Dodd\u2013Frank` or (2) `Dodd–Frank` ? – jlordo Dec 04 '12 at 10:06
  • 7
    Wrong, \u2013 is not a UTF-8 character, it is an escaped Unicode character. UTF-8 is a way of encoding Unicode characters. – SirDarius Dec 04 '12 at 10:06
  • @jlordo and SirDarius I have updated the question with details. – Sudar Dec 04 '12 at 10:08
  • 4
    Have a look at [StringEscapeUtils.unescapeJava()](http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringEscapeUtils.html#unescapeJava(java.lang.String)) – jlordo Dec 04 '12 at 10:13
  • Check the Apache Doc: https://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringEscapeUtils.html – ΦXocę 웃 Пepeúpa ツ Oct 19 '15 at 10:54
  • Just wanted to understand, why not `"Dodd\u2013Frank".chars().forEach(a -> System.out.print((char) a));` ? – Naman Jul 10 '18 at 16:56
  • `org.apache.commons.lang3.StringEscapeUtils` is deprecated, but moved to `commons-text` as `import org.apache.commons.text.StringEscapeUtils` which is not deprecated. – Chris Wolf Apr 05 '23 at 23:23

8 Answers

62

try

str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);

from Apache Commons Lang
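
If adding a library is not an option, the same idea can be sketched with the JDK alone. The class and method names below (`UnicodeEscapes`, `unescape`) are hypothetical, and this sketch handles only the backslash-u form with exactly four hex digits; it does not treat a doubled backslash specially and ignores the other Java escapes that `unescapeJava` understands.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapes {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each matched escape with the char it names.
    public static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash in source yields one literal backslash at runtime.
        System.out.println(unescape("Dodd\\u2013Frank"));
    }
}
```

What the console shows for the en dash depends on the terminal's encoding, so compare strings in code rather than by eye.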

Mark Rotteveel
jlordo
  • 2
    If `Java` itself provides the functionality of parsing the value, then why should we use any third-party tool? – Bhavik Ambani Dec 04 '12 at 10:17
  • 2
    @BhavikAmbani Then please explain how, because your answer definitely does not. – SirDarius Dec 04 '12 at 10:19
  • @BhavikAmbani in your own example, try `System.out.println(string);` before calling `getBytes();` and see what happens ;) – jlordo Dec 04 '12 at 10:19
  • How come? My answer solves the problem specified in the question: converting Unicode into a readable string format. – Bhavik Ambani Dec 04 '12 at 10:20
  • @jlordo I have pasted that as well; you can check that it prints the correct output, which I have taken from the console. – Bhavik Ambani Dec 04 '12 at 10:21
  • 1
    @BhavikAmbani nope, when he prints out his string, he sees `Dodd\u2013Frank`; when we print your string, we see `Dodd–Frank` (before any conversion). His String is `"Dodd\\u2013Frank"`, your String is `"Dodd\u2013Frank"`. – jlordo Dec 04 '12 at 10:21
  • 2
    This might solve your issue in a simple case, but be careful. If you use this solution on, for example, a JSON-encoded string with UTF-8 chars that you want unescaped, it will also unescape things that you DON'T want touched. For example, if this String is inside a piece of JSON: "\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e" – Justin Standard Jun 29 '16 at 03:25
  • str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str); as commons.lang3 is deprecated. – user8091544 Jun 11 '21 at 06:39
16

java.util.Properties

You can take advantage of the fact that java.util.Properties supports strings with \uXXXX escape sequences and do something like this:

Properties p = new Properties();
p.load(new StringReader("key = " + yourInputString));
System.out.println("Unescaped value: " + p.getProperty("key"));

Inelegant, but functional.

To handle the possible IOException, you may want a try-catch.

Properties p = new Properties();
try {
   // Properties.load() decodes the Unicode escapes while parsing the value
   p.load(new StringReader("key = " + yourInputString));
} catch (IOException e) {
   e.printStackTrace();
}
System.out.println("Unescaped value: " + p.getProperty("key"));
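
The snippet above can be wrapped into a small helper, sketched here under a few assumptions: the class and method names (`PropertiesUnescape`, `unescape`) are made up for illustration, the input is a single line, and rethrowing the `IOException` as an unchecked exception is just one possible choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesUnescape {
    // Hypothetical helper: lets Properties.load() decode the escapes in a one-line input.
    public static String unescape(String input) {
        Properties p = new Properties();
        try {
            p.load(new StringReader("key=" + input));
        } catch (IOException e) {
            // Reading from a StringReader should not normally fail; surface it if it does.
            throw new UncheckedIOException(e);
        }
        return p.getProperty("key");
    }

    public static void main(String[] args) {
        // The doubled backslash in source yields one literal backslash at runtime.
        System.out.println(unescape("Dodd\\u2013Frank"));
    }
}
```

Because `Properties.load()` treats a line break as the end of the entry, multi-line input would need to be fed through this helper one line at a time.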
quantum
drobert
  • won't handle newlines – Łukasz Sep 28 '18 at 18:41
  • As written, true, though this solution could be applied to one line at a time. – drobert Oct 02 '18 at 19:49
  • Yeah, I am just warning people as I faced that. I actually replaced new lines with some special string, converted and converted back, worked like a charm, but not perfect for production code. – Łukasz Oct 03 '18 at 09:52
  • Works. Another approach is to read in one line at a time using a `BufferedReader`, `BufferedInputStream`, or similar, and apply this algorithm to one line at a time. – drobert Jan 13 '21 at 22:15
2

try

str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str);

as org.apache.commons.lang3.StringEscapeUtils is deprecated.

user8091544
0

Suppose you have a Unicode value, such as 00B0 (the degree symbol, which resembles the superscript 'o' used in the Spanish abbreviation for 'primero').

Here is a function that does just what you want:

public static String unicodeToString(char charValue)
{
    // String.valueOf avoids the deprecated Character(char) constructor
    return String.valueOf(charValue);
}
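
The method above takes a `char`, so it covers only values up to U+FFFF. A minimal sketch, assuming the value arrives as a hexadecimal string such as `00B0` (the class and method names here are hypothetical), could go through `Character.toChars` so that larger code points work too:

```java
public class CodePointDemo {
    // Hypothetical helper: turns a hex code point such as "00B0" into a String.
    public static String codePointToString(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        // toChars returns one char for the BMP and a surrogate pair above it.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(codePointToString("00B0"));
    }
}
```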
Tony Hinkle
0

I used StringEscapeUtils.unescapeXml to unescape a string loaded from an API that returns XML.

Joy
0

UnicodeUnescaper from org.apache.commons:commons-text is also acceptable.

new UnicodeUnescaper().translate("Dodd\\u2013Frank")

anton
  • `UnicodeUnescaper().translate(...)` needs a `writer` presumably a `StringWriter` - you may as well just use `import org.apache.commons.text.StringEscapeUtils.unescapeJava` from `commons-text`. – Chris Wolf Apr 05 '23 at 23:27
-2

Perhaps the following solution, which decodes the string without any additional dependencies, will help.

This works in a Scala REPL, though it should work just as well in a Java-only solution.

import java.nio.charset.StandardCharsets
import java.nio.charset.Charset

> StandardCharsets.UTF_8.decode(Charset.forName("UTF-8").encode("Dodd\u2013Frank"))
res: java.nio.CharBuffer = Dodd–Frank
cevaris
  • 1
    Tried this, but what is actually decoding the UTF-8 character is the fact that it is given directly in the String. What your example does, is to take a UTF-8 String, encode that, decode that, and - luckily - we get the same output as the input. – Florian Heer Mar 06 '19 at 11:47
  • Curious: what is an example string that would fail to convert with this solution? – cevaris Mar 07 '19 at 15:24
  • 2
    In the source, "\u2013" is already converted to the Unicode character by the compiler. A correct representation of the problem would be "\\u2013", as the text to be converted contains the backslash and each following character individually. – Florian Heer Mar 10 '19 at 11:46
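
The point made in these comments can be checked with a short Java sketch. The class and method names (`RoundTripDemo`, `roundTrip`) are hypothetical; the sketch only shows that encoding to UTF-8 and decoding again returns the input unchanged, so a string holding a literal backslash escape stays escaped.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Encodes to UTF-8 bytes and decodes straight back; for valid text this is a no-op.
    public static String roundTrip(String s) {
        return StandardCharsets.UTF_8.decode(StandardCharsets.UTF_8.encode(s)).toString();
    }

    public static void main(String[] args) {
        String compiled = "Dodd\u2013Frank";  // the compiler already produced one en dash char
        String literal = "Dodd\\u2013Frank";  // runtime text: backslash, u, 2, 0, 1, 3

        System.out.println(compiled.length());                  // 10
        System.out.println(literal.length());                   // 15
        System.out.println(roundTrip(literal).equals(literal)); // true
    }
}
```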
-3

You can convert that ByteBuffer to a String like this:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// CharsetDecoder has no static factory method; obtain a decoder from a Charset.
public static CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();

public static String byteBufferToString(ByteBuffer buffer)
{
    String data = "";
    // position() is inherited from java.nio.Buffer
    int oldPosition = buffer.position();
    try
    {
        data = decoder.decode(buffer).toString();
    }
    catch (CharacterCodingException e)
    {
        e.printStackTrace();
        return "";
    }
    finally
    {
        // reset buffer's position to its original so it is not altered:
        buffer.position(oldPosition);
    }
    return data;
}
root
Abhishek_Mishra
  • decoder is an object of the CharsetDecoder class in the java.nio package. Sorry for not mentioning that. See the edited answer. Thanks for reminding me. :) – Abhishek_Mishra Dec 04 '12 at 10:16