Convert escaped Unicode character back to actual character

Question

I have the following value in a string variable in Java which has UTF-8 characters encoded like below

Dodd\u2013Frank

instead of

Dodd–Frank

(Assume that I don't have control over how this value is assigned to this string variable)

Now how do I convert (encode) it properly and store it back in a String variable?

I found the following code

Charset.forName("UTF-8").encode(str);

This returns a ByteBuffer, but I want a String back.

Edit:

Some additional information:

When I use System.out.println(str); I get

Dodd\u2013Frank

I am not sure what the correct terminology is (UTF-8 or Unicode). Pardon me for that.

The question is unclear to me. When you call `System.out.println(yourString);`, do you see (1) `Dodd\u2013Frank` or (2) `Dodd–Frank`? – jlordo, Dec 04 '12 at 10:06
Wrong, \u2013 is not a UTF-8 character, it is an escaped Unicode character. UTF-8 is a way of encoding Unicode characters. – SirDarius, Dec 04 '12 at 10:06
@jlordo and SirDarius I have updated the question with details. — Sudar, Dec 04 '12 at 10:08
Have a look at [StringEscapeUtils.unescapeJava()](http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringEscapeUtils.html#unescapeJava(java.lang.String)) — jlordo, Dec 04 '12 at 10:13
Check the Apache Doc: https://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringEscapeUtils.html — ΦXocę 웃 Пepeúpa ツ, Oct 19 '15 at 10:54
Just wanted to understand, why not `"Dodd\u2013Frank".chars().forEach(a -> System.out.print((char) a));` ? — Naman, Jul 10 '18 at 16:56
`org.apache.commons.lang3.StringEscapeUtils` is deprecated, but moved to `commons-text` as `import org.apache.commons.text.StringEscapeUtils` which is not deprecated. — Chris Wolf, Apr 05 '23 at 23:23

score 62 · Accepted Answer · edited Jul 19 '16 at 11:20

Try

str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);

from Apache Commons Lang.
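If adding a dependency is not an option, the `\uXXXX` part of that unescaping can be approximated with the JDK alone. The following is only a minimal sketch, not how Commons Lang implements it: the class and method names (`UnicodeEscapes`, `unescape`) are made up for illustration, and it deliberately ignores other Java escapes such as `\n` or `\\`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapes {
    // Matches a literal backslash, one or more 'u' characters, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each matched escape with the char it names.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(input, last, m.start());
            // Convert the four hex digits to a single UTF-16 code unit.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash makes this the 15-character escaped text from the question.
        System.out.println(unescape("Dodd\\u2013Frank"));
    }
}
```

Because each escape is turned into one `char`, a surrogate pair written as two consecutive escapes should come out as the intended supplementary character.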

edited Jul 19 '16 at 11:20

Mark Rotteveel


answered Dec 04 '12 at 10:16

jlordo

If `Java` itself provides the functionality of parsing the value then why should we use any third party tool ? – Bhavik Ambani Dec 04 '12 at 10:17
@BhavikAmbani Then please explain how, because your answer definitely does not. – SirDarius Dec 04 '12 at 10:19
@BhavikAmbani in your own example, try `System.out.println(string);` before calling `getBytes();` and see what happens ;) – jlordo Dec 04 '12 at 10:19
How come ? My answer solves the problem which is specified in the question asked, that convert unicode into readable string format. – Bhavik Ambani Dec 04 '12 at 10:20
@jlordo I have pasted that as well; you can check that this prints the correct output, which I have taken from the console – Bhavik Ambani Dec 04 '12 at 10:21
@BhavikAmbani nope, when he prints out his string, he sees `Dodd\u2013Frank`; when we print your string, we see `Dodd–Frank` (before any conversion). His String is `"Dodd\\u2013Frank"`, your String is `"Dodd\u2013Frank"` – jlordo Dec 04 '12 at 10:21
This might solve your issue in a simple case, but be careful. If you are trying to use this solution, for example, on a JSON encoded string with UTF8 chars that you want unescaped, it will unescape things that you DONT want touched: For example, if this String is inside a piece of JSON "\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e" – Justin Standard Jun 29 '16 at 03:25
str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str); as commons.lang3 is deprecated. – user8091544 Jun 11 '21 at 06:39

score 16 · Answer 2 · edited Jun 20 '23 at 01:40

`java.util.Properties`

You can take advantage of the fact that java.util.Properties supports strings with \uXXXX escape sequences and do something like this:

Properties p = new Properties();
p.load(new StringReader("key = " + yourInputString));
System.out.println("Unescaped value: " + p.getProperty("key"));

Inelegant, but functional.

`load` declares a checked IOException, so you will need a try-catch (or a `throws` clause):

Properties p = new Properties();
try {
    p.load(new StringReader("key = " + yourInputString));
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("Unescaped value: " + p.getProperty("key"));

edited Jun 20 '23 at 01:40

quantum


answered Jun 04 '14 at 20:27

drobert


won't handle newlines – Łukasz Sep 28 '18 at 18:41
As written, true, though this solution could be applied to one line at a time. – drobert Oct 02 '18 at 19:49
Yeah, I am just warning people as I faced that. I actually replaced new lines with some special string, converted and converted back, worked like a charm, but not perfect for production code. – Łukasz Oct 03 '18 at 09:52
Works. Another approach is to read the input one line at a time using a `BufferedReader` (or similar) and apply this algorithm to each line. – drobert Jan 13 '21 at 22:15
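The line-at-a-time idea from these comments could look roughly like the sketch below. The names `PropertiesUnescape`, `unescapeLine` and `unescapeLines` are hypothetical. It also inherits the quirks of the `Properties` format: leading whitespace on each line is dropped, a trailing backslash is treated as a line continuation, other backslash sequences are interpreted too, and a malformed `\u` escape makes `load` throw an `IllegalArgumentException`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Properties;
import java.util.stream.Collectors;

final class PropertiesUnescape {
    // Unescapes a single line by letting Properties parse it as the value of a dummy key.
    static String unescapeLine(String line) {
        Properties p = new Properties();
        try {
            p.load(new StringReader("key=" + line));
        } catch (IOException e) {
            // Not expected when reading from an in-memory StringReader.
            throw new UncheckedIOException(e);
        }
        return p.getProperty("key");
    }

    // Splits the text on any line break, unescapes each line, and re-joins with '\n'.
    static String unescapeLines(String text) {
        return Arrays.stream(text.split("\\R", -1))
                     .map(PropertiesUnescape::unescapeLine)
                     .collect(Collectors.joining("\n"));
    }
}
```

Note that the original line terminators are normalized to `\n` by this sketch, which may or may not matter for your input.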

score 2 · Answer 3 · answered Jun 11 '21 at 06:40

Try

str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str);

as org.apache.commons.lang3.StringEscapeUtils is deprecated.

answered Jun 11 '21 at 06:40

user8091544


score 0 · Answer 4 · edited Jun 30 '16 at 18:36

Suppose you have a Unicode value, such as 00B0 (degree symbol, or superscript 'o', as in Spanish abbreviation for 'primero')

Here is a function that does just what you want:

public static String unicodeToString(char charValue)
{
    // String.valueOf avoids the deprecated Character(char) constructor.
    return String.valueOf(charValue);
}
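If the value arrives as hex text (for example `00B0`) rather than as a `char`, something along these lines may help. This is only a sketch under that assumption; `CodePoints` and `fromHex` are illustrative names, not an existing API.

```java
final class CodePoints {
    // Parses hex text such as "00B0" or "1F600" and returns the character as a String.
    static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        // Character.toChars yields one char for BMP code points and a surrogate pair otherwise.
        return new String(Character.toChars(codePoint));
    }
}
```

Working with an `int` code point rather than a `char` means values above U+FFFF are handled as well.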

edited Jun 30 '16 at 18:36

Tony Hinkle


answered Jun 30 '16 at 18:31

score 0 · Answer 5 · answered Oct 26 '16 at 14:42

I used StringEscapeUtils.unescapeXml to unescape the string loaded from an API that gives XML result.

answered Oct 26 '16 at 14:42

Joy


score 0 · Answer 6 · answered Nov 04 '20 at 19:51

UnicodeUnescaper from org.apache.commons:commons-text is also acceptable.

new UnicodeUnescaper().translate("Dodd\\u2013Frank")

answered Nov 04 '20 at 19:51

anton


`UnicodeUnescaper().translate(...)` needs a `writer` presumably a `StringWriter` - you may as well just use `import org.apache.commons.text.StringEscapeUtils.unescapeJava` from `commons-text`. – Chris Wolf Apr 05 '23 at 23:27

score -2 · Answer 7 · answered Oct 24 '18 at 20:42

Perhaps try the following solution, which decodes the string without any additional dependencies.

This works in a Scala REPL, and should work just as well in a Java-only solution.

import java.nio.charset.StandardCharsets
import java.nio.charset.Charset

> StandardCharsets.UTF_8.decode(Charset.forName("UTF-8").encode("Dodd\u2013Frank"))
res: java.nio.CharBuffer = Dodd–Frank

answered Oct 24 '18 at 20:42

cevaris

Tried this, but what is actually decoding the UTF-8 character is the fact that it is given directly in the String. What your example does, is to take a UTF-8 String, encode that, decode that, and - luckily - we get the same output as the input. – Florian Heer Mar 06 '19 at 11:47
Curious: what is an example string that would fail to convert with this solution? – cevaris Mar 07 '19 at 15:24
In the source, "\u2013" is already converted to the Unicode character by the compiler. A correct representation of the problem would be "\\u2013", as the text to be converted contains the backslash and each following character individually. – Florian Heer Mar 10 '19 at 11:46
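The distinction drawn in these comments can be checked with a small JDK-only sketch. `EscapeDemo` and its members are illustrative names; the point is only that a UTF-8 encode/decode round trip returns its input unchanged, so it cannot turn the escaped text into the real character.

```java
import java.nio.charset.StandardCharsets;

final class EscapeDemo {
    // The compiler resolves the escape here, so this literal already holds an en dash.
    static final String COMPILED = "Dodd\u2013Frank";

    // The doubled backslash keeps the six escape characters as plain text, as in the question.
    static final String LITERAL = "Dodd\\u2013Frank";

    // Encodes to UTF-8 bytes and decodes them again.
    static String roundTrip(String s) {
        return StandardCharsets.UTF_8.decode(StandardCharsets.UTF_8.encode(s)).toString();
    }

    public static void main(String[] args) {
        System.out.println(COMPILED.length());                  // 10
        System.out.println(LITERAL.length());                   // 15
        System.out.println(roundTrip(LITERAL).equals(LITERAL)); // true
    }
}
```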

score -3 · Answer 8 · edited Jun 21 '13 at 18:57

You can convert that ByteBuffer to a String like this:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class ByteBufferStrings
{
    // A decoder is obtained from a Charset instance; CharsetDecoder has no static factory.
    private static final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();

    public static String byteBufferToString(ByteBuffer buffer)
    {
        // Remember the position so the caller's buffer is left unaltered.
        int oldPosition = buffer.position();
        try
        {
            return decoder.decode(buffer).toString();
        }
        catch (CharacterCodingException e)
        {
            e.printStackTrace();
            return "";
        }
        finally
        {
            // Reset the buffer's position to its original value.
            buffer.position(oldPosition);
        }
    }
}

`decoder` is an object of the CharsetDecoder class in the java.nio.charset package. Sorry for leaving that out; see the edited answer. Thanks for reminding me. :) – Abhishek_Mishra, Dec 04 '12 at 10:16
