How to convert a string with Unicode encoding to a string of letters

Question

I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example:

"\u0048\u0065\u006C\u006C\u006F World"

should become

"Hello World"

I know that when I print the first string it already shows Hello World. My problem is that I read file names from a file and then search for those files. The file names in the file are escaped with Unicode escapes, and when I search for the files I can't find them, because the search looks for a file with \uXXXX in its name.

You're sure? You don't suppose that the characters are simply getting printed as Unicode escapes? — Hot Licks, Jun 21 '12 at 19:51
`\u0048` *is* `H` -- they are one and the same. Strings in Java are in Unicode. — Hot Licks, Jun 21 '12 at 19:54
I guess the problem might be with my Java-to-Unix API: the string I get is something like \u3123\u3255_file_name.txt, and Java doesn't convert it. — SharonBL, Jun 21 '12 at 20:05
Most likely you have a problem with the code page conversion when translating Java Unicode strings to the file system character set. — Hot Licks, Jun 21 '12 at 20:51
This is not an answer to your question but let me clarify the difference between Unicode and UTF-8, which many people seem to muddle up. Unicode is a particular *one-to-one* mapping between characters as we know them (`a`, `b`, `$`, `£`, etc) to the integers. E.g., the symbol `A` is given number 65, and `\n` is 10. This has *nothing* to do with how strings or characters are represented on disk or in a text file say. UTF-8 is a specification (i.e. encoding) of how these integers (i.e. symbols) are represented as bytes (bit strings) so they can be unambiguously written and read from say a file. — DustByte, Jan 27 '16 at 09:59
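
To make the distinction in the comment above concrete, here is a minimal Java sketch. The class name `CodePointVsBytes` and the helper `utf8Hex` are made up for illustration. It prints the number Unicode assigns to a character and, separately, the bytes UTF-8 uses to store that number.

```java
import java.nio.charset.StandardCharsets;

final class CodePointVsBytes {
    // Returns the UTF-8 bytes of s as space-separated hex, e.g. "C2 A3".
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode assigns each character a number (its code point)...
        System.out.println("A".codePointAt(0));      // 65
        System.out.println("\u00A3".codePointAt(0)); // 163 (pound sign)
        // ...and UTF-8 decides which bytes represent that number.
        System.out.println(utf8Hex("A"));            // 41
        System.out.println(utf8Hex("\u00A3"));       // C2 A3
    }
}
```

The pound sign has a single code point (163) but takes two bytes in UTF-8, which is why the two concepts should not be mixed up.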

score 112 · Answer 1 · edited Jun 11 '16 at 08:00

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.

import org.apache.commons.lang.StringEscapeUtils;

@Test
public void testUnescapeJava() {
    String sJava="\\u0048\\u0065\\u006C\\u006C\\u006F";
    System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
}


 output:
 StringEscapeUtils.unescapeJava(sJava):
 Hello

edited Jun 11 '16 at 08:00

Max Leske


answered Jan 16 '13 at 21:29

Tony


String sJava="\u0048\\u0065\u006C\u006C\u006F"; -----> Please do simple change. – Shreyansh Shah Jun 20 '15 at 08:51
It appears StringEscapeUtils is now located in org.apache.commons.text.StringEscapeUtils – Austin Haws May 11 '22 at 21:57

NominSim · Accepted Answer · 2012-06-21T20:49:38.057

55

Technically doing:

String myString = "\u0048\u0065\u006C\u006C\u006F World";

automatically converts it to "Hello World", so I assume you are reading in the string from some file. In order to convert it to "Hello" you'll have to parse the text into the separate Unicode code units (take the \uXXXX and just get XXXX), then call Integer.parseInt(XXXX, 16) to get the numeric value, and then cast that to char to get the actual character.

Edit: Some code to accomplish this:

String str = myString.split(" ")[0];
str = str.replace("\\","");
String[] arr = str.split("u");
String text = "";
for(int i = 1; i < arr.length; i++){
    int hexVal = Integer.parseInt(arr[i], 16);
    text += (char)hexVal;
}
// Text will now have Hello

edited Jun 21 '12 at 20:49

answered Jun 21 '12 at 20:01

NominSim


Seems that might be the solution. Do you have an idea how can i do it in java - can i do it with String.replaceAll or something like that? – SharonBL Jun 21 '12 at 20:12
@SharonBL I updated with some code, should at least give you an idea of where to start. – NominSim Jun 21 '12 at 20:49
Thank you very much for your help! I also found another solution: String s = StringEscapeUtils.unescapeJava("\\u20ac\\n"); it does the job! – SharonBL Jun 21 '12 at 21:06
This does not work for surrogate pairs at all but is ok for ASCII or 'low' code points. Edit: now that I think about it a bit more, it will work OK with surrogate pairs too. – Scott Carey Jan 06 '17 at 21:33
attempt to reinvent methods provided by Standard Java Library. just check pure implementation https://stackoverflow.com/a/39265921/1511077 – Eugene Lebedev Mar 04 '18 at 17:31
I'm always amazed when a "**reinvent the wheel**" answer gets so many votes. – Pedro Lobito Apr 18 '18 at 10:13
Doesn't work when the string also contains characters that are not Unicode escapes, such as: href=\u0022\/en\/blog\/d-day-protecting-europe-its-demons\u0022\u003E\n – Mohsen Abasi Jul 08 '19 at 11:46
@PedroLobito that's because the linked post does absolutely nothing. `myString.getBytes("UTF8")` and then back to `String` does nothing. – rustyx Mar 06 '20 at 11:30
I agree with rustyx because `getBytes` should contain source encoding as argument, not utf-8. After that string should be created with required encoding from ByteArray. – Eugene Lebedev Mar 08 '20 at 20:51
The type `char` is only 16 bits, so it will overflow if the Unicode number is greater than 0xFFFF – light Mar 29 '23 at 05:31
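
On the surrogate-pair and 16-bit concerns raised in these comments: a \uXXXX escape carries exactly four hex digits, so each escape fits in one `char`, and characters above U+FFFF are conventionally written as two escapes (a surrogate pair). The following is a minimal sketch assuming the input follows that convention and consists only of escapes; `SurrogateDemo` and `decodeUnits` are illustrative names, not part of any answer here.

```java
final class SurrogateDemo {
    // Decodes a string made only of backslash-u escapes, one UTF-16 unit each.
    static String decodeUnits(String escaped) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 6 <= escaped.length(); i += 6) {
            // Skip the two-character prefix, then parse the four hex digits.
            sb.append((char) Integer.parseInt(escaped.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F60A written as the surrogate pair D83D DE0A.
        String s = decodeUnits("\\uD83D\\uDE0A");
        System.out.println(s.length());                            // 2 chars
        System.out.println(s.codePointCount(0, s.length()));       // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f60a
    }
}
```

Casting each unit to `char` and appending them in order reassembles the pair, so the result is a single supplementary code point.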

score 35 · Answer 3 · edited Mar 07 '20 at 09:28

You can use StringEscapeUtils from Apache Commons Lang, i.e.:

String title = StringEscapeUtils.unescapeJava("\\u0048\\u0065\\u006C\\u006C\\u006F");

edited Mar 07 '20 at 09:28

rustyx


answered Jun 20 '13 at 14:27

Pedro Lobito


After adding the dependency in build.gradle (compile 'commons-lang:commons-lang:2.6'), the above works fine. – Joseph Mekwan Dec 16 '15 at 09:11

score 10 · Answer 4 · edited Jun 17 '15 at 17:57

This simple method will work for most cases, but would trip up over something like "\u005Cu0048", which should decode to the string "\u0048" but would actually decode to "H": the first pass produces "\u0048" as the working string, which then gets processed again by the while loop.

static final String decode(final String in)
{
    String working = in;
    int index;
    index = working.indexOf("\\u");
    while(index > -1)
    {
        int length = working.length();
        if(index > (length-6))break;
        int numStart = index + 2;
        int numFinish = numStart + 4;
        String substring = working.substring(numStart, numFinish);
        int number = Integer.parseInt(substring,16);
        String stringStart = working.substring(0, index);
        String stringEnd   = working.substring(numFinish);
        working = stringStart + ((char)number) + stringEnd;
        index = working.indexOf("\\u");
    }
    return working;
}
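
One way to avoid the re-scanning problem described above is to walk the input once and append to a separate buffer, so that decoded output is never examined again. The following is a minimal sketch under that assumption; `SinglePassDecoder` and `isHex` are illustrative names, not part of the original answer, and malformed escapes are simply copied through unchanged.

```java
final class SinglePassDecoder {
    static String decode(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            // An escape needs a backslash, a 'u' and four hex digits.
            if (in.startsWith("\\u", i) && i + 6 <= in.length() && isHex(in, i + 2)) {
                out.append((char) Integer.parseInt(in.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(in.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if the four characters starting at 'from' are all hex digits.
    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u0048\\u0065\\u006C\\u006C\\u006F World")); // Hello World
        // An escaped backslash followed by "u0048" is not decoded a second time.
        System.out.println(decode("\\u005Cu0048"));
    }
}
```

Because the read position only moves forward through the original input, a backslash produced by decoding can never start a new escape.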

attempt to reinvent methods provided by Standard Java Library. just check pure implementation https://stackoverflow.com/a/39265921/1511077 — Eugene Lebedev, Mar 04 '18 at 17:32
Thanks @EvgenyLebedev ... the standard library way looks good and presumably has been thoroughly tested, much appreciated. — andrew pate, Mar 14 '18 at 11:26

score 7 · Answer 5 · answered Jan 14 '15 at 12:41

Shorter version:

public static String unescapeJava(String escaped) {
    if(escaped.indexOf("\\u")==-1)
        return escaped;

    String processed="";

    int position=escaped.indexOf("\\u");
    while(position!=-1) {
        if(position!=0)
            processed+=escaped.substring(0,position);
        String token=escaped.substring(position+2,position+6);
        escaped=escaped.substring(position+6);
        processed+=(char)Integer.parseInt(token,16);
        position=escaped.indexOf("\\u");
    }
    processed+=escaped;

    return processed;
}

attempt to reinvent methods provided by Standard Java Library. just check pure implementation https://stackoverflow.com/a/39265921/1511077 — Eugene Lebedev, Mar 04 '18 at 17:30

score 7 · Answer 6 · answered Sep 17 '21 at 08:32

With Kotlin you can write your own extension function for String

fun String.unescapeUnicode() = replace("\\\\u([0-9A-Fa-f]{4})".toRegex()) {
    String(Character.toChars(it.groupValues[1].toInt(radix = 16)))
}

and then

fun main() {
    val originalString = "\\u0048\\u0065\\u006C\\u006C\\u006F World"
    println(originalString.unescapeUnicode())
}

Bogdan Kobylynskyi · Answer 7 · 2020-09-01T20:15:42.860

6

StringEscapeUtils from org.apache.commons.lang3 library is deprecated as of 3.6.

So you can use their new commons-text library instead:

compile 'org.apache.commons:commons-text:1.9'

OR

<dependency>
   <groupId>org.apache.commons</groupId>
   <artifactId>commons-text</artifactId>
   <version>1.9</version>
</dependency>

Example code:

org.apache.commons.text.StringEscapeUtils.unescapeJava(escapedString);

edited Sep 01 '20 at 20:15

answered Aug 17 '19 at 01:55

Bogdan Kobylynskyi


score 4 · Answer 8 · answered Jun 21 '12 at 19:57

It's not totally clear from your question, but I'm assuming you're saying that you have a file where each line is a filename, and each filename looks something like this:

\u0048\u0065\u006C\u006C\u006F

In other words, the characters in the file of filenames are \, u, 0, 0, 4, 8 and so on.

If so, what you're seeing is expected. Java only translates \uXXXX sequences in string literals in source code (and when reading in stored Properties objects). When you read the contents of your file, you will have a string consisting of the characters \, u, 0, 0, 4, 8 and so on, not the string Hello.

So you will need to parse that string to extract the 0048, 0065, etc. pieces and then convert them to chars and make a string from those chars and then pass that string to the routine that opens the file.
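
The difference between a source-code literal and text read at run time can be seen in a small sketch. The class name `EscapeSourceDemo`, the helper `viaProperties` and the key `name` are assumptions made for this illustration; the escaped text is simulated with doubled backslashes rather than read from a real file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class EscapeSourceDemo {
    // Loads properties-format text and returns the value stored under "name".
    // Properties.load decodes backslash-u escapes while reading.
    static String viaProperties(String line) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(line));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("name");
    }

    public static void main(String[] args) {
        // The compiler translates the escapes in this literal: 5 chars.
        String literal = "\u0048\u0065\u006C\u006C\u006F";
        // Text as it would arrive from a file: 30 chars, backslashes included.
        String fromFile = "\\u0048\\u0065\\u006C\\u006C\\u006F";
        System.out.println(literal.length());                  // 5
        System.out.println(fromFile.length());                 // 30
        System.out.println(literal.equals(fromFile));          // false
        System.out.println(viaProperties("name=" + fromFile)); // Hello
    }
}
```

Only the compiler and the Properties loader interpret the escapes here; an ordinary read leaves all 30 characters in place, which is why the file lookup fails.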

score 4 · Answer 9 · answered Sep 04 '20 at 18:43

For Java 9+, you can use the new replaceAll method of Matcher class.

private static final Pattern UNICODE_PATTERN = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

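public static String unescapeUnicode(String escaped) {
    // quoteReplacement stops a decoded '\' or '$' from being read as replacement syntax
    return UNICODE_PATTERN.matcher(escaped).replaceAll(
            r -> Matcher.quoteReplacement(String.valueOf((char) Integer.parseInt(r.group(1), 16))));
}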

public static void main(String[] args) {
    String originalMessage = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
    String unescapedMessage = unescapeUnicode(originalMessage);
    System.out.println(unescapedMessage);
}

I believe the main advantage of this approach over unescapeJava by StringEscapeUtils (besides not using an extra library) is that you can convert only the unicode characters (if you wish), since the latter converts all escaped Java characters (like \n or \t). If you prefer to convert all escaped characters the library is really the best option.
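
The point about converting only the Unicode escapes can be illustrated with a short sketch (Java 9+). `UnicodeOnlyDemo` and `unescapeUnicodeOnly` are illustrative names; the use of `Matcher.quoteReplacement` is an addition made here so that a decoded backslash or dollar sign is treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeOnlyDemo {
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Decodes only backslash-u escapes; other sequences such as
    // backslash-n are left exactly as they appear in the input.
    static String unescapeUnicodeOnly(String escaped) {
        // quoteReplacement keeps a decoded '\' or '$' from being
        // treated as replacement syntax by replaceAll.
        return ESCAPE.matcher(escaped).replaceAll(r ->
                Matcher.quoteReplacement(
                        String.valueOf((char) Integer.parseInt(r.group(1), 16))));
    }

    public static void main(String[] args) {
        String in = "\\u0048i\\n";
        String out = unescapeUnicodeOnly(in);
        System.out.println(out);          // Hi\n  (backslash and 'n' kept as two chars)
        System.out.println(out.length()); // 4
    }
}
```

A full unescaper such as unescapeJava would turn the trailing two characters into a newline, whereas this sketch leaves them untouched.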

Ori Marko · Answer 10 · 2020-03-04T08:41:15.327

3

An update regarding the answers that suggest Apache Commons Lang's StringEscapeUtils.unescapeJava(): it has been deprecated:

Deprecated. as of 3.6, use commons-text StringEscapeUtils instead

The replacement is Apache Commons Text's StringEscapeUtils.unescapeJava()

edited Mar 04 '20 at 08:41

answered Nov 14 '18 at 09:50

Ori Marko


robertokl · Answer 11 · 2020-03-09T15:58:06.100

3

Just wanted to contribute my version, using regex:

private static final String UNICODE_REGEX = "\\\\u([0-9A-Fa-f]{4})";
private static final Pattern UNICODE_PATTERN = Pattern.compile(UNICODE_REGEX);
...
// Backslashes are doubled so the literal holds the escaped text, as if read from a file
String message = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
Matcher matcher = UNICODE_PATTERN.matcher(message);
StringBuffer decodedMessage = new StringBuffer();
while (matcher.find()) {
  // quoteReplacement stops a decoded '\' or '$' from being read as replacement syntax
  matcher.appendReplacement(
      decodedMessage,
      Matcher.quoteReplacement(String.valueOf((char) Integer.parseInt(matcher.group(1), 16))));
}
matcher.appendTail(decodedMessage);
System.out.println(decodedMessage.toString());

edited Mar 09 '20 at 15:58

answered Oct 18 '19 at 16:42

robertokl


The regex is not complete. The regex referenced by Marcelo for the Java 9+ solution will work in this case for Java 8 `"\\\\u([0-9A-Fa-f]{4})"` – Jabrwoky Jul 22 '22 at 21:17

score 2 · Answer 12 · answered Aug 16 '19 at 02:05

I wrote a performant and error-tolerant solution:

public static final String decode(final String in) {
    int p1 = in.indexOf("\\u");
    if (p1 < 0)
        return in;
    // Copy everything before the first escape unchanged
    StringBuilder sb = new StringBuilder(in.length()).append(in, 0, p1);
    while (true) {
        int p2 = p1 + 6;
        if (p2 > in.length()) {
            sb.append(in.subSequence(p1, in.length()));
            break;
        }
        try {
            int c = Integer.parseInt(in.substring(p1 + 2, p1 + 6), 16);
            sb.append((char) c);
            p1 += 6;
        } catch (Exception e) {
            sb.append(in.subSequence(p1, p1 + 2));
            p1 += 2;
        }
        int p0 = in.indexOf("\\u", p1);
        if (p0 < 0) {
            sb.append(in.subSequence(p1, in.length()));
            break;
        } else {
            sb.append(in.subSequence(p1, p0));
            p1 = p0;
        }
    }
    return sb.toString();
}

score 2 · Answer 13 · edited Jul 27 '22 at 15:51

UnicodeUnescaper from Apache Commons Text does exactly what you want, and ignores any other escape sequences.

String input = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
String output = new UnicodeUnescaper().translate(input);
assert("Hello World".equals(output));
assert("\u0048\u0065\u006C\u006C\u006F World".equals(output));

Where input would be the string you are reading from a file.

edited Jul 27 '22 at 15:51

OrangeDog


answered Nov 04 '20 at 19:48

anton


score 1 · Answer 14 · edited Jun 11 '17 at 04:53

One easy way I know is to use JSONObject:

try {
    JSONObject json = new JSONObject();
    json.put("string", myString);
    String converted = json.getString("string");

} catch (JSONException e) {
    e.printStackTrace();
}

edited Jun 11 '17 at 04:53

Wasi Ahmad


answered Nov 21 '15 at 21:12

Ashkan Ghodrat


score 1 · Answer 15 · answered Dec 18 '19 at 12:57

Fast

 fun unicodeDecode(unicode: String): String {
        val stringBuffer = StringBuilder()
        var i = 0
        while (i < unicode.length) {
            if (i + 1 < unicode.length)
                if (unicode[i].toString() + unicode[i + 1].toString() == "\\u") {
                    val symbol = unicode.substring(i + 2, i + 6)
                    val c = Integer.parseInt(symbol, 16)
                    stringBuffer.append(c.toChar())
                    i += 5
                } else stringBuffer.append(unicode[i])
            i++
        }
        return stringBuffer.toString()
    }

score 0 · Answer 16 · answered May 28 '14 at 21:03

try

private static final Charset UTF_8 = Charset.forName("UTF-8");
private String forceUtf8Coding(String input) {return new String(input.getBytes(UTF_8), UTF_8);}

answered May 28 '14 at 21:03

Hao


score 0 · Answer 17 · answered May 22 '17 at 11:22

Actually, I wrote an open-source library that contains some utilities. One of them converts a Unicode sequence to a String and vice versa. I found it very useful. Here is a quote about the Unicode converter from the article about this library:

Class StringUnicodeEncoderDecoder has methods that can convert a String (in any language) into a sequence of Unicode characters and vice versa. For example, the String "Hello World" will be converted into

"\u0048\u0065\u006c\u006c\u006f\u0020 \u0057\u006f\u0072\u006c\u0064"

and may be restored back.

Here is the link to the entire article, which explains what utilities the library has and how to get the library: "Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison". It is available as a Maven artifact or as source from GitHub. It is very easy to use.

score 0 · Answer 18 · answered Sep 14 '17 at 01:27

Here is my solution...

                String decodedName = JwtJson.substring(startOfName, endOfName);

                StringBuilder builtName = new StringBuilder();

                int i = 0;

                while ( i < decodedName.length() )
                {
                    if ( decodedName.substring(i).startsWith("\\u"))
                    {
                        i=i+2;
                        builtName.append(Character.toChars(Integer.parseInt(decodedName.substring(i,i+4), 16)));
                        i=i+4;
                    }
                    else
                    {
                        builtName.append(decodedName.charAt(i));
                        i = i+1;
                    }
                };

attempt to reinvent standard methods provided by Standard Java Library. just check pure implementation https://stackoverflow.com/a/39265921/1511077 — Eugene Lebedev, Mar 04 '18 at 17:30

score 0 · Answer 19 · answered Jan 21 '19 at 08:14

I found that many of the answers did not address the issue of "Supplementary Characters". Here is the correct way to support it. No third-party libraries, pure Java implementation.

http://www.oracle.com/us/technologies/java/supplementary-142654.html

public static String fromUnicode(String unicode) {
    String str = unicode.replace("\\", "");
    String[] arr = str.split("u");
    StringBuffer text = new StringBuffer();
    for (int i = 1; i < arr.length; i++) {
        int hexVal = Integer.parseInt(arr[i], 16);
        text.append(Character.toChars(hexVal));
    }
    return text.toString();
}

public static String toUnicode(String text) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < text.length(); i++) {
        int codePoint = text.codePointAt(i);
        // Skip over the second char in a surrogate pair
        if (codePoint > 0xffff) {
            i++;
        }
        String hex = Integer.toHexString(codePoint);
        sb.append("\\u");
        for (int j = 0; j < 4 - hex.length(); j++) {
            sb.append("0");
        }
        sb.append(hex);
    }
    return sb.toString();
}

@Test
public void toUnicode() {
    System.out.println(toUnicode("😊"));
    System.out.println(toUnicode("🥰"));
    System.out.println(toUnicode("Hello World"));
}
// output:
// \u1f60a
// \u1f970
// \u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064

@Test
public void fromUnicode() {
    System.out.println(fromUnicode("\\u1f60a"));
    System.out.println(fromUnicode("\\u1f970"));
    System.out.println(fromUnicode("\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u0057\\u006f\\u0072\\u006c\\u0064"));
}
// output:
// 😊
// 🥰
// Hello World
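
Note that the output \u1f60a above has five hex digits, which a standard \uXXXX decoder (including the Java compiler) would read as \u1f60 followed by the letter a. If round-tripping with such decoders matters, one hedged alternative is to escape each UTF-16 unit separately, so that a supplementary character becomes two escapes. `Utf16Escaper` and `escape` are illustrative names, not part of this answer.

```java
final class Utf16Escaper {
    // Escapes every UTF-16 unit as backslash-u plus four hex digits, so a
    // supplementary character is written as a surrogate pair of escapes.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Hi"));
        // U+1F60A is built from its code point to avoid a literal emoji here.
        System.out.println(escape(new String(Character.toChars(0x1F60A))));
    }
}
```

The second call prints two four-digit escapes (the high and low surrogate), which any per-unit decoder can turn back into the original character.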

Doesn't work when the string also contains characters that are not Unicode escapes, such as: href=\u0022\/en\/blog\/d-day-protecting-europe-its-demons\u0022\u003E\n — Mohsen Abasi, Jul 08 '19 at 11:46

score 0 · Answer 20 · answered Sep 14 '20 at 11:01

@NominSim There may be other characters after an escape, so I detect them by length.

private String forceUtf8Coding(String str) {
    str = str.replace("\\","");
    String[] arr = str.split("u");
    StringBuilder text = new StringBuilder();
    for(int i = 1; i < arr.length; i++){
        String a = arr[i];
        String b = "";
        if (arr[i].length() > 4){
            a = arr[i].substring(0, 4);
            b = arr[i].substring(4);
        }
        int hexVal = Integer.parseInt(a, 16);
        text.append((char) hexVal).append(b);
    }
    return text.toString();
}

score 0 · Answer 21 · answered Mar 29 '23 at 05:58

To do this, there is no need to depend on a third-party library. Just use the built-in Java library.

Assume we have the Unicode code point '1F914'.

First we parse the hex string into an int using Integer.parseInt.
Then we pass that int to Character.toChars(), which returns a char array.

Java uses UTF-16 to encode a String. If a character's code point needs more than 16 bits, it is represented by a char array holding two chars (a surrogate pair).

After that, we create a new String from the char array. Finally we get an emoji: "🤔"

new String(Character.toChars(Integer.parseInt(unicode, 16)));
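
Wrapped in a small runnable sketch, the one-liner above looks like this. `CodePointDemo` and `fromHexCodePoint` are illustrative names; the input is assumed to be a bare hexadecimal code point without any \u prefix.

```java
final class CodePointDemo {
    // Builds a String from a hexadecimal code point such as "1F914".
    static String fromHexCodePoint(String hex) {
        return new String(Character.toChars(Integer.parseInt(hex, 16)));
    }

    public static void main(String[] args) {
        String s = fromHexCodePoint("1F914");
        System.out.println(s.length());                      // 2 (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length())); // 1
        System.out.println(fromHexCodePoint("48"));          // H
    }
}
```

Character.toChars returns one char for code points up to U+FFFF and two chars above that, so the same call covers both cases.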

score -1 · Answer 22 · answered Jul 10 '18 at 16:22

An alternative is to use chars(), introduced with Java 9, to iterate over the characters; any char that maps to a surrogate code point is passed through uninterpreted. It can be used as follows:

String myString = "\u0048\u0065\u006C\u006C\u006F World";
myString.chars().forEach(a -> System.out.print((char)a));
// would print "Hello World"

answered Jul 10 '18 at 16:22

Naman


So would just `System.out.println("\u0048\u0065\u006C\u006C\u006F World")` – OrangeDog Jul 27 '22 at 15:46

Eugene Lebedev · Answer 23 · 2020-03-08T20:53:56.003

-2

Solution for Kotlin:

val sourceContent = File("test.txt").readText(Charset.forName("windows-1251"))
val result = String(sourceContent.toByteArray())

Kotlin uses UTF-8 everywhere as default encoding.

Method toByteArray() has default argument - Charsets.UTF_8.

edited Mar 08 '20 at 20:53

answered Mar 04 '18 at 17:02

Eugene Lebedev


`String(string.toByteArray())` achieves literally nothing. – rustyx Mar 07 '20 at 09:25
@rustyx Method `toByteArray()` has default argument with `Charsets.UTF_8`. Then you create a string from bytearray with required encoding. I did test today with `windows-1251` to utf-8, it works. Also i did comparison at byte level :) – Eugene Lebedev Mar 08 '20 at 20:45
@rustyx here is a gist for you - https://gist.github.com/lebe-dev/31e31a3399c7885e298ed86810504676 – Eugene Lebedev Mar 08 '20 at 20:48
