2

I have a string that contains HTML entities, like the one below:

&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.

I want to convert this String to Unicode. Expected output:

Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.

I found a solution in "Convert Decimal NCRs Code into UTF-8 in java (JSP)", but it only works when every encoded character is a decimal reference beginning with &#.

For the characters of the form &xxxx;, the page "HTML encoding of foreign language characters" shows that they are named HTML entities. My input string is a mix of named HTML entities and decimal HTML entities.

Does anyone have any suggestions? Ideally the solution would not require any additional libraries.

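[UPDATE] I solved my problem by using the Apache Commons Lang library:

// import org.apache.commons.lang3.StringEscapeUtils;
String encodeString = "&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.";
String unEncodeString = StringEscapeUtils.unescapeHtml4(encodeString);
System.out.println("OUTPUT : " + unEncodeString);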

=====> OUTPUT : Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.

ThaiPD
  • Thanks @AnubianNoob, your suggestion solved my problem, but I would also like to solve it using only the Java standard library. With the suggestion in http://stackoverflow.com/questions/20799512/convert-decimal-ncrs-code-into-utf-8-in-java-jsp?answertab=active#tab-top I can only convert strings whose entities have the prefix "&#". Could you help? Thanks a lot! – ThaiPD Jan 06 '15 at 03:21

4 Answers

2

Use StringEscapeUtils.unescapeHtml(string) from Apache Commons Lang for this (in Commons Lang 3 the method is named unescapeHtml4).

Refer: Java: How to unescape HTML character entities in Java?
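If adding a dependency is not an option, the same idea can be sketched with the standard library alone. The class below is a minimal sketch, not a replacement for Commons Lang: the name HtmlEntityDecoder and its small NAMED table are hypothetical, and the table covers only the named entities that occur in the sample sentence plus a few common ones. Decimal and hexadecimal references are decoded generically; an unknown name is left untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name and the entity table are assumptions for illustration.
class HtmlEntityDecoder {

    // Deliberately tiny table: a complete HTML 4 table has roughly 250 entries.
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("ETH", 0x00D0);
        NAMED.put("ecirc", 0x00EA);
        NAMED.put("acirc", 0x00E2);
        NAMED.put("oacute", 0x00F3);
        NAMED.put("amp", 0x26);
        NAMED.put("lt", 0x3C);
        NAMED.put("gt", 0x3E);
        NAMED.put("quot", 0x22);
    }

    // Group 1: decimal digits, group 2: hex digits, group 3: entity name.
    private static final Pattern ENTITY = Pattern.compile(
            "&(?:#(\\d+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));");

    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            Integer codePoint = null;
            try {
                if (m.group(1) != null) {
                    codePoint = Integer.parseInt(m.group(1));
                } else if (m.group(2) != null) {
                    codePoint = Integer.parseInt(m.group(2), 16);
                } else {
                    codePoint = NAMED.get(m.group(3));
                }
            } catch (NumberFormatException e) {
                // Number too large for an int: fall through and keep the raw text.
            }
            if (codePoint != null && Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(m.group()); // unknown or invalid reference stays as is
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String encoded = "&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c "
                + "v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.";
        System.out.println(decode(encoded));
    }
}
```

Because the input is scanned in a single pass, a doubly escaped reference such as &amp;lt; is decoded only once, to &lt;.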

bluearrow
0
Maven:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

/**
 * Encodes every character of the text as a decimal NCR (text to &#NNNN;).
 * https://stackoverflow.com/a/6766497/8356718
 */
public static String toDecimal(String text) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
        int codePoint = text.codePointAt(i);
        // Skip over the second char in a surrogate pair
        if (codePoint > 0xffff) {
            i++;
        }
        sb.append("&#").append(codePoint).append(';');
    }
    return sb.toString();
}

// Parses the HTML with pretty-printing off, so whitespace is not reflowed.
public static Document getNoPrettyDoc(String html) {
    Document doc = Jsoup.parse(html);
    doc.outputSettings().prettyPrint(false);
    return doc;
}

public static String toDecimalHtml(String html) {
    Document doc = getNoPrettyDoc(html);
    toDecimalHtml(doc);
    // jsoup escapes the '&' of each generated NCR as "&amp;"; undo that.
    return doc.body().html().trim().replace("&amp;", "&");
}

// Walks the tree and rewrites the content of every text node in place.
private static void toDecimalHtml(Node node) {
    for (int i = 0; i < node.childNodes().size(); i++) {
        Node child = node.childNode(i);
        if (child.nodeName().equals("#text")) {
            TextNode text = (TextNode) child;
            text.text(toDecimal(text.getWholeText()));
        } else if (child.childNodes().size() > 0) {
            toDecimalHtml(child);
        }
    }
}

You may need to strip \n, \r and \t from the input first.
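Note that toDecimal goes from text to decimal NCRs and encodes every character, ASCII letters included. As a hedged variant, the sketch below leaves ASCII untouched and encodes only code points above 127. The names NcrEncoder and toDecimalNcr are hypothetical, and the code uses only the standard library.

```java
// Hypothetical helper; not part of jsoup or of the code above.
class NcrEncoder {

    // Encodes each non-ASCII code point as a decimal NCR; ASCII passes through unchanged.
    static String toDecimalNcr(String text) {
        StringBuilder sb = new StringBuilder();
        // codePoints() yields whole code points, so surrogate pairs need no manual skipping.
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDecimalNcr("caf\u00E9")); // prints caf&#233;
    }
}
```

Since a literal & in the input is passed through as is, the output is only unambiguous if the input contains no text that already looks like an entity.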

0

You might need to try this for encoding and decoding.

For encoding

URLEncoder.encode("<#> Test", "UTF-8").replace("+", "%20");

For decoding

URLDecoder.decode("%3C%23%3E%20Test", "UTF-8");
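A small self-contained check of the two calls, as a sketch: the class name UrlCodecDemo and its helper methods are hypothetical. It also shows that URL decoding only touches percent escapes and +, so a string of HTML entities comes back unchanged.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class wrapping the two standard-library calls.
class UrlCodecDemo {

    static String encode(String s) {
        try {
            // URLEncoder turns a space into '+'; replace it to get %20 instead.
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("<#> Test"));         // prints %3C%23%3E%20Test
        System.out.println(decode("%3C%23%3E%20Test")); // prints <#> Test
        System.out.println(decode("&ETH;&#7897;t"));    // prints &ETH;&#7897;t (unchanged)
    }
}
```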
Faran Tariq
-2

In a Java string literal, you write a Unicode character as \u followed by its four-digit hexadecimal code.

For example:

System.out.println("\u0042");
System.out.println("\u00AF\\_(\u30C4)_/\u00AF");

Prints:

B
¯\_(ツ)_/¯

What you want is:

System.out.println("\u00D0\u1ED9t nhi\u00EAn, \u1EDF g\u1ED1c T\u00E2y B\u1EAFc v\u0103ng v\u1EB3ng c\u00F3 ti\u1EBFng v\u00F3 ng\u1EF1a d\u1ED3n d\u1EADp.\n");

Prints:

Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.

EDIT: Apache Commons is the best way to go:

StringEscapeUtils.unescapeHtml4()

Anubian Noob
  • Thank you for your answer, but what I mean is: how can I convert the string "&ETH;&#7897;t" to the string "Đột"? I have an existing input and I want to get the output shown above. Could you please help further? – ThaiPD Jan 06 '15 at 03:05
  • Is there any way to do it without the Apache library? I want to fix it without an add-on library... – ThaiPD Jan 06 '15 at 03:24