How to convert a string with HTML encoding to Unicode in Java

Question

I have a string with HTML encoding like below:

&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.

I want to convert this String to Unicode. Expected output:

Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.

I found a solution in "Convert Decimal NCRs Code into UTF-8 in java (JSP)", but it only works when every encoded character uses the decimal form that begins with &#.

Some characters in my input use the named form instead (for example &ecirc;). According to the page "HTML encoding of foreign language characters", these are also HTML encodings, so my input string is a mix of named HTML entities and decimal HTML entities.

Does anyone have a suggestion? Ideally the solution would not require any additional libraries.

[UPDATE] I solved my problem by using the Apache Commons library:

String encodeString = "&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.";
String unEncodeString = StringEscapeUtils.unescapeHtml4(encodeString);
System.out.println("OUTPUT : " + unEncodeString);

=====> OUTPUT : Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.

Thanks @AnubianNoob, I solved my problem with your suggestion, but I would also like to solve it using only the Java standard library. With the suggestion in http://stackoverflow.com/questions/20799512/convert-decimal-ncrs-code-into-utf-8-in-java-jsp?answertab=active#tab-top I can only convert strings whose entities start with the prefix "&#". Could you help? Thanks a lot! – ThaiPD, Jan 06 '15 at 03:21

score 2 · Answer 1 · answered Apr 26 '18 at 02:43

Use Apache Commons StringEscapeUtils.unescapeHtml(string) for this (in Commons Lang 3 the method is named unescapeHtml4).

Refer: Java: How to unescape HTML character entities in Java?

bluearrow

score 0 · Answer 2 · 2018-02-07T03:05:13.283

maven:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>    

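import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

/**
 * Encodes every character of the text as a decimal numeric character reference.
 * https://stackoverflow.com/a/6766497/8356718
 */
public static String toDecimal(String text) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
        int codePoint = text.codePointAt(i);
        sb.append("&#").append(codePoint).append(';');
        // Advance by two chars when the code point is a surrogate pair
        i += Character.charCount(codePoint);
    }
    return sb.toString();
}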

public static Document getNoPrettyDoc(String html) {
    Document doc = Jsoup.parse(html);
    doc.outputSettings().prettyPrint(false);
    return doc;
}

public static String toDecimalHtml(String html) {
    Document doc = getNoPrettyDoc(html);
    toDecimalHtml(doc);
    return doc.body().html().trim().replace("&amp;", "&");
}

private static void toDecimalHtml(Node node) {
    for (int i = 0; i < node.childNodes().size(); ) {
        Node child = node.childNode(i);
        if (child.nodeName().equals("#text")) {
            TextNode text = (TextNode) child;
            String str = text.getWholeText();
            text.text(toDecimal(str));
            if (child.childNodes().size() <= 0) {
                i++;
            }
        } else {
            if (child.childNodes().size() > 0) {
                toDecimalHtml(child);
            }
            i++;
        }
    }
}

You may need to remove \n, \r and \t characters first.
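Note that toDecimal above goes in the opposite direction from what the question asks: it turns characters into decimal references rather than decoding them. For plain strings (no HTML tree involved), the same encoding step can be sketched with the standard library alone, assuming Java 8 or later. The class and method names here are only illustrative:

```java
import java.util.stream.Collectors;

class NcrEncoder {
    // Encodes each code point of the text as a decimal numeric character reference.
    // codePoints() already treats a surrogate pair as a single code point.
    static String toDecimalNcr(String text) {
        return text.codePoints()
                   .mapToObj(cp -> "&#" + cp + ";")
                   .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(toDecimalNcr("\u1ED9t")); // prints &#7897;&#116;
    }
}
```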

score 0 · Answer 3 · answered Jun 18 '19 at 05:44

You might need to try this for encoding and decoding.

For encoding

URLEncoder.encode("<#> Test", "UTF-8").replace("+", "%20");

For decoding

URLDecoder.decode("%3C%23%3E%20Test", "UTF-8");
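Keep in mind that these two calls handle URL percent-encoding, which is a different scheme from HTML entities. A small sketch (class and method names are hypothetical) that round-trips the sample above and shows that URLDecoder leaves HTML entities untouched:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlCodecDemo {
    // Percent-encodes the text, using %20 instead of + for spaces.
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Reverses percent-encoding; it knows nothing about &name; or &#number;.
    static String urlDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(urlEncode("<#> Test"));         // prints %3C%23%3E%20Test
        System.out.println(urlDecode("%3C%23%3E%20Test")); // prints <#> Test
        System.out.println(urlDecode("nhi&ecirc;n"));      // prints nhi&ecirc;n (unchanged)
    }
}
```

So this approach does not decode the string from the question; it only applies when the input is percent-encoded.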

Faran Tariq

Anubian Noob · Accepted Answer · 2015-01-06T03:10:56.983

-2

In a Java string literal, you write a Unicode character as \u followed by its four-digit hexadecimal code.

For example:

System.out.println("\u0042");
System.out.println("\u00AF\\_(\u30C4)_/\u00AF");

Prints:

B
¯\_(ツ)_/¯

What you want is:

System.out.println("\u00D0\u1ED9t nhi\u00EAn, \u1EDF g\u1ED1c T\u00E2y B\u1EAFc v\u0103ng v\u1EB3ng c\u00F3 ti\u1EBFng v\u00F3 ng\u1EF1a d\u1ED3n d\u1EADp.\n");

Prints:

Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.
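To relate these escapes to the input in the question: the number inside a decimal reference such as &#7897; is the same code point as the hexadecimal digits in the escape, just written in base 10. A minimal sketch (the class and method names are only for illustration):

```java
class CodePointDemo {
    // Builds a one-character string from a decimal code point.
    static String fromDecimal(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // 7897 is the number found in the reference &#7897;
        System.out.println(Integer.toHexString(7897));          // prints 1ed9
        System.out.println(fromDecimal(7897).equals("\u1ED9")); // prints true
    }
}
```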

EDIT: Apache Commons is the best way to go:

StringEscapeUtils.unescapeHtml4(String)
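If adding a dependency is not an option, something along these lines might work with only the standard library. This is a rough sketch, not a replacement for unescapeHtml4: the class name, method names and the NAMED table are my own, and the table deliberately covers only the named entities in the sample string plus a few common ones. A complete decoder would need the full HTML entity list.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlEntityDecoder {
    // Hypothetical, intentionally incomplete table of named entities.
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("ETH", 0x00D0);
        NAMED.put("ecirc", 0x00EA);
        NAMED.put("acirc", 0x00E2);
        NAMED.put("oacute", 0x00F3);
        NAMED.put("amp", 0x0026);
        NAMED.put("lt", 0x003C);
        NAMED.put("gt", 0x003E);
        NAMED.put("quot", 0x0022);
    }

    // Matches &#xHEX; or &#DECIMAL; or &name;
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer(); // appendReplacement needs StringBuffer before Java 9
        while (m.find()) {
            Integer cp = toCodePoint(m.group(1));
            // Unknown or invalid entities are copied through unchanged.
            String replacement = (cp == null) ? m.group() : new String(Character.toChars(cp));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the code point for an entity body, or null if it cannot be resolved.
    private static Integer toCodePoint(String body) {
        try {
            int cp;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                cp = Integer.parseInt(body.substring(2), 16);
            } else if (body.startsWith("#")) {
                cp = Integer.parseInt(body.substring(1), 10);
            } else {
                return NAMED.get(body);
            }
            return Character.isValidCodePoint(cp) ? cp : null;
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
    }

    public static void main(String[] args) {
        String encoded = "&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c "
                + "v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.";
        System.out.println(decode(encoded));
    }
}
```

Named and numeric references go through the same regex pass, so a mixed string like the one in the question is handled in one call. Anything the table does not know is left as it was in the input.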

edited Jan 06 '15 at 03:10

answered Jan 06 '15 at 02:58

Anubian Noob

Thank you for your answer, but what I mean is: how can I convert the string "Ðột" to the string "Đột"? I have an existing input and I want to get the output shown above. Could you please help further? – ThaiPD Jan 06 '15 at 03:05
Is there any way to do it without the Apache library? I want to fix it without an add-on library... – ThaiPD Jan 06 '15 at 03:24
