How can I unescape HTML character entities in Java?

Question

Basically, I would like to decode a given HTML document, and replace all special characters, such as "&nbsp;" → " " and "&gt;" → ">".

In .NET, we can make use of the HttpUtility.HtmlDecode method.

What's the equivalent function in Java?

&nbsp; is called a character entity. Edited the title. – Eugene Yokota Jun 15 '09 at 02:46

score 221 · Accepted Answer · edited Aug 30 '19 at 09:48

I have used the Apache Commons StringEscapeUtils.unescapeHtml4() for this:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.

edited Aug 30 '19 at 09:48 by Vivien

answered Jun 15 '09 at 02:43 by Kevin Hakanson

Sadly I just realized today that it does not decode HTML special characters very well :( – Sid Oct 13 '10 at 20:04
a dirty trick is to store the value initially in a hidden field to escape it, then the target field should get the value from the hidden field. – setzamora Jun 16 '11 at 05:19
Class StringEscapeUtils is deprecated and moved to [Apache commons-text](https://commons.apache.org/proper/commons-text/) – Pauli Dec 03 '18 at 22:16

I want to convert the string `üè` to `üé`, with `StringEscapeUtils.unescapeHtml4()` I get `<p>üè</p>`. Is there a way to keep existing html tags intact? – Nickkk Jan 13 '20 at 12:10
If I have something like `` which escapes to a quotation mark in Windows-1252 but some control character in Unicode, can the escaping encoding be changed? – ifly6 Dec 11 '20 at 13:21
The link is half-broken (page anchor). – Peter Mortensen May 03 '23 at 13:39

score 67 · Answer 2 · edited May 03 '23 at 13:33

The libraries mentioned in other answers would be fine solutions, but if you already happen to be digging through real-world HTML content in your project, the Jsoup project has a lot more to offer than just managing "ampersand pound FFFF semicolon" things.

// textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
// becomes this: This is a sample. "Granny" Smith –.
// with one line of code:
// Jsoup.parse(textValue).getText(); // for older versions of Jsoup
Jsoup.parse(textValue).text();

// Another possibility may be the static unescapeEntities method:
boolean strictMode = true;
String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, strictMode);

And you also get the convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It's open source under the MIT License.

edited May 03 '23 at 13:33 by Peter Mortensen

answered May 17 '16 at 13:25 by Dale

upvote+, but I should point out that newer versions of Jsoup use `.text()` instead of `.getText()` – SourceVisor Nov 10 '16 at 16:25
Perhaps more direct is to use `org.jsoup.parser.Parser.unescapeEntities(String string, boolean inAttribute)`. API docs: https://jsoup.org/apidocs/org/jsoup/parser/Parser.html#unescapeEntities-java.lang.String-boolean- – danneu Dec 01 '16 at 21:17
This was perfect, since I'm already using Jsoup in my project. Also, @danneu was right - Parser.unescapeEntities works exactly as advertised. – MandisaW Aug 29 '17 at 17:23

score 46 · Answer 3 · edited May 03 '23 at 14:16

I tried Apache Commons' StringEscapeUtils.unescapeHtml3() in my project, but I wasn't satisfied with its performance. It turns out, it does a lot of unnecessary operations. For one, it allocates a StringWriter for every call, even if there's nothing to unescape in the string. I've rewritten that code differently, and now it works much faster.

The following code unescapes all HTML 3 symbols and numeric escapes (equivalent to Apache unescapeHtml3). You can just add more entries to the map if you need HTML 4.

package com.example;

import java.io.StringWriter;
import java.util.HashMap;

public class StringUtils {

    public static final String unescapeHtml3(final String input) {
        StringWriter writer = null;
        int len = input.length();
        int i = 1;
        int st = 0;
        while (true) {
            // Look for '&'
            while (i < len && input.charAt(i-1) != '&')
                i++;
            if (i >= len)
                break;

            // Found '&', look for ';'
            int j = i;
            while (j < len && j < i + MAX_ESCAPE + 1 && input.charAt(j) != ';')
                j++;
            if (j == len || j < i + MIN_ESCAPE || j == i + MAX_ESCAPE + 1) {
                i++;
                continue;
            }

            // Found escape
            if (input.charAt(i) == '#') {
                // Numeric escape
                int k = i + 1;
                int radix = 10;

                final char firstChar = input.charAt(k);
                if (firstChar == 'x' || firstChar == 'X') {
                    k++;
                    radix = 16;
                }

                try {
                    int entityValue = Integer.parseInt(input.substring(k, j), radix);

                    if (writer == null)
                        writer = new StringWriter(input.length());
                    writer.append(input.substring(st, i - 1));

                    if (entityValue > 0xFFFF) {
                        final char[] chrs = Character.toChars(entityValue);
                        writer.write(chrs[0]);
                        writer.write(chrs[1]);
                    } else {
                        writer.write(entityValue);
                    }

                } catch (NumberFormatException ex) {
                    i++;
                    continue;
                }
            }
            else {
                // Named escape
                CharSequence value = lookupMap.get(input.substring(i, j));
                if (value == null) {
                    i++;
                    continue;
                }

                if (writer == null)
                    writer = new StringWriter(input.length());
                writer.append(input.substring(st, i - 1));

                writer.append(value);
            }

            // Skip escape
            st = j + 1;
            i = st;
        }

        if (writer != null) {
            writer.append(input.substring(st, len));
            return writer.toString();
        }
        return input;
    }

    private static final String[][] ESCAPES = {
        {"\"",     "quot"}, // " - double-quote
        {"&",      "amp"}, // & - ampersand
        {"<",      "lt"}, // < - less-than
        {">",      "gt"}, // > - greater-than

        // Mapping to escape ISO-8859-1 characters to their named HTML 3.x equivalents.
        {"\u00A0", "nbsp"},   // Non-breaking space
        {"\u00A1", "iexcl"},  // Inverted exclamation mark
        {"\u00A2", "cent"},   // Cent sign
        {"\u00A3", "pound"},  // Pound sign
        {"\u00A4", "curren"}, // Currency sign
        {"\u00A5", "yen"},    // Yen sign = yuan sign
        {"\u00A6", "brvbar"}, // Broken bar = broken vertical bar
        {"\u00A7", "sect"},   // Section sign
        {"\u00A8", "uml"},    // Diaeresis = spacing diaeresis
        {"\u00A9", "copy"},   // © - copyright sign
        {"\u00AA", "ordf"},   // Feminine ordinal indicator
        {"\u00AB", "laquo"},  // Left-pointing double angle quotation mark = left pointing guillemet
        {"\u00AC", "not"},    // Not sign
        {"\u00AD", "shy"},    // Soft hyphen = discretionary hyphen
        {"\u00AE", "reg"},    // ® - registered trademark sign
        {"\u00AF", "macr"},   // Macron = spacing macron = overline = APL overbar
        {"\u00B0", "deg"},    // Degree sign
        {"\u00B1", "plusmn"}, // Plus-minus sign = plus-or-minus sign
        {"\u00B2", "sup2"},   // Superscript two = superscript digit two = squared
        {"\u00B3", "sup3"},   // Superscript three = superscript digit three = cubed
        {"\u00B4", "acute"},  // Acute accent = spacing acute
        {"\u00B5", "micro"},  // Micro sign
        {"\u00B6", "para"},   // Pilcrow sign = paragraph sign
        {"\u00B7", "middot"}, // Middle dot = Georgian comma = Greek middle dot
        {"\u00B8", "cedil"},  // Cedilla = spacing cedilla
        {"\u00B9", "sup1"},   // Superscript one = superscript digit one
        {"\u00BA", "ordm"},   // Masculine ordinal indicator
        {"\u00BB", "raquo"},  // Right-pointing double angle quotation mark = right pointing guillemet
        {"\u00BC", "frac14"}, // Vulgar fraction one quarter = fraction one quarter
        {"\u00BD", "frac12"}, // Vulgar fraction one half = fraction one half
        {"\u00BE", "frac34"}, // Vulgar fraction three quarters = fraction three quarters
        {"\u00BF", "iquest"}, // Inverted question mark = turned question mark
        {"\u00C0", "Agrave"}, // À - uppercase A, grave accent
        {"\u00C1", "Aacute"}, // Á - uppercase A, acute accent
        {"\u00C2", "Acirc"},  // Â - uppercase A, circumflex accent
        {"\u00C3", "Atilde"}, // Ã - uppercase A, tilde
        {"\u00C4", "Auml"},   // Ä - uppercase A, umlaut
        {"\u00C5", "Aring"},  // Å - uppercase A, ring
        {"\u00C6", "AElig"},  // Æ - uppercase AE
        {"\u00C7", "Ccedil"}, // Ç - uppercase C, cedilla
        {"\u00C8", "Egrave"}, // È - uppercase E, grave accent
        {"\u00C9", "Eacute"}, // É - uppercase E, acute accent
        {"\u00CA", "Ecirc"},  // Ê - uppercase E, circumflex accent
        {"\u00CB", "Euml"},   // Ë - uppercase E, umlaut
        {"\u00CC", "Igrave"}, // Ì - uppercase I, grave accent
        {"\u00CD", "Iacute"}, // Í - uppercase I, acute accent
        {"\u00CE", "Icirc"},  // Î - uppercase I, circumflex accent
        {"\u00CF", "Iuml"},   // Ï - uppercase I, umlaut
        {"\u00D0", "ETH"},    // Ð - uppercase Eth, Icelandic
        {"\u00D1", "Ntilde"}, // Ñ - uppercase N, tilde
        {"\u00D2", "Ograve"}, // Ò - uppercase O, grave accent
        {"\u00D3", "Oacute"}, // Ó - uppercase O, acute accent
        {"\u00D4", "Ocirc"},  // Ô - uppercase O, circumflex accent
        {"\u00D5", "Otilde"}, // Õ - uppercase O, tilde
        {"\u00D6", "Ouml"},   // Ö - uppercase O, umlaut
        {"\u00D7", "times"},  // Multiplication sign
        {"\u00D8", "Oslash"}, // Ø - uppercase O, slash
        {"\u00D9", "Ugrave"}, // Ù - uppercase U, grave accent
        {"\u00DA", "Uacute"}, // Ú - uppercase U, acute accent
        {"\u00DB", "Ucirc"},  // Û - uppercase U, circumflex accent
        {"\u00DC", "Uuml"},   // Ü - uppercase U, umlaut
        {"\u00DD", "Yacute"}, // Ý - uppercase Y, acute accent
        {"\u00DE", "THORN"},  // Þ - uppercase THORN, Icelandic
        {"\u00DF", "szlig"},  // ß - lowercase sharp s, German
        {"\u00E0", "agrave"}, // à - lowercase a, grave accent
        {"\u00E1", "aacute"}, // á - lowercase a, acute accent
        {"\u00E2", "acirc"},  // â - lowercase a, circumflex accent
        {"\u00E3", "atilde"}, // ã - lowercase a, tilde
        {"\u00E4", "auml"},   // ä - lowercase a, umlaut
        {"\u00E5", "aring"},  // å - lowercase a, ring
        {"\u00E6", "aelig"},  // æ - lowercase ae
        {"\u00E7", "ccedil"}, // ç - lowercase c, cedilla
        {"\u00E8", "egrave"}, // è - lowercase e, grave accent
        {"\u00E9", "eacute"}, // é - lowercase e, acute accent
        {"\u00EA", "ecirc"},  // ê - lowercase e, circumflex accent
        {"\u00EB", "euml"},   // ë - lowercase e, umlaut
        {"\u00EC", "igrave"}, // ì - lowercase i, grave accent
        {"\u00ED", "iacute"}, // í - lowercase i, acute accent
        {"\u00EE", "icirc"},  // î - lowercase i, circumflex accent
        {"\u00EF", "iuml"},   // ï - lowercase i, umlaut
        {"\u00F0", "eth"},    // ð - lowercase eth, Icelandic
        {"\u00F1", "ntilde"}, // ñ - lowercase n, tilde
        {"\u00F2", "ograve"}, // ò - lowercase o, grave accent
        {"\u00F3", "oacute"}, // ó - lowercase o, acute accent
        {"\u00F4", "ocirc"},  // ô - lowercase o, circumflex accent
        {"\u00F5", "otilde"}, // õ - lowercase o, tilde
        {"\u00F6", "ouml"},   // ö - lowercase o, umlaut
        {"\u00F7", "divide"}, // Division sign
        {"\u00F8", "oslash"}, // ø - lowercase o, slash
        {"\u00F9", "ugrave"}, // ù - lowercase u, grave accent
        {"\u00FA", "uacute"}, // ú - lowercase u, acute accent
        {"\u00FB", "ucirc"},  // û - lowercase u, circumflex accent
        {"\u00FC", "uuml"},   // ü - lowercase u, umlaut
        {"\u00FD", "yacute"}, // ý - lowercase y, acute accent
        {"\u00FE", "thorn"},  // þ - lowercase thorn, Icelandic
        {"\u00FF", "yuml"},   // ÿ - lowercase y, umlaut
    };

    private static final int MIN_ESCAPE = 2;
    private static final int MAX_ESCAPE = 6;

    private static final HashMap<String, CharSequence> lookupMap;
    static {
        lookupMap = new HashMap<String, CharSequence>();
        for (final CharSequence[] seq : ESCAPES)
            lookupMap.put(seq[1].toString(), seq[0]);
    }

}

Recently, I had to optimize a slow Struts project. It turned out that under the cover Struts calls Apache for html string escaping by default (``). Turning off escaping (``) got some pages to run 5% to 20% faster. — Stephan, Jul 13 '14 at 22:10
Later I found out that this code can enter an endless loop when given an empty string as argument. The current edition has that problem fixed. – Nick Frolov, Sep 17 '14 at 06:20
Does this escape or unescape? `&amp;` is not decoded. Only `&` is added to the map, so it only works one way? – mjs, Jan 31 '15 at 19:29
A StringWriter uses a StringBuffer internally which uses locking. Using a StringBuilder directly should be faster. — Axel Dörfler, Feb 22 '16 at 12:40
found a bug in the above code when encountering a numeric escape for "=". writer.write(entityValue); should be writer.write(Character.toString((char)entityValue)); – Stevko, May 16 '16 at 23:54
@NickFrolov, your comments seem a bit messed up. `auml` is for instance `ä` and not `д`. — aioobe, Oct 17 '16 at 01:52
Improved version with all HTML5 characters: https://gist.github.com/MarkJeronimus/798c452582e64410db769933ec71cfb7 — Mark Jeronimus, Jun 22 '20 at 12:26
v2 in my gist (link above ↑). Works the same but has smaller class file footprint and shorter compile time. If there are issues, v1 is in the gist edit history. — Mark Jeronimus, Jul 09 '20 at 14:59
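To illustrate the comments above about using a StringBuilder and about how numeric escapes are written out, here is a minimal sketch that decodes only numeric references (`&#NNN;` and `&#xHHH;`). It is not the answer's code: the class name `NumericEntityDecoder` and its methods are hypothetical, named entities are deliberately left alone, and no filtering of control characters is attempted.

```java
final class NumericEntityDecoder {

    private NumericEntityDecoder() {}

    /** Decodes &#NNN; and &#xHHH; references; everything else is copied unchanged. */
    static String decodeNumeric(String input) {
        StringBuilder out = null;   // allocated lazily, only once something is decoded
        int copiedUpTo = 0;         // start of the part of input not yet copied to out
        int amp = input.indexOf("&#");
        while (amp >= 0) {
            int semi = input.indexOf(';', amp + 2);
            if (semi < 0) {
                break;              // no terminator anywhere after this point
            }
            int codePoint = parseCodePoint(input, amp + 2, semi);
            if (codePoint >= 0) {
                if (out == null) {
                    out = new StringBuilder(input.length());
                }
                out.append(input, copiedUpTo, amp);
                out.appendCodePoint(codePoint);  // also covers values above 0xFFFF
                copiedUpTo = semi + 1;
                amp = input.indexOf("&#", copiedUpTo);
            } else {
                amp = input.indexOf("&#", amp + 2);  // not a valid reference, keep scanning
            }
        }
        if (out == null) {
            return input;           // nothing decoded: return the same instance
        }
        return out.append(input, copiedUpTo, input.length()).toString();
    }

    /** Returns the code point written in input[start, end), or -1 if it is not valid. */
    private static int parseCodePoint(String input, int start, int end) {
        int radix = 10;
        if (start < end && (input.charAt(start) == 'x' || input.charAt(start) == 'X')) {
            start++;
            radix = 16;
        }
        if (start >= end || end - start > 8) {
            return -1;              // empty or implausibly long digit run
        }
        if (Character.digit(input.charAt(start), radix) < 0) {
            return -1;              // rejects a leading sign that parseInt would accept
        }
        try {
            int value = Integer.parseInt(input.substring(start, end), radix);
            return Character.isValidCodePoint(value) ? value : -1;
        } catch (NumberFormatException ex) {
            return -1;
        }
    }
}
```

Calling `NumericEntityDecoder.decodeNumeric("a &#61; b")` yields `a = b`. `StringBuilder.appendCodePoint` emits a surrogate pair when needed, so no separate branch for values above 0xFFFF is required.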

score 17 · Answer 4 · edited Jul 27 '16 at 12:02

The following library can also be used for HTML escaping in Java: unbescape.

HTML can be unescaped this way:

final String unescapedText = HtmlEscape.unescapeHtml(escapedText);

edited Jul 27 '16 at 12:02

answered Jul 13 '14 at 22:59 by Stephan

It did nothing to this: `%3Chtml%3E%0D%0A%3Chead%3E%0D%0A%3Ctitle%3Etest%3C%2Ftitle%3E%0D%0A%3C%2Fhead%3E%0D%0A%3Cbody%3E%0D%0Atest%0D%0A%3C%2Fbody%3E%0D%0A%3C%2Fhtml%3E` – ThreaT Aug 27 '15 at 16:33

@ThreaT Your text is not html-encoded, it is url-encoded. – Mikhail Batcer Oct 28 '15 at 07:23
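To make the distinction in this comment concrete: percent-encoded text such as `%3Chtml%3E` is URL-encoded, so an HTML unescaper correctly leaves it alone. A minimal sketch with the JDK's `java.net.URLDecoder` follows; it assumes Java 10 or later for the `Charset` overload and UTF-8 as the encoding, and the class name `UrlDecodeExample` is hypothetical.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class UrlDecodeExample {

    private UrlDecodeExample() {}

    /** Decodes percent-encoded (application/x-www-form-urlencoded) text as UTF-8. */
    static String urlDecode(String encoded) {
        // Note: URLDecoder also turns '+' into a space, as form encoding requires.
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }
}
```

If the decoded result still contains character entities, it can then be passed to any of the HTML unescapers from the answers.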

score 16 · Answer 5 · answered May 14 '20 at 09:10

Spring Framework HtmlUtils

If you're already using the Spring Framework, use the following method:

import static org.springframework.web.util.HtmlUtils.htmlUnescape;

...

String result = htmlUnescape(source);

answered May 14 '20 at 09:10 by herman

score 12 · Answer 6 · edited May 03 '23 at 14:21

This did the job for me:

import org.apache.commons.lang.StringEscapeUtils;
...
String decodedXML = StringEscapeUtils.unescapeHtml(encodedXML);

Or

import org.apache.commons.lang3.StringEscapeUtils;
...
String decodedXML = StringEscapeUtils.unescapeHtml4(encodedXML);

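The lang3 version is the better choice, since Commons Lang 3 is the newer line of the library. Note that lang3's StringEscapeUtils has since been deprecated in favor of the class of the same name in Apache Commons Text.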

edited May 03 '23 at 14:21 by Peter Mortensen

answered Apr 19 '17 at 02:31 by tk_

score 4 · Answer 7 · edited May 03 '23 at 14:22

A very simple, but inefficient solution without any external library is:

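import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static String unescapeHtml3(String str) {
    try {
        // Let Swing's HTML parser decode the entities, then read back the plain text
        HTMLDocument doc = new HTMLDocument();
        new HTMLEditorKit().read(new StringReader("<html><body>" + str), doc, 0);
        return doc.getText(1, doc.getLength());
    } catch(Exception ex) {
        // On any parse or I/O failure, fall back to the unmodified input
        return str;
    }
}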

Use this only if you have a small number of strings to decode.

edited May 03 '23 at 14:22 by Peter Mortensen

answered Dec 03 '16 at 22:07 by Horcrux7

Very close, but not exact - it converted "qwAS12ƷƸǅǚǪǼȌ" to "qwAS12ƷƸǅǚǪǼȌ\n". – Greg Jul 16 '18 at 17:21

score 3 · Answer 8 · edited Sep 12 '17 at 21:43

The most reliable way is with

String cleanedString = StringEscapeUtils.unescapeHtml4(originalString);

from org.apache.commons.lang3.StringEscapeUtils.

And to strip leading and trailing whitespace:

cleanedString = cleanedString.trim();

This ensures that stray whitespace from copying and pasting into web forms does not get persisted in the database.

edited Sep 12 '17 at 21:43 by Floern

answered Sep 12 '17 at 21:16 by mike oganyan

score 1 · Answer 9 · answered Sep 09 '21 at 12:07

StringEscapeUtils (Apache Commons Lang)
Escapes and unescapes Strings for Java, JavaScript, HTML, and XML.

import org.apache.commons.lang.StringEscapeUtils;
....
StringEscapeUtils.unescapeHtml(comment);

Reference (Apache Commons Text, which provides the same class with the method named unescapeHtml4): https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html

score 0 · Answer 10 · edited May 03 '23 at 13:45

Consider using the HtmlManipulator Java class. You may need to add some items (not all entities are in the list).

The Apache Commons StringEscapeUtils as suggested by Kevin Hakanson did not work 100% for me; several entities, like &#145 (left single quote) were translated into '222' somehow. I also tried org.jsoup, and had the same problem.

edited May 03 '23 at 13:45 by Peter Mortensen

answered Jun 03 '14 at 23:25 by Joost

222 is likely in octal (hexadecimal 0x92, decimal 146). In [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout) (but not in [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1)), 0x92 corresponds to U+2019 ([RIGHT SINGLE QUOTATION MARK](https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128)). Are you sure it is not octal 221? Or right single quote? – Peter Mortensen May 03 '23 at 14:04
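The mismatch discussed here can be sketched with the JDK alone. Numeric references in the range 128 to 159 name C1 control characters in Unicode, while browsers interpret them as Windows-1252 bytes, which is why `&#145;` is expected to display as a left single quote. The following is a hedged sketch of a remapping step that could be applied to a parsed numeric value before appending it; it assumes the `windows-1252` charset is available in the running JDK, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

final class Cp1252Remap {

    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    private Cp1252Remap() {}

    /**
     * Maps a numeric reference value in 0x80..0x9F to the character that byte
     * denotes in Windows-1252; all other values are returned unchanged.
     */
    static int remap(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x9F) {
            return codePoint;
        }
        String decoded = new String(new byte[] {(byte) codePoint}, WINDOWS_1252);
        int mapped = decoded.codePointAt(0);
        // Bytes without a Windows-1252 assignment decode to U+FFFD; keep the input then.
        return mapped == 0xFFFD ? codePoint : mapped;
    }
}
```

With this, `Cp1252Remap.remap(145)` gives U+2018 (left single quote) and `Cp1252Remap.remap(146)` gives U+2019 (right single quote).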

score 0 · Answer 11 · edited May 03 '23 at 14:19

In my case, I call the replace method once for every entity I expect in the text. My code looks like this:

text = text.replace("&Ccedil;", "Ç");
text = text.replace("&ccedil;", "ç");
text = text.replace("&Aacute;", "Á");
text = text.replace("&Acirc;", "Â");
text = text.replace("&Atilde;", "Ã");
text = text.replace("&Eacute;", "É");
text = text.replace("&Ecirc;", "Ê");
text = text.replace("&Iacute;", "Í");
text = text.replace("&Ocirc;", "Ô");
text = text.replace("&Otilde;", "Õ");
text = text.replace("&Oacute;", "Ó");
text = text.replace("&Uacute;", "Ú");
text = text.replace("&aacute;", "á");
text = text.replace("&acirc;", "â");
text = text.replace("&atilde;", "ã");
text = text.replace("&eacute;", "é");
text = text.replace("&ecirc;", "ê");
text = text.replace("&iacute;", "í");
text = text.replace("&ocirc;", "ô");
text = text.replace("&otilde;", "õ");
text = text.replace("&oacute;", "ó");
text = text.replace("&uacute;", "ú");

In my case this worked very well.

This isn't every special entity. Even the two mentioned in the question are missing. — Sandy Gifford, Oct 27 '16 at 15:27
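A chain of replace calls scans the string once per entity and has to be extended by hand. As a sketch of an alternative that still uses only the JDK, the table can live in a map and be applied in one pass with a regular expression. The class name, method name, and map contents (a small illustrative subset) are assumptions made for this example, and `Matcher.replaceAll(Function)` requires Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedEntityReplacer {

    // Illustrative subset only; extend with whatever entities the input may contain.
    private static final Map<String, String> ENTITIES = Map.of(
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "nbsp", "\u00A0",
            "Ccedil", "\u00C7",
            "ccedil", "\u00E7",
            "eacute", "\u00E9",
            "atilde", "\u00E3");

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    private NamedEntityReplacer() {}

    /** Replaces known named entities in a single pass; unknown ones are left as written. */
    static String unescapeNamed(String text) {
        return ENTITY.matcher(text).replaceAll(match -> {
            String replacement = ENTITIES.get(match.group(1));
            // quoteReplacement keeps '$' and '\' in the output from being treated specially
            return Matcher.quoteReplacement(replacement != null ? replacement : match.group());
        });
    }
}
```

Because the input is scanned only once, text such as `&amp;lt;` becomes `&lt;` and is not decoded a second time.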

score -7 · Answer 12 · edited May 03 '23 at 13:42

In case you want to mimic what the PHP function htmlspecialchars_decode() does, use the PHP function get_html_translation_table() to dump the table and then use Java code like this:

import java.util.LinkedHashMap;
import java.util.Map;

// Insertion order matters: "&amp;" must be decoded last, otherwise
// "&amp;lt;" would be decoded twice and end up as "<".
static final Map<String, String> html_specialchars_table = new LinkedHashMap<String, String>();

static {
    html_specialchars_table.put("&lt;", "<");
    html_specialchars_table.put("&gt;", ">");
    html_specialchars_table.put("&amp;", "&");
}

static String htmlspecialchars_decode_ENT_NOQUOTES(String s) {
    for (Map.Entry<String, String> entry : html_specialchars_table.entrySet()) {
        // replace() treats the key literally; replaceAll() would parse it as a regex
        s = s.replace(entry.getKey(), entry.getValue());
    }
    return s;
}
