How to decode XHTML and/or HTML5 entities in Java?

Question

I have some strings that contain XHTML character entities:

"They&apos;re quite varied"
"Sometimes the string &isin; XML standard, sometimes &isin; HTML4 standard"
"Therefore -&gt; I need an XHTML entity decoder."
"Sadly, some strings are not valid XML & are not-quite-so-valid HTML <- but I want them to work, too."

Is there any easy way to decode the entities? (I'm using Java)

I'm currently using StringEscapeUtils.unescapeHtml4(myString.replace("&apos;", "\'")) as a temporary hack. Sadly, org.apache.commons.lang3.StringEscapeUtils has unescapeHtml4 and unescapeXml, but no unescapeXhtml.

EDIT: I do want to handle invalid XML; for example, I want "&amp;&xyzzy;" to decode to "&&xyzzy;".

EDIT: I think HTML5 has almost the same character entities as XHTML, so I think HTML 5 decoder would be fine too.

@JanDvorak: If the input was guaranteed to be **valid** XHTML, then I'd be happy. Furthermore, XML by itself doesn't have all the HTML references. — Karol S, Feb 19 '14 at 14:38
Wikipedia says [otherwise](http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). — Sotirios Delimanolis, Feb 19 '14 at 14:41
@SotiriosDelimanolis: `&apos;` is not a character entity reference in HTML4. – Karol S, Feb 19 '14 at 14:58
@KarolS XHTML only has the additional `apos` over HTML4, so it looks like your "temporary hack" should work. Unless it doesn't handle the errors you mention? — Mr Lister, Mar 01 '14 at 15:12
@SotiriosDelimanolis XML documents by themselves only know about the five XML entity names. XHTML documents need extra input (in the form of an XHTML doctype) to be able to handle all the HTML ones. And the HTML5 doctype in an XHTML file (so-called XHTML5) doesn't work; such documents can't handle entity names beyond the XML ones. — Mr Lister, Mar 01 '14 at 15:18

score 1 · Answer 1 · answered Feb 19 '14 at 14:34

This may not be directly relevant, but you may wish to adopt JSoup, which handles entity decoding as part of its HTML parsing, albeit at a higher level. It also includes web page cleaning routines.

jmkgreen


Thanks, looks great, but in my use case it would be overkill. – Karol S Feb 19 '14 at 14:59
There is no such thing as overkill - only problems and solutions. JSoup is a solution and a far better one than doing manual search & replaces. – Gimby Feb 19 '14 at 15:56

score 1 · Answer 2 · answered Jun 25 '21 at 01:36

Have you tried implementing an XHTMLStringEscapeUtils based on the facilities provided by org.apache.commons.text.StringEscapeUtils?

import org.apache.commons.text.StringEscapeUtils;
import org.apache.commons.text.translate.*;

public class XHTMLStringEscapeUtils {
    // Escaping: HTML4 named entities first, then the XML 1.1 rules
    // (which add &apos; and handle characters XML does not allow).
    public static final CharSequenceTranslator ESCAPE_XHTML =
            new AggregateTranslator(
                    new LookupTranslator(EntityArrays.BASIC_ESCAPE),
                    new LookupTranslator(EntityArrays.ISO8859_1_ESCAPE),
                    new LookupTranslator(EntityArrays.HTML40_EXTENDED_ESCAPE)
            ).with(StringEscapeUtils.ESCAPE_XML11);

    // Unescaping: the HTML4 named entities, numeric references (&#38; / &#x26;),
    // plus &apos;, which HTML4 lacks but XHTML inherits from XML.
    public static final CharSequenceTranslator UNESCAPE_XHTML =
            new AggregateTranslator(
                    new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
                    new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
                    new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
                    new NumericEntityUnescaper(),
                    new LookupTranslator(EntityArrays.APOS_UNESCAPE)
            );

    public static String escape(final String input) {
        return ESCAPE_XHTML.translate(input);
    }

    public static String unescape(final String input) {
        return UNESCAPE_XHTML.translate(input);
    }
}

Thanks to the modular design of the Apache Commons Text library, it's easy to create custom escape utilities.

A full project with tests is available in xhtml-string-escape-utils.
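If pulling in a library really is too much for your use case, the lenient behaviour asked for in the question (known references decoded, anything unrecognised left alone) can be sketched with the JDK alone. This is only an illustration, not how Commons Text or JSoup do it: the class name LenientEntityDecoder is made up, the entity table is deliberately tiny (the five XML names plus isin), and it assumes Java 9+ and that every reference ends with a semicolon. A real decoder would need the full named-reference list.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of any library. */
final class LenientEntityDecoder {

    // Assumption: a tiny sample table. The five XML names plus one HTML name.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "quot", "\"",
            "apos", "'",
            "isin", "\u2208");

    // Matches &#x1F; (hex), &#123; (decimal) or &name; with a mandatory semicolon.
    private static final Pattern REFERENCE =
            Pattern.compile("&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));");

    private LenientEntityDecoder() {}

    static String decode(String input) {
        Matcher m = REFERENCE.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = resolve(m);
            // Unknown or invalid references are copied through unchanged.
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference is not recognised.
    private static String resolve(Matcher m) {
        try {
            if (m.group(1) != null) return codePoint(Integer.parseInt(m.group(1), 16));
            if (m.group(2) != null) return codePoint(Integer.parseInt(m.group(2), 10));
        } catch (NumberFormatException e) {
            return null; // too many digits to fit a code point
        }
        return NAMED.get(m.group(3));
    }

    private static String codePoint(int cp) {
        return Character.isValidCodePoint(cp) ? new String(Character.toChars(cp)) : null;
    }
}
```

With this sketch, LenientEntityDecoder.decode("&amp;&xyzzy;") gives "&&xyzzy;", and a bare & or < in the input passes through untouched because the pattern only matches complete references. Each reference is decoded once, so "&amp;lt;" becomes "&lt;" rather than "<".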
