What is the recommended way to escape HTML symbols in plain Java?

Question

Is there a recommended way to escape <, >, " and & characters when outputting HTML in plain Java code? (Other than manually doing the following, that is).

String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = source.replace("<", "&lt;").replace("&", "&amp;"); // ...

Be aware that if you are outputting into an unquoted HTML attribute, that other characters such as space, tab, backspace, etc... can allow attackers to introduce javascript attributes without any of the characters listed. See the OWASP XSS Prevention Cheat Sheet for more. — Jeff Williams, Mar 19 '14 at 17:01
BTW, in this code, you should escape "&" before "<" for this to work properly ("<" gets replaced with "&amp;lt;" otherwise, which is then rendered as "&lt;", not "<"): `source.replace("&", "&amp;").replace("<", "&lt;");` – Tey', Feb 23 '20 at 14:20
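
The ordering issue raised in the comment above can be checked with a small sketch. The class and method names (`ReplaceOrder`, `wrongOrder`, `rightOrder`) are made up for illustration, and only `&` and `<` are handled, as in the question's snippet.

```java
final class ReplaceOrder {
    private ReplaceOrder() {}

    // Escapes '<' first; the '&' introduced by "&lt;" is then escaped a second time.
    static String wrongOrder(String s) {
        return s.replace("<", "&lt;").replace("&", "&amp;");
    }

    // Escapes '&' first, so entities produced by the later step are left intact.
    static String rightOrder(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }
}
```

With the input `a<b`, `wrongOrder` yields `a&amp;lt;b` (double-escaped), while `rightOrder` yields `a&lt;b`.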

score 287 · Answer 1 · edited Aug 04 '15 at 14:35

StringEscapeUtils from Apache Commons Lang:

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);

For version 3:

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
// ...
String escaped = escapeHtml4(source);

edited Aug 04 '15 at 14:35 by Luke S. · answered Aug 12 '09 at 10:00 by dfa

While `StringEscapeUtils` is nice, it will not escape whitespace properly for attributes if you wish to avoid HTML/XML whitespace normalization. See my answer for greater detail. – Adam Gent Aug 07 '13 at 20:28
StringEscapeUtils.escapeHtml() only accepts String as input, which seems unnecessarily rigid. In this modern world of JSON, some things output to the page will be numbers, for example, in which case this method breaks. – greim Apr 03 '14 at 19:48
The above example is broken. Use the escapeHtml4() method now. – stackoverflowuser2010 Jun 24 '14 at 17:47
For Guava fans see [okrasz's answer](http://stackoverflow.com/a/26572556/245602) below. – George Hawkins Jan 27 '15 at 12:28
If the webpage has UTF-8 encoding then all we need is Guava's htmlEscaper, which escapes only the following five ASCII characters: '"&<>. Apache's escapeHtml() also replaces non-ASCII characters, including accents, which seems unnecessary with UTF-8 web pages? – zdenekca Apr 20 '15 at 15:31
@greim - When might numbers contain content that needs to be escaped? – Greg Brown Mar 07 '16 at 16:49
@dfa - When I use escapeHtml(input), the double quote (") in my HTML string gets converted to &quot;, which I don't want. Is there any way to customize that? – Rashmi Ranjan mallick Feb 27 '17 at 06:46
It is now deprecated in commons-lang3. It was moved to https://commons.apache.org/proper/commons-text/ – Danny Aug 16 '17 at 14:11
@Danny - that link does not mention escaping HTML. Can you be more specific? – Steve Staple Jul 11 '18 at 10:26
Here is a direct link to the latest namespace at [org.apache.commons.text.StringEscapeUtils](https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html#escapeHtml4-java.lang.String-) as @Danny mentioned – dbudzins Jul 20 '18 at 12:36
NOTE: StringEscapeUtils.escapeHtml does not escape the apostrophe character, leaving you with a big bug or even a security vulnerability. Use other working tools, like Spring HtmlUtils or others mentioned. – David Balažic Sep 20 '18 at 10:48
Use `org.apache.commons.text.StringEscapeUtils` for `apache.commons.text` – Spectric Apr 22 '21 at 02:50

score 164 · Answer 2 · edited Aug 12 '09 at 10:23

An alternative to Apache Commons: Use Spring's HtmlUtils.htmlEscape(String input) method.

edited Aug 12 '09 at 10:23 by skaffman · answered Aug 12 '09 at 10:22 by Adamski

Thanks. I've used it (instead of `StringEscapeUtils.escapeHtml()` from `apache-commons` 2.6) because it leaves Russian characters as is. – Slava Semushin Jul 30 '12 at 13:18
That's good to know. TBH I give Apache stuff a wide berth these days. – Adamski Jul 31 '12 at 08:23
I've used it, too, it leaves Chinese characters as is, too. – vr3C Jun 09 '15 at 10:31
And it also encodes the apostrophe, so it is actually useful, unlike apache StringEscapeUtils – David Balažic Sep 20 '18 at 10:50

score 65 · Answer 3 · edited Apr 17 '20 at 15:23

Nice short method:

public static String escapeHTML(String s) {
    StringBuilder out = new StringBuilder(Math.max(16, s.length()));
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c > 127 || c == '"' || c == '\'' || c == '<' || c == '>' || c == '&') {
            out.append("&#");
            out.append((int) c);
            out.append(';');
        } else {
            out.append(c);
        }
    }
    return out.toString();
}

Based on https://stackoverflow.com/a/8838023/1199155 (the amp is missing there). Apart from the apostrophe, the characters checked in the if clause are the only ones below 128 that have named entities, according to http://www.w3.org/TR/html4/sgml/entities.html

edited Apr 17 '20 at 15:23 by Aloso · answered Aug 10 '14 at 12:12 by Bruno Eberhard

Nice. It doesn't use the "html versions" of the encodings (example: "á" would be "&#225;" instead of "&aacute;"), but since the numeric ones work even in IE7 I guess I don't have to worry. Thanks. – nonzaprej Sep 04 '17 at 15:46
Why do you encode all those characters when the OP asked to escape the 4 relevant characters? You are wasting CPU and memory. – David Balažic Sep 20 '18 at 10:51
You forgot the apostrophe. So people can inject unquoted attributes everywhere where this code is used to escape attribute values. – David Balažic Sep 20 '18 at 10:59
This does not work when the string contains surrogate pairs, e.g. emojis. – Clashsoft Aug 14 '20 at 09:31

score 47 · Answer 4 · edited Sep 23 '13 at 05:14

There is a newer version of the Apache Commons Lang library, and it uses a different package name (org.apache.commons.lang3). StringEscapeUtils now has different static methods for escaping different types of documents (http://commons.apache.org/proper/commons-lang/javadocs/api-3.0/index.html). So, to escape a string for HTML 4.0:

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;

String output = escapeHtml4("The less than sign (<) and ampersand (&) must be escaped before using them in HTML");

edited Sep 23 '13 at 05:14 by Dawood ibn Kareem · answered Jul 19 '11 at 14:58 by Martin Dimitrov

Unfortunately nothing exists for HTML 5, nor do the Apache documents specify if it is proper to use escapeHtml4 for HTML 5. – Paul Vincent Craven Jul 23 '15 at 14:08
`escapeHtml4` has been moved to org.apache.commons.text.StringEscapeUtils. – Mike Lowery Mar 13 '23 at 00:51

score 46 · Answer 5 · answered Oct 26 '14 at 11:40

For those who use Google Guava:

import com.google.common.html.HtmlEscapers;
[...]
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = HtmlEscapers.htmlEscaper().escape(source);

answered Oct 26 '14 at 11:40 by okrasz

Where can I use the same, but for unescape? – android developer May 14 '23 at 13:03

score 42 · Answer 6 · edited Apr 14 '20 at 19:22

Be careful with this. There are a number of different 'contexts' within an HTML document: Inside an element, quoted attribute value, unquoted attribute value, URL attribute, javascript, CSS, etc... You'll need to use a different encoding method for each of these to prevent Cross-Site Scripting (XSS). Check the OWASP XSS Prevention Cheat Sheet for details on each of these contexts. You can find escaping methods for each of these contexts in the OWASP ESAPI library -- https://github.com/ESAPI/esapi-java-legacy.
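
To make the idea of contexts concrete, here is a minimal sketch covering only two of them: element content and quoted attribute values. It is not the ESAPI API; the class and method names (`HtmlContexts`, `forElementContent`, `forQuotedAttribute`) are hypothetical, and the sketch deliberately does not cover unquoted attributes, URLs, JavaScript or CSS, which need their own encoders.

```java
final class HtmlContexts {
    private HtmlContexts() {}

    /** For text placed between tags, e.g. <p>HERE</p>. */
    static String forElementContent(String s) {
        return encode(s, false);
    }

    /** For a value placed inside a quoted attribute, e.g. title="HERE" or title='HERE'. */
    static String forQuotedAttribute(String s) {
        return encode(s, true);
    }

    private static String encode(String s, boolean inAttribute) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '&') {
                out.append("&amp;");
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == '>') {
                out.append("&gt;");
            } else if (inAttribute && c == '"') {
                out.append("&#34;");   // quotes only matter inside attribute values
            } else if (inAttribute && c == '\'') {
                out.append("&#39;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The point of the split is that the caller states where the value is going, rather than relying on one generic "escape HTML" call for every position in the document.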

edited Apr 14 '20 at 19:22 by Miha_x64 · answered Feb 15 '13 at 17:37 by Jeff Williams

THANK YOU for pointing out that the *context* in which you wish to encode the output very much matters. The term "encode" is also a much more appropriate verb than "escape". "Escape" implies some kind of special hack, as opposed to "how do I *encode* this string for: an XHTML attribute / SQL query parameter / PostScript print string / CSV output field?" – Roboprog Apr 30 '13 at 01:07
'Encode' and 'escape' are both widely used to describe this. The term "escape" is generally used when the process is to add an "escape character" before a syntactically relevant character, such as escaping a quote character with a backslash: \". The term "encode" is more typically used when you translate a character into a different form, such as URL encoding the quote character as %22 or HTML entity encoding it as &#34; or &quot;. – Jeff Williams Mar 19 '14 at 16:58
The link http://owasp-esapi-java.googlecode.com/svn/trunk_doc/latest/index.html is now broken. – andrew pate Jan 05 '17 at 22:09
To save you some googling, look for the Encoder class: https://static.javadoc.io/org.owasp.esapi/esapi/2.0.1/org/owasp/esapi/Encoder.html#encodeForHTMLAttribute(java.lang.String) – Jakub Bochenski Aug 12 '19 at 09:49

score 40 · Answer 7 · answered Apr 05 '13 at 09:41

On Android (API 16 or greater) you can use:

Html.escapeHtml(textToScape);

or for lower API:

TextUtils.htmlEncode(textToScape);

answered Apr 05 '13 at 09:41 by OriolJ

See also [my question](http://stackoverflow.com/questions/35104032/whats-the-difference-between-androids-html-escapehtml-and-textutils-htmlencode) about the difference between these two. (@Muz ) – Jonas Czech Feb 16 '16 at 14:53
What about unescape? – android developer May 14 '23 at 13:04

score 20 · Answer 8 · edited Jul 05 '19 at 23:37

For some purposes, HtmlUtils:

import org.springframework.web.util.HtmlUtils;
[...]
HtmlUtils.htmlEscapeDecimal("&"); //gives &#38;
HtmlUtils.htmlEscape("&"); //gives &amp;

edited Jul 05 '19 at 23:37 by Eric · answered May 19 '10 at 12:12 by AUU

From the Spring HtmlUtils comments: "For a comprehensive set of String escaping utilities, consider Apache Commons Lang and its StringEscapeUtils class. We are not using that class here to avoid a runtime dependency on Commons Lang just for HTML escaping. Furthermore, Spring's HTML escaping is more flexible and 100% HTML 4.0 compliant." If you are already using Apache Commons in your project, you should probably use the StringEscapeUtils from Apache. – andreyro Sep 13 '19 at 09:09

score 16 · Answer 9 · answered May 30 '18 at 09:54

org.apache.commons.lang3.StringEscapeUtils is now deprecated. You must now use org.apache.commons.text.StringEscapeUtils, by adding this dependency:

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-text</artifactId>
        <version>${commons.text.version}</version>
    </dependency>

score 14 · Answer 10 · answered Aug 07 '13 at 20:26

While @dfa's answer of org.apache.commons.lang.StringEscapeUtils.escapeHtml is nice, and I have used it in the past, it should not be used for escaping HTML (or XML) attributes; otherwise the whitespace will be normalized (meaning all adjacent whitespace characters become a single space).

I know this because I have had bugs filed against my library (JATL) for attributes where whitespace was not preserved. Thus I have a drop-in (copy 'n' paste) class (some of which I stole from JDOM) that differentiates the escaping of attributes and element content.

While this (proper attribute escaping) may not have mattered as much in the past, it is becoming of increasingly greater interest given the use of HTML5's data- attributes.
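
A minimal sketch of what attribute-specific escaping could look like, assuming a double-quoted attribute and assuming the goal is to keep tabs and line breaks from being normalized to spaces by an XML parser. This is not the JATL or JDOM code; the class name `AttributeWhitespace` is hypothetical. Whitespace written as numeric character references survives attribute-value normalization, whereas literal tabs and line breaks do not.

```java
final class AttributeWhitespace {
    private AttributeWhitespace() {}

    /** Escapes a value for use inside a double-quoted attribute, preserving whitespace. */
    static String escapeAttribute(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '"':  out.append("&quot;"); break;
                // Written as character references so the parser keeps them as-is.
                case '\t': out.append("&#9;");   break;
                case '\n': out.append("&#10;");  break;
                case '\r': out.append("&#13;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Element content would use a separate method that leaves whitespace and quotes untouched, which is the distinction the answer draws.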

Miha_x64 · Answer 11 · 2021-10-27T09:48:13.863

Most libraries offer escaping everything they can, including hundreds of symbols and thousands of non-ASCII characters, which is not what you want in a UTF-8 world.

Also, as Jeff Williams noted, there's no single “escape HTML” option; there are several contexts.

Assuming you never use unquoted attributes, and keeping in mind that different contexts exist, I've written my own version:

import java.io.IOException;
import java.io.UncheckedIOException;
// ...
private static final long TEXT_ESCAPE =
        1L << '&' | 1L << '<';
private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '"';
private static final long SINGLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '\'';
private static final long ESCAPES =
        DOUBLE_QUOTED_ATTR_ESCAPE | SINGLE_QUOTED_ATTR_ESCAPE;

// 'quot' and 'apos' are 1 char longer than '#34' and '#39'
// which I've decided to use
private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
private static final int REPL_SLICES = /*  [0,   5,   10,  15, 19) */
        5<<5 | 10<<10 | 15<<15 | 19<<20;
// These 5-bit numbers packed into a single int
// are indices within REPLACEMENTS which is a 'flat' String[]

private static void appendEscaped(
        Appendable builder, CharSequence content, long escapes) {
    try {
        int startIdx = 0, len = content.length();
        for (int i = 0; i < len; i++) {
            char c = content.charAt(i);
            long one;
            if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
            // -^^^^^^^^^^^^^^^   -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            // |                  | take only dangerous characters
            // | java shifts longs by 6 least significant bits,
            // | e. g. << 0b110111111 is same as << 0b111111.
            // | Filter out bigger characters

                int index = Long.bitCount(ESCAPES & (one - 1));
                builder.append(content, startIdx, i /* exclusive */).append(
                        REPLACEMENTS,
                        REPL_SLICES >>> (5 * index) & 31,
                        REPL_SLICES >>> (5 * (index + 1)) & 31
                );
                startIdx = i + 1;
            }
        }
        builder.append(content, startIdx, len);
    } catch (IOException e) {
        // typically, our Appendable is StringBuilder which does not throw;
        // also, there's no way to declare 'if A#append() throws E,
        // then appendEscaped() throws E, too'
        throw new UncheckedIOException(e);
    }
}
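
The index lookup in the code above can be studied in isolation. The following is a sketch of just that step, under the same set of four escapable characters; the class and method names (`BitmaskIndexDemo`, `replacementIndex`) are invented for illustration and are not part of the answer's code.

```java
final class BitmaskIndexDemo {
    private BitmaskIndexDemo() {}

    // One bit per escapable character; all four code points are below 64,
    // so the whole set fits into a single long.
    static final long ESCAPES =
            1L << '"' | 1L << '&' | 1L << '\'' | 1L << '<';

    /** Returns the rank of c among the escapable characters, or -1 if c is not one of them. */
    static int replacementIndex(char c) {
        if (c >= 64) {
            // Java uses only the low 6 bits of the shift distance for longs,
            // so without this check 'f' (102) would alias '&' (38).
            return -1;
        }
        long one = 1L << c;
        if ((one & ESCAPES) == 0) {
            return -1;
        }
        // The number of set bits below c's bit is c's position in sorted order.
        return Long.bitCount(ESCAPES & (one - 1));
    }
}
```

The returned rank (0 for `"`, 1 for `&`, 2 for `'`, 3 for `<`) is what selects a slice of the packed replacement string in the original code.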

Consider copy-pasting from Gist without line length limit.

UPD: As another answer suggests, > escaping is not necessary; also, " within attr='…' is allowed, too. I've updated the code accordingly.

You may check it out yourself:

<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>

<p title="&lt;&#34;I'm double-quoted!&#34;>">&lt;"Hello!"></p>
<p title='&lt;"I&#39;m single-quoted!">'>&lt;"Goodbye!"></p>

</body>
</html>

Unmitigated · Answer 12 · 2021-03-02T18:01:00.417

Java 8+ Solution:

import java.util.stream.Collectors;
// ...
public static String escapeHTML(String str) {
    return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
       "&#" + c + ";" : String.valueOf((char) c)).collect(Collectors.joining());
}

String#chars returns an IntStream of the char values from the String. We can then use mapToObj to escape the characters with a character code greater than 127 (non-ASCII characters) as well as the double quote ("), single quote ('), left angle bracket (<), right angle bracket (>), and ampersand (&). Collectors.joining concatenates the Strings back together.

To better handle Unicode characters, String#codePoints can be used instead.

public static String escapeHTML(String str) {
    return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
            "&#" + c + ";" : new String(Character.toChars(c)))
       .collect(Collectors.joining());
}
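
The difference between the two variants shows up with characters outside the Basic Multilingual Plane. The sketch below is a reduced comparison that only handles the non-ASCII branch (the ASCII special characters are left out to keep it short); the class and method names are hypothetical.

```java
import java.util.stream.Collectors;

final class CodePointEscapeDemo {
    private CodePointEscapeDemo() {}

    // Works on UTF-16 code units: a surrogate pair becomes two separate references.
    static String byChars(String s) {
        return s.chars()
                .mapToObj(c -> c > 127 ? "&#" + c + ";" : String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    // Works on Unicode code points: a surrogate pair becomes one reference.
    static String byCodePoints(String s) {
        return s.codePoints()
                .mapToObj(c -> c > 127 ? "&#" + c + ";" : new String(Character.toChars(c)))
                .collect(Collectors.joining());
    }
}
```

For the emoji U+1F600, written in Java as `"\uD83D\uDE00"`, `byChars` produces `&#55357;&#56832;` (two references to lone surrogates, which are not valid character references), while `byCodePoints` produces `&#128512;`.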
