300

Is there a recommended way to escape <, >, " and & characters when outputting HTML in plain Java code? (Other than manually doing the following, that is).

String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = source.replace("<", "&lt;").replace("&", "&amp;"); // ...
Jesse Nickles
Ben Lings
  • 2
    Be aware that if you are outputting into an unquoted HTML attribute, other characters such as space, tab, backspace, etc. can allow attackers to introduce JavaScript attributes without using any of the characters listed. See the OWASP XSS Prevention Cheat Sheet for more. – Jeff Williams Mar 19 '14 at 17:01
  • 1
    BTW, in this code, you should escape "&" before "<" for this to work properly ("<" gets replaced with "&amp;lt;" otherwise, which is then rendered as "&lt;", not "<"): `source.replace("&", "&amp;").replace("<", "&lt;");` – Tey' Feb 23 '20 at 14:20
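
The ordering problem described in the comment above can be checked with a small sketch. The class and method names are made up for illustration; only `String.replace` from the standard library is used.

```java
final class EscapeOrderDemo {

    // Wrong order: the '&' introduced by "&lt;" is escaped a second time.
    static String lessThanFirst(String s) {
        return s.replace("<", "&lt;").replace("&", "&amp;");
    }

    // Right order: escape '&' first, then the characters whose entities contain '&'.
    static String ampersandFirst(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    public static void main(String[] args) {
        String source = "a < b & c";
        System.out.println(lessThanFirst(source));  // a &amp;lt; b &amp; c
        System.out.println(ampersandFirst(source)); // a &lt; b &amp; c
    }
}
```

A browser renders the first result as the literal text `a &lt; b & c`, which is not what was intended.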

12 Answers

287

StringEscapeUtils from Apache Commons Lang:

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);

For version 3:

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
// ...
String escaped = escapeHtml4(source);
Luke S.
dfa
  • 2
    While `StringEscapeUtils` is nice it will not escape whitespace properly for attributes if you wish to avoid HTML/XML whitespace normalization. See my answer for greater detail. – Adam Gent Aug 07 '13 at 20:28
  • StringEscapeUtils.escapeHtml() only accepts String as input, which seems unnecessarily rigid. In this modern world of JSON, some things output to the page will be numbers, for example, in which case this method breaks. – greim Apr 03 '14 at 19:48
  • 25
    The above example is broken. Use escapeHtml4() method now. – stackoverflowuser2010 Jun 24 '14 at 17:47
  • 3
    For Guava fans see [okranz's answer](http://stackoverflow.com/a/26572556/245602) below. – George Hawkins Jan 27 '15 at 12:28
  • 2
    If webpage has UTF-8 encoding then all we need is Guava's htmlEscaper that escapes only the following five ASCII characters: '"&<>. The Apache's escapeHtml() also replaces non-ASCII characters including accents which seems unnecessary with UTF-8 web pages ? – zdenekca Apr 20 '15 at 15:31
  • 1
    @greim - When might numbers contain content that needs to be escaped? – Greg Brown Mar 07 '16 at 16:49
  • @dfa- When I use escapeHtml(input), the double quote (") in my HTML string gets converted to &quot;, which I don't want. Is there any way to customize that? – Rashmi Ranjan mallick Feb 27 '17 at 06:46
  • 10
    It is now deprecated in commons-lang3. It was moved to https://commons.apache.org/proper/commons-text/ – Danny Aug 16 '17 at 14:11
  • @https://stackoverflow.com/users/597419/danny - that link does not mention escaping HTML. Can you be more specific? – Steve Staple Jul 11 '18 at 10:26
  • Here is a direct link to the latest namespace at [org.apache.commons.text.StringEscapeUtils](https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html#escapeHtml4-java.lang.String-) as @Danny mentioned – dbudzins Jul 20 '18 at 12:36
  • NOTE: StringEscapeUtils.escapeHtml does not escape the apostrophe character, leaving you with a big bug or even a security vulnerability. Use other working tools, like Spring's HtmlUtils or the others mentioned. – David Balažic Sep 20 '18 at 10:48
  • Use `org.apache.commons.text.StringEscapeUtils` for `apache.commons.text` – Spectric Apr 22 '21 at 02:50
164

An alternative to Apache Commons: Use Spring's HtmlUtils.htmlEscape(String input) method.

skaffman
Adamski
65

Nice short method:

public static String escapeHTML(String s) {
    StringBuilder out = new StringBuilder(Math.max(16, s.length()));
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c > 127 || c == '"' || c == '\'' || c == '<' || c == '>' || c == '&') {
            out.append("&#");
            out.append((int) c);
            out.append(';');
        } else {
            out.append(c);
        }
    }
    return out.toString();
}

Based on https://stackoverflow.com/a/8838023/1199155 (the amp is missing there). Apart from the apostrophe, the characters checked in the if clause are the only ones below 128 that have an entity, according to http://www.w3.org/TR/html4/sgml/entities.html

Aloso
Bruno Eberhard
  • 1
    Nice. It doesn't use the "html versions" of the encodings (example: "á" would be "&#225;" instead of "&aacute;"), but since the numeric ones work even in IE7 I guess I don't have to worry. Thanks. – nonzaprej Sep 04 '17 at 15:46
  • Why do you encode all those characters when the OP asked to escape the 4 relevant characters? You are wasting CPU and memory. – David Balažic Sep 20 '18 at 10:51
  • 2
    You forgot the apostrophe. So people can inject unquoted attributes everywhere where this code is used to escape attribute values. – David Balažic Sep 20 '18 at 10:59
  • this does not work when the string contains surrogate pairs, e.g. emojis. – Clashsoft Aug 14 '20 at 09:31
47

There is a newer version of the Apache Commons Lang library, and it uses a different package name (org.apache.commons.lang3). Its StringEscapeUtils now has different static methods for escaping different types of documents (http://commons.apache.org/proper/commons-lang/javadocs/api-3.0/index.html). So, to escape a string for HTML 4.0:

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;

String output = escapeHtml4("The less than sign (<) and ampersand (&) must be escaped before using them in HTML");
Dawood ibn Kareem
Martin Dimitrov
46

For those who use Google Guava:

import com.google.common.html.HtmlEscapers;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = HtmlEscapers.htmlEscaper().escape(source);
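
For readers without Guava on the classpath, here is a minimal standard-library sketch that escapes only the five ASCII characters `"`, `'`, `&`, `<` and `>` and leaves non-ASCII text alone, which is usually enough on a UTF-8 page. This is not Guava's implementation; the class name, method name, and the entity spellings chosen are assumptions for illustration.

```java
final class FiveCharEscaper {

    // Hypothetical helper: escapes only " ' & < > and copies everything else.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                default:   out.append(c);        // non-ASCII passes through unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Tom & Jerry's <b>")); // Tom &amp; Jerry&#39;s &lt;b&gt;
    }
}
```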
okrasz
42

Be careful with this. There are a number of different 'contexts' within an HTML document: Inside an element, quoted attribute value, unquoted attribute value, URL attribute, javascript, CSS, etc... You'll need to use a different encoding method for each of these to prevent Cross-Site Scripting (XSS). Check the OWASP XSS Prevention Cheat Sheet for details on each of these contexts. You can find escaping methods for each of these contexts in the OWASP ESAPI library -- https://github.com/ESAPI/esapi-java-legacy.
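
To illustrate why context matters, here is a small sketch (all names are hypothetical, and it does not use ESAPI). An escaper that is adequate for element content leaves a value with spaces untouched, so placing the result in an unquoted attribute still lets the input add a new attribute.

```java
final class ContextDemo {

    // Hypothetical element-content escaper: handles only & < > and ".
    static String escapeForText(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String userInput = "x onmouseover=alert(1)";
        String escaped = escapeForText(userInput); // unchanged: nothing in it needed escaping

        // Unquoted attribute: the space ends the value, the rest becomes a new attribute.
        System.out.println("<div title=" + escaped + ">");

        // Quoted attribute: the whole value stays inside the quotes.
        System.out.println("<div title=\"" + escaped + "\">");
    }
}
```

The first line printed is `<div title=x onmouseover=alert(1)>`, which is the kind of injection the cheat sheet warns about.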

Miha_x64
Jeff Williams
  • 6
    THANK YOU for pointing out that the *context* in which you wish to encode the output very much matters. The term "encode" is also a much more appropriate verb than "escape", as well. Escape implies some kind of special hack, as opposed to "how do I *encode* this string for: an XHTML attribute / SQL query parameter / PostScript print string / CSV output field? – Roboprog Apr 30 '13 at 01:07
  • 5
    'Encode' and 'escape' are both widely used to describe this. The term "escape" is generally used when the process is to add an "escape character" before a syntactically-relevant character, such as escaping a quote character with a backslash: \". The term "encode" is more typically used when you translate a character into a different form, such as URL encoding the quote character as %22 or HTML entity encoding it as &#34; or &quot;. – Jeff Williams Mar 19 '14 at 16:58
  • http://owasp-esapi-java.googlecode.com/svn/trunk_doc/latest/index.html. link now broke – andrew pate Jan 05 '17 at 22:09
  • 1
    To save you some googling, look for the Encoder class https://static.javadoc.io/org.owasp.esapi/esapi/2.0.1/org/owasp/esapi/Encoder.html#encodeForHTMLAttribute(java.lang.String) – Jakub Bochenski Aug 12 '19 at 09:49
40

On Android (API 16 or greater) you can use:

Html.escapeHtml(textToScape);

or for lower API:

TextUtils.htmlEncode(textToScape);
OriolJ
20

For some purposes, HtmlUtils:

import org.springframework.web.util.HtmlUtils;
// ...
HtmlUtils.htmlEscapeDecimal("&"); //gives &#38;
HtmlUtils.htmlEscape("&"); //gives &amp;
Eric
AUU
  • 1
    From the Spring HtmlUtils comments:

    For a comprehensive set of String escaping utilities, consider Apache Commons Lang and its StringEscapeUtils class. We are not using that class here to avoid a runtime dependency on Commons Lang just for HTML escaping. Furthermore, Spring's HTML escaping is more flexible and 100% HTML 4.0 compliant.

    If you are already using Apache Commons in your project, you should probably use the StringEscapeUtils from Apache.

    – andreyro Sep 13 '19 at 09:09
16

org.apache.commons.lang3.StringEscapeUtils is now deprecated. Use org.apache.commons.text.StringEscapeUtils instead, by adding the commons-text dependency:

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-text</artifactId>
        <version>${commons.text.version}</version>
    </dependency>
14

While @dfa's answer using org.apache.commons.lang.StringEscapeUtils.escapeHtml is nice, and I have used it in the past, it should not be used for escaping HTML (or XML) attributes; otherwise the whitespace will be normalized (meaning all adjacent whitespace characters become a single space).

I know this because I have had bugs filed against my library (JATL) for attributes where whitespace was not preserved. Thus I have a drop in (copy n' paste) class (of which I stole some from JDOM) that differentiates the escaping of attributes and element content.

While proper attribute escaping may not have mattered as much in the past, it is becoming increasingly important given the use of HTML5's data- attributes.
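
As a rough idea of what attribute-specific escaping can look like, here is a minimal sketch for double-quoted attribute values. It is not the JATL or JDOM code; the class and method names are made up. It writes tab, line feed and carriage return as numeric character references, which a parser does not treat as literal whitespace during attribute-value normalization.

```java
final class AttributeEscapeDemo {

    // Hypothetical escaper for values placed inside double-quoted attributes.
    static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\t': out.append("&#9;");   break; // keep tabs
                case '\n': out.append("&#10;");  break; // keep line feeds
                case '\r': out.append("&#13;");  break; // keep carriage returns
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String value = "line one\n\tline \"two\"";
        System.out.println("<div data-note=\"" + escapeAttribute(value) + "\"></div>");
    }
}
```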

Adam Gent
1

Most libraries escape everything they can, including hundreds of symbols and thousands of non-ASCII characters, which is not what you want in a UTF-8 world.

Also, as Jeff Williams noted, there's no single “escape HTML” option, there are several contexts.

Assuming you never use unquoted attributes, and keeping in mind that different contexts exist, I've written my own version:

private static final long TEXT_ESCAPE =
        1L << '&' | 1L << '<';
private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '"';
private static final long SINGLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '\'';
private static final long ESCAPES =
        DOUBLE_QUOTED_ATTR_ESCAPE | SINGLE_QUOTED_ATTR_ESCAPE;

// 'quot' and 'apos' are 1 char longer than '#34' and '#39'
// which I've decided to use
private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
private static final int REPL_SLICES = /*  [0,   5,   10,  15, 19) */
        5<<5 | 10<<10 | 15<<15 | 19<<20;
// These 5-bit numbers packed into a single int
// are indices within REPLACEMENTS which is a 'flat' String[]

private static void appendEscaped(
        Appendable builder, CharSequence content, long escapes) {
    try {
        int startIdx = 0, len = content.length();
        for (int i = 0; i < len; i++) {
            char c = content.charAt(i);
            long one;
            if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
            // -^^^^^^^^^^^^^^^   -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            // |                  | take only dangerous characters
            // | Java uses only the 6 least significant bits of a long shift distance,
            // | e.g. << 0b110111111 is the same as << 0b111111.
            // | Filter out bigger characters

                int index = Long.bitCount(ESCAPES & (one - 1));
                builder.append(content, startIdx, i /* exclusive */).append(
                        REPLACEMENTS,
                        REPL_SLICES >>> (5 * index) & 31,
                        REPL_SLICES >>> (5 * (index + 1)) & 31
                );
                startIdx = i + 1;
            }
        }
        builder.append(content, startIdx, len);
    } catch (IOException e) {
        // typically, our Appendable is StringBuilder which does not throw;
        // also, there's no way to declare 'if A#append() throws E,
        // then appendEscaped() throws E, too'
        throw new UncheckedIOException(e);
    }
}

Consider copy-pasting from Gist without line length limit.
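
The lookup `Long.bitCount(ESCAPES & (one - 1))` in the code above may be easier to follow in isolation. The sketch below (class and method names are hypothetical) computes the same thing: the position of a character among the escaped characters, ordered by character code, which is then the slot in the packed replacement table.

```java
final class BitmaskRankDemo {

    // Same four characters as ESCAPES above: " (34), & (38), ' (39), < (60).
    static final long ESCAPES = 1L << '"' | 1L << '&' | 1L << '\'' | 1L << '<';

    // Returns the position of c among the escaped characters, or -1 if c is not escaped.
    static int rank(char c) {
        if (c >= 64) {
            return -1; // a long has only 64 bits, larger codes cannot be in the mask
        }
        long bit = 1L << c;
        if ((bit & ESCAPES) == 0) {
            return -1;
        }
        // (bit - 1) selects all lower bits; counting those set in ESCAPES gives the position.
        return Long.bitCount(ESCAPES & (bit - 1));
    }

    public static void main(String[] args) {
        for (char c : new char[] {'"', '&', '\'', '<', '>', 'a'}) {
            System.out.println((int) c + " -> " + rank(c));
        }
    }
}
```

The ranks 0 to 3 match the order of the entries in `REPLACEMENTS`: `&#34;`, `&amp;`, `&#39;`, `&lt;`.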

UPD: As another answer suggests, > escaping is not necessary; also, " within attr='…' is allowed, too. I've updated the code accordingly.

You may check it out yourself:

<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>

<p title="&lt;&#34;I'm double-quoted!&#34;>">&lt;"Hello!"></p>
<p title='&lt;"I&#39;m single-quoted!">'>&lt;"Goodbye!"></p>

</body>
</html>
Miha_x64
1

Java 8+ Solution:

public static String escapeHTML(String str) {
    return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
       "&#" + c + ";" : String.valueOf((char) c)).collect(Collectors.joining());
}

String#chars returns an IntStream of the char values from the String. We can then use mapToObj to escape the characters with a character code greater than 127 (non-ASCII characters) as well as the double quote ("), single quote ('), left angle bracket (<), right angle bracket (>), and ampersand (&). Collectors.joining concatenates the Strings back together.

To better handle Unicode characters, String#codePoints can be used instead.

public static String escapeHTML(String str) {
    return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
            "&#" + c + ";" : new String(Character.toChars(c)))
       .collect(Collectors.joining());
}
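
To see the difference between the two versions, here is a self-contained sketch (the class and method names are mine) that runs both on a character outside the Basic Multilingual Plane. With `chars()` the two surrogates are escaped separately, which is not a valid character reference; with `codePoints()` a single reference is produced.

```java
import java.util.stream.Collectors;

final class CodePointEscapeDemo {

    // Per-char version: surrogate halves are escaped one by one.
    static String escapeByChar(String str) {
        return str.chars()
                .mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1
                        ? "&#" + c + ";" : String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    // Per-code-point version: a surrogate pair becomes one numeric reference.
    static String escapeByCodePoint(String str) {
        return str.codePoints()
                .mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1
                        ? "&#" + c + ";" : new String(Character.toChars(c)))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println(escapeByChar(smiley));      // &#55357;&#56832;
        System.out.println(escapeByCodePoint(smiley)); // &#128512;
    }
}
```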
Unmitigated