33

I am of course familiar with the java.net.URLEncoder and java.net.URLDecoder classes. However, I only need HTML-style encoding (I don't want ' ' replaced with '+', etc.). I am not aware of any built-in JDK class that does just HTML encoding. Is there one? I am aware of other choices (for example, Jakarta Commons Lang's StringEscapeUtils), but I don't want to add another external dependency to the project where I need this.

I'm hoping that something has been added to a recent JDK (i.e. 5 or 6) that does this and that I don't know about. Otherwise I will have to roll my own.
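
To make the distinction concrete, here is a small sketch comparing what java.net.URLEncoder produces with the HTML-style output being asked for. The class name `UrlVsHtmlEncoding`, the helper `urlEncode`, and the sample string are illustrative only, and the "desired" value is written by hand rather than produced by any JDK call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlVsHtmlEncoding {

    // URL (form) encoding: spaces become '+', reserved characters become %XX.
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every JVM is required to support
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "a b & <c>";
        System.out.println(urlEncode(raw)); // a+b+%26+%3Cc%3E
        // Hand-written target: what HTML escaping should produce instead.
        System.out.println("a b &amp; &lt;c&gt;");
    }
}
```

The URL-encoded form is not usable inside an HTML page, which is why a separate HTML escaper is needed.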

Jon Onstott
Eddie

7 Answers

45

There isn't a built-in JDK class to do this, but it is part of the Jakarta commons-lang library.

// Commons Lang 3: pick the variant that matches the HTML version you target
String escapedHtml3 = StringEscapeUtils.escapeHtml3(stringToEscape);
String escapedHtml4 = StringEscapeUtils.escapeHtml4(stringToEscape);

Check out the JavaDoc

Adding the dependency is usually as simple as dropping the jar somewhere, and commons-lang has so many useful utilities that it is often worthwhile having it on board.

johnmcase
  • 9
    As I said in a comment to another answer, adding a dependency is *NOT* as simple as dropping a JAR somewhere. Lawyers need to go over the license for the 3rd party JAR, installers need to be changed, and so on. It's not always trivial. – Eddie Mar 17 '09 at 22:11
  • 3
    I also don't like the notion of taking a dependency for a single method. – Mohamed Nuur Mar 10 '11 at 02:06
  • 2
    Please note that your method signature above is wrong. the HTML should have a lowercase tml `String escaped = StringEscapeUtils.escapeHtml(stringToEscape);` – Eric Aug 12 '11 at 21:30
  • Is it possible to only escape special characters? – ziggy Feb 17 '14 at 20:40
  • 2
    Deprecated in 3.6. Use org.apache.commons.text.StringEscapeUtils instead. – Jeremiah Adams Jul 14 '17 at 18:33
14

A simple way seems to be this one:

/**
 * HTML-encodes a UTF-8 string; characters with codes above 127 are left as-is.
 * Prefer Apache Commons Text StringEscapeUtils where it is available.
 *
 * <pre>
 * escapeHtml("\tIt's time to hack & fun\r<script>alert(\"PWNED\")</script>")
 *    .equals("&#9;It&#39;s time to hack &amp; fun&#13;&lt;script&gt;alert(&quot;PWNED&quot;)&lt;/script&gt;")
 * </pre>
 */
public static String escapeHtml(String rawHtml) {
    int rawHtmlLength = rawHtml.length();
    // add 30% for additional encodings
    int capacity = (int) (rawHtmlLength * 1.3);
    StringBuilder sb = new StringBuilder(capacity);
    for (int i = 0; i < rawHtmlLength; i++) {
        char ch = rawHtml.charAt(i);
        if (ch == '<') {
            sb.append("&lt;");
        } else if (ch == '>') {
            sb.append("&gt;");
        } else if (ch == '"') {
            sb.append("&quot;");
        } else if (ch == '&') {
            sb.append("&amp;");
        } else if (ch < ' ' || ch == '\'') {
            // non-printable ASCII control characters are escaped as numeric entities;
            // HTML 4 has no &apos; entity, so the single quote is written as &#39;
            sb.append("&#").append((int)ch).append(';');
        } else {
            // any non-ASCII char (code above 127) is passed through unchanged
            sb.append(ch);
        }
    }
    return sb.toString();
}

But if you do need to escape all non-ASCII characters, for example because the encoded text will be transmitted over a 7-bit encoding, replace the last else with:

        } else {
            // encode non ASCII characters if needed
            int c = (ch & 0xFFFF);
            if (c > 127) {
                sb.append("&#").append(c).append(';');
            } else {
                sb.append(ch);
            }
        }
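
One caveat with the per-char variant above: characters outside the Basic Multilingual Plane (most emoji, for instance) are stored in a Java String as two surrogate chars, and writing each surrogate as its own numeric reference does not yield a valid character reference. A minimal sketch that iterates over code points instead is shown below. The class and method names (`AsciiHtmlEscaper`, `escapeHtmlAscii`) are made up for illustration, and this is a variation on the snippet above, not a drop-in from any library.

```java
final class AsciiHtmlEscaper {

    // Escapes HTML metacharacters and writes every non-ASCII code point
    // as a single decimal numeric character reference.
    static String escapeHtmlAscii(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + raw.length() / 3);
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);      // joins a surrogate pair into one value
            i += Character.charCount(cp);     // advance by 1 or 2 chars
            switch (cp) {
                case '<': sb.append("&lt;");   break;
                case '>': sb.append("&gt;");   break;
                case '"': sb.append("&quot;"); break;
                case '&': sb.append("&amp;");  break;
                default:
                    if (cp < ' ' || cp == '\'' || cp > 127) {
                        sb.append("&#").append(cp).append(';');
                    } else {
                        sb.append((char) cp);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) and U+1F600 (an emoji stored as a surrogate pair)
        System.out.println(escapeHtmlAscii("caf\u00E9 \uD83D\uDE00 <b>"));
        // caf&#233; &#128512; &lt;b&gt;
    }
}
```

An unpaired surrogate in the input would still be emitted as its own reference; whether to drop or replace such input is left open here.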
Sergey Ponomarev
Rawton Evolekam
  • 1
    I think you should also have a check for & - isn't that 38? – Rob Grant Apr 30 '14 at 12:03
  • This will function but it is not accurate to the specification. Instead of expressing the character numeric codes, the following must be encoded to their specified entities: < -> &lt;, " -> &quot; and & -> &amp; – Douglas Held Mar 24 '16 at 17:05
  • You also forgot the apostrophe. Which is the reason to never write your own security (escaping HTML is often security related, think XSS) code when there are working existing solutions. Like [HtmlUtils.htmlEscape(String)](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/util/HtmlUtils.html#htmlEscape-java.lang.String-) – David Balažic Sep 20 '18 at 10:57
  • FYI: your sample was rewritten for another question https://stackoverflow.com/a/25228492/1049542 with important note "the amp is missing there" – Sergey Ponomarev Sep 14 '20 at 11:21
  • A similar solution from JDK authors [jdk.test.lib.hprof.util.Misc#encodeHtml](https://github.com/openjdk/jdk/blob/6bab0f539fba8fb441697846347597b4a0ade428/test/lib/jdk/test/lib/hprof/util/Misc.java#L84) Another simple example is Grails mixin String.encodeAsHTML() which internally calls [BasicXMLEncoder](https://github.com/grails/grails-core/blob/7ab9e47ad805fbeb9433a488dd33f91bef44c0fa/grails-encoder/src/main/groovy/org/grails/encoder/impl/BasicXMLEncoder.java) – Sergey Ponomarev Sep 14 '20 at 11:54
12

Apparently, the answer is, "No." This was unfortunately a case where I had to do something and couldn't add a new external dependency for it -- in the short term. I agree with everyone that using Commons Lang is the best long-term solution. This is what I will go with once I can add a new library to the project.

It's a shame that something of such common use is not in the Java API.

Eddie
6

I've found that all existing solutions (libraries) I've reviewed suffered from one or several of the below issues:

  • They don't tell you in the Javadoc exactly what they replace.
  • They escape too much ... which makes the HTML much harder to read.
  • They do not document when the returned value is safe to use (safe for HTML element content? for an HTML attribute? etc.)
  • They are not optimized for speed.
  • They do not have a feature for avoiding double escaping (do not escape what is already escaped)
  • They replace single quote with &apos; (wrong!)

On top of this I also had the problem of not being able to bring in an external library, at least not without a certain amount of red tape.

So, I rolled my own. Guilty.

Below is what it looks like but the latest version can always be found in this gist.

/**
 * HTML string utilities
 */
public class SafeHtml {

    /**
     * Escapes a string for use in HTML element content or an HTML attribute.
     * 
     * <p>
     * The returned value is always suitable for HTML <i>element content</i> but only
     * suitable for an HTML <i>attribute</i> if the attribute value is inside
     * double quotes. In other words the method is not safe for use with HTML
     * attributes unless you put the value in double quotes like this:
     * <pre>
     *    &lt;div title="value-from-this-method" &gt; ....
     * </pre>
     * Putting attribute values in double quotes is always a good idea anyway.
     * 
     * <p>The following characters will be escaped:
     * <ul>
     *   <li>{@code &} (ampersand) -- replaced with {@code &amp;}</li>
     *   <li>{@code <} (less than) -- replaced with {@code &lt;}</li>
     *   <li>{@code >} (greater than) -- replaced with {@code &gt;}</li>
     *   <li>{@code "} (double quote) -- replaced with {@code &quot;}</li>
     *   <li>{@code '} (single quote) -- replaced with {@code &#39;}</li>
     *   <li>{@code /} (forward slash) -- replaced with {@code &#47;}</li>
     * </ul>
     * It is not necessary to escape more than this as long as the HTML page
     * <a href="https://en.wikipedia.org/wiki/Character_encodings_in_HTML">uses
     * a Unicode encoding</a>. (Most web pages use UTF-8, which is also the HTML5
     * recommendation.) Escaping more than this makes the HTML much less readable.
     * 
     * @param s the string to make HTML safe
     * @param avoidDoubleEscape avoid double escaping, which means for example not 
     *     escaping {@code &lt;} one more time. Any sequence {@code &....;}, as explained in
     *     {@link #isHtmlCharEntityRef(java.lang.String, int) isHtmlCharEntityRef()}, will not be escaped.
     * 
     * @return an HTML-safe string
     */
    public static String htmlEscape(String s, boolean avoidDoubleEscape) {
        if (s == null || s.length() == 0) {
            return s;
        }
        StringBuilder sb = new StringBuilder(s.length()+16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':
                    // Avoid double escaping if already escaped
                    if (avoidDoubleEscape && (isHtmlCharEntityRef(s, i))) {
                        sb.append('&');
                    } else {
                        sb.append("&amp;");
                    }
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                case '"':
                    sb.append("&quot;"); 
                    break;
                case '\'':
                    sb.append("&#39;"); 
                    break;
                case '/':
                    sb.append("&#47;"); 
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
  }

  /**
   * Checks if the value at {@code index} is a HTML entity reference. This
   * means any of :
   * <ul>
   *   <li>{@code &amp;} or {@code &lt;} or {@code &gt;} or {@code &quot;} </li>
   *   <li>A value of the form {@code &#dddd;} where {@code dddd} is a decimal value</li>
   *   <li>A value of the form {@code &#xhhhh;} where {@code hhhh} is a hexadecimal value</li>
   * </ul>
   * @param str the string to test for HTML entity reference.
   * @param index position of the {@code '&'} in {@code str}
   * @return {@code true} if a recognized entity reference starts at {@code index}
   */
  public static boolean isHtmlCharEntityRef(String str, int index)  {
      if (str.charAt(index) != '&') {
          return false;
      }
      int indexOfSemicolon = str.indexOf(';', index + 1);
      if (indexOfSemicolon == -1) { // is there a semicolon sometime later ?
          return false;
      }
      if (!(indexOfSemicolon > (index + 2))) {   // is the string actually long enough
          return false;
      }
      if (followingCharsAre(str, index, "amp;")
              || followingCharsAre(str, index, "lt;")
              || followingCharsAre(str, index, "gt;")
              || followingCharsAre(str, index, "quot;")) {
          return true;
      }
      if (str.charAt(index+1) == '#') {
          if (str.charAt(index+2) == 'x' || str.charAt(index+2) == 'X') {
              // It's presumably a hex value
              if (str.charAt(index+3) == ';') {
                  return false;
              }
              for (int i = index+3; i < indexOfSemicolon; i++) {
                  char c = str.charAt(i);
                  if (c >= 48 && c <=57) {  // 0 -- 9
                      continue;
                  }
                  if (c >= 65 && c <=70) {   // A -- F
                      continue;
                  }
                  if (c >= 97 && c <=102) {   // a -- f
                      continue;
                  }
                  return false;  
              }
              return true;   // yes, the value is a hex string
          } else {
              // It's presumably a decimal value
              for (int i = index+2; i < indexOfSemicolon; i++) {
                  char c = str.charAt(i);
                  if (c >= 48 && c <=57) {  // 0 -- 9
                      continue;
                  }
                  return false;
              }
              return true; // yes, the value is decimal
          }
      }
      return false;
  } 


  /**
   * Tests if the chars following position <code>startIndex</code> in string
   * <code>str</code> are that of <code>nextChars</code>.
   * 
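   * <p>Written as a plain loop for speed. It is equivalent to
   * {@code str.startsWith(nextChars, startIndex + 1)}.
   *
   * @param str the string to inspect
   * @param startIndex index of the character immediately before the expected sequence
   * @param nextChars the character sequence expected right after {@code startIndex}
   * @return {@code true} if {@code nextChars} occurs at {@code startIndex + 1}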
   */  
  private static boolean followingCharsAre(String str, int startIndex, String nextChars)  {
      if ((startIndex + nextChars.length()) < str.length()) {
          for(int i = 0; i < nextChars.length(); i++) {
              if ( nextChars.charAt(i) != str.charAt(startIndex+i+1)) {
                  return false;
              }
          }
          return true;
      } else {
          return false;
      }
  }
}

TODO: Preserve consecutive whitespace.
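
The "avoid double escaping" idea can also be sketched with a regular expression instead of the hand-written scanner above. This is a different technique from the class in this answer, shown only to illustrate the rule: an ampersand is left alone when it already starts one of the recognized references (amp, lt, gt, quot, decimal or hex numeric), and escaped otherwise. The names `AmpersandEscapeSketch` and `escapeBareAmpersands` are hypothetical.

```java
import java.util.regex.Pattern;

final class AmpersandEscapeSketch {

    // Matches '&' only when it is NOT followed by a recognized reference.
    // The negative lookahead mirrors the set accepted by isHtmlCharEntityRef().
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:amp|lt|gt|quot|#[0-9]+|#[xX][0-9a-fA-F]+);)");

    static String escapeBareAmpersands(String s) {
        return BARE_AMP.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // The already-escaped ampersand is kept, the bare one is escaped.
        System.out.println(escapeBareAmpersands("Tom &amp; Jerry & friends"));
        // Tom &amp; Jerry &amp; friends
    }
}
```

The regex version is shorter but will generally be slower than the character loop, so it fits best where readability matters more than throughput.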

peterh
1

Please don't roll your own. Use Jakarta Commons Lang. It is tested and proven to work. Don't write code until you have to. "Not invented here" or "Not another dependency" is not a very good basis for deciding what to choose or write.

bitboxer
  • 10
    In general, I would agree with you. But I'm adding an additional diagnostic output to something that is in production. Lawyers get involved when a new 3rd party dependency is added. It's not as trivial as you think. Otherwise I would not have asked the question! – Eddie Mar 17 '09 at 20:18
  • 5
    Keep the philosophy out of stackoverflow :) everyone has their reasons to rewrite code. – ricosrealm May 10 '12 at 08:52
  • Usually, that's advice for those who write code without knowing exactly what it does. Never listening to such advice made a developer out of me - I mean, that is how I learned and improved. – Ivaylo Slavov Sep 11 '12 at 09:10
  • Unless the project is supposed to be done yesterday and you have to take care of 3 other projects at the same time. Sometimes there are real-world constraints to think about, and rolling your own is usually a surefire way to introduce more bugs (and hence use more time). – futureelite7 Feb 04 '13 at 05:31
  • "'Not another dependency' is not a very good base for deciding what to choose / write." - I disagree. This mentality is the main reason most Java applications are such a bloated mess. – Greg Brown Mar 07 '16 at 17:12
  • just upgraded Eclipse, now it needs the 3.6 version, which is JRE8+ only, and the project I am upgrading it on must have 1.6 (as it rolls out to a machine that has 1.6 embedded on it). No! I can't upgrade the electronics as it's a 3rd party device, No! I can't tell them to update their fleet of 5 billion+ machines. And I can't roll back Eclipse since the plugin the 3rd party supplies expires and must have the new version. So I'm with @ricosrealm on this one, everybody has reasons that we wish we could do away with, but that's life. – Guy Park Sep 05 '17 at 06:37
0

No. I would recommend using the StringEscapeUtils you mentioned, or for example JTidy (http://jtidy.sourceforge.net/multiproject/jtidyservlet/apidocs/org/w3c/tidy/servlet/util/HTMLEncode.html).

simon
-1

I suggest using org.springframework.web.util.HtmlUtils.htmlEscape(String input).

Maybe this will help.