106

Very similar to this question, except for Java.

What is the recommended way of encoding strings for XML output in Java? The strings might contain characters like "&", "<", etc.

Community
  • 1
  • 1
Epaga
  • 38,231
  • 58
  • 157
  • 245

22 Answers

130

As others have mentioned, using an XML library is the easiest way. If you do want to escape yourself, you could look into StringEscapeUtils from the Apache Commons Lang library.

Stevoisiak
  • 23,794
  • 27
  • 122
  • 225
Fabian Steeg
  • 44,988
  • 7
  • 85
  • 112
  • This could be the way to go if you don't care about absolute correctness, for example if you are putting together a prototype. – Chase Seibert Jan 13 '09 at 18:32
  • The escapeXml method of StringEscapeUtils seems to be a bit costly. Is there a more efficient method that operates on a StringBuffer instead of a String? – Chetan Kinger Sep 13 '12 at 07:00
  • 3
    Use `StringEscapeUtils.escapeXml(str)` from [`commons-lang`](http://commons.apache.org/proper/commons-lang/). I use it in an App Engine application and it works like a charm. Here is the [Javadoc](https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringEscapeUtils.html) for this function. – Oleg K Feb 15 '11 at 19:04
  • 1
    Does this method work for both XML content and attributes? To me it seems like it doesn't work for attributes. It doesn't seem to escape `\t`, `\n` and `\r`. – Lii Sep 27 '17 at 11:39
  • @Lii Do `\t`, `\n` or `\r` need to be escaped? – Betlista Apr 16 '20 at 09:47
  • Note that `StringEscapeUtils.escapeXml()` does not escape control characters, which are invalid in XML in many situations – Chin Nov 26 '20 at 20:13
  • 4
    Note that it has been moved from `commons-lang` to `commons-text` – Gregor Jan 29 '21 at 15:11
37

Very simply: use an XML library. That way it will actually be right instead of requiring detailed knowledge of bits of the XML spec.
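
As a rough illustration of what "use an XML library" can look like with only the APIs bundled in the JDK, here is a minimal sketch that builds a small DOM tree and serializes it with an identity transform. The class name, the element name `note` and the attribute name `author` are made up for this example; the point is only that the raw strings are handed to the API and the serializer takes care of the escaping.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of any answer in this thread.
final class DomEscapingSketch {

    // Builds <note author="...">text</note> and serializes it to a string.
    static String toXml(String author, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element note = doc.createElement("note");
            note.setAttribute("author", author); // raw value, no manual escaping
            note.setTextContent(text);           // raw value, no manual escaping
            doc.appendChild(note);

            // A transformer without a stylesheet simply serializes the tree.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Checked JAXP exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException("XML serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("Tom & Jerry", "1 < 2 && 3 > 2"));
    }
}
```

The serializer decides which characters to escape in text and in attribute values, so the calling code never touches "&amp;" or "&lt;" directly.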

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • 28
    Can you recommend such a library? (I find it surprising that this is not a standard part of Java edition 5...such a common task). – Tim Cooper Nov 16 '09 at 06:23
  • 4
    XML *is* part of the standard Java framework - look in org.xml.sax and org.w3c.dom. However, there are some easier-to-use frameworks around as well, such as JDom. Note that there may not be an "encoding strings for XML output" method - I was more recommending that the whole XML task should be done with a library rather than just doing bits at a time with string manipulation. – Jon Skeet Nov 16 '09 at 06:28
  • 1
    This is not such useful advice when outputting XHTML - FlyingSaucer requires XML, but there ain't no way I'm templating through an XML lib :). Thankfully StringTemplate allows me to quickly escape all String objects. – Stephen Jan 13 '10 at 10:45
  • 1
    @Stephen: I would expect an XHTML library to use an XML library to keep everything sane, but expose an XHTML-centric API. Having to do escaping manually (and make sure you get it right *everywhere*) is not a great idea IMO. – Jon Skeet Jan 13 '10 at 11:19
  • To convert a DOM tree to an XML-string, use a transformer without a style sheet. – Thorbjørn Ravn Andersen May 19 '10 at 07:55
  • I wouldn't call it "Very simply". Some platforms don't have an XML generation library, yet you may need to encode some text into XML. Adding a few hundred KB of library just for this task is not simple, and not wanted. I wouldn't accept this answer. – Pointer Null Apr 05 '12 at 19:50
  • 5
    @mice: The question is tagged Java, and Java has *lots* of XML libraries. Indeed, there are XML APIs baked into Java, so there'd be no need to add *anything* else... but even if you did, a few hundred K is rarely a problem outside mobile these days. Even if it weren't Java, I'd be very wary of developing on a platform which didn't have any XML APIs... – Jon Skeet Apr 05 '12 at 19:52
  • I'm considering Android. It uses Java, and apps have to be small. It has XML parsers, but I'm not aware of the opposite (is it called an "XML serializer"?). – Pointer Null Apr 05 '12 at 19:55
  • 2
    @mice: The DOM API is perfectly capable of generating XML. Or there are fairly small third-party libraries. (JDom's jar file is 114K for example.) **Using an XML API is still the recommended way of creating XML.** – Jon Skeet Apr 05 '12 at 20:03
  • What about this: http://stackoverflow.com/questions/439298/best-way-to-encode-text-data-for-xml-in-java/10035382#10035382 for simple purpose of escaping xml text (not building xml). – Pointer Null Apr 05 '12 at 20:17
  • @mice: I think I've made my position pretty clear. If I want to do anything with XML, I'll use an XML API. That's what they're good at. In my experience it's pretty rare to need to escape XML when you're not *building* XML. I'm not going to comment on the suitability of some code I wouldn't use in principle. (EDIT: Actually, I will in this case. I'll comment directly.) – Jon Skeet Apr 05 '12 at 20:26
  • No problem with your approach. However, I use code that creates XML with String.format, filling some text into a pre-constructed XML string. You may use an XML lib; I can't in my specific case. – Pointer Null Apr 05 '12 at 20:31
  • 1
    @mice: It sounds like you may have picked a bad tool to start with then. Any library which is creating XML for me and inserting bits of text into it should be doing the escaping itself. It's not easy getting the full picture of what your specific requirements are in comments, but I certainly stand by my answer. – Jon Skeet Apr 05 '12 at 20:35
  • 1
    Just a general observation, seeing that the word "right" is emphasized: just using any XML library doesn't guarantee that it will be right ;-). Library developers are also human. Of course you'd be pretty safe with standard stuff or something like Apache Commons Lang... I'm just ever so often amazed by how people trust other people's code blindly... – Hannes de Jager Sep 10 '12 at 07:45
  • @JonSkeet I am converting csv file to xml just using Java (no groovy). What kind of XML libraries are out there for such conversion? Thanks! – Charu Khurana Apr 21 '14 at 18:20
  • @Learner: There are loads. You might want to start with jdom. – Jon Skeet Apr 21 '14 at 18:22
  • We are using the woodstox stax library and it does not have a way of writing text where it will encode the special characters. It has the call writeCharacters but it doesn't encode > (does encode <). – David Thielen Jun 10 '18 at 20:45
  • 1
    @DavidThielen That doesn't need to be encoded. It often is, but IIRC the XML spec calls it out as being okay not to be encoded. – Jon Skeet Jun 11 '18 at 02:43
  • Be careful! XML, and therefore also the DOM API, only supports characters in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. If you want to use characters outside of these ranges, you additionally have to escape them with your own escaping system. You can use my methods here https://stackoverflow.com/a/59475093/3882565. – stonar96 Dec 25 '19 at 12:51
  • 1
    @stonar96 Yes, I've been assuming that a valid XML document is the desired result. If you need to express things that can't be expressed in XML, that's a bigger problem. It's a shame that XML 1.1 never really took off, as that fixes this problem. – Jon Skeet Dec 26 '19 at 09:08
  • Sorry man, I downvoted it and was wrong. Now cannot change it back. This answer is correct – Alexandr Mar 27 '20 at 17:02
  • @Alexandr: Really not a problem :) – Jon Skeet Mar 27 '20 at 18:12
20

Just use:

<![CDATA[ your text here ]]>

This will allow any characters except the ending

]]>

So you can include characters that would otherwise need escaping, such as & and >. For example:

<element><![CDATA[ characters such as & and > are allowed ]]></element>

However, attribute values still need to be escaped, as CDATA blocks cannot be used inside them.
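
One caveat with this approach is that the text itself may contain the closing marker "]]>". A minimal sketch of the usual workaround, splitting the marker across two adjacent CDATA sections, might look like the following. The class and method names are invented for this example, and the sketch does not deal with characters that are illegal in XML altogether (such as most control characters).

```java
// Hypothetical helper, not part of any answer in this thread.
final class CdataSketch {

    // Wraps text in a CDATA section. Every "]]>" in the text is split so that
    // "]]" ends one section and ">" starts the next; a parser concatenates the
    // adjacent sections back into the original text.
    static String toCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        System.out.println(toCdata("x & y"));
        System.out.println(toCdata("a]]>b"));
    }
}
```

For the input a]]>b this produces two sections, one containing a]] and one containing >b.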

ng.
  • 7,099
  • 1
  • 38
  • 42
  • 12
    In most cases, that is not what you should do. Too many people abuse the CDATA tags. The intent of the CDATA is to tell the processor not to process it as XML and just pass it through. If you are trying to create an XML file, then you should be creating XML, not just passing bytes through some wrapping element. – Mads Hansen May 16 '09 at 16:05
  • 2
    @Mads, using CDATA results in a valid XML file so it is just as fine as doing it the "right way". If you dislike it, then parse it afterwards, identity transform it, and print it. – Thorbjørn Ravn Andersen May 19 '10 at 07:56
  • 26
    If you wrap text in a CDATA element you have to escape the CDATA closing marker: "]]>"... except you cannot escape that. So instead you have to break your code into pieces where you put half of the data in one CDATA element and the other half in a second: <![CDATA[This data contain a CDATA closing marker: "]]]]><![CDATA[>" that is why it had to be split up.]]> ... In the end it may be a lot simpler to just escape '<', '>' and '&' instead. Of course many apps ignore the potential problem with CDATA closing markers in the data. Ignorance is bliss I guess. :) – Stijn de Witt Dec 14 '10 at 12:39
  • 3
    @StijndeWitt is absolutely correct. CDATA is not a panacea for escaping special characters. – dnault Dec 05 '14 at 22:52
  • This is a bad idea. CDATA does not allow any character outside of the XML's encoding. – Florian F Feb 20 '20 at 09:51
  • In an XML file (Java and DOM parser), "&lt;" is present as a node's text value, but when node.getContentType is used for this node it is converted to "<". Is there any way to retrieve "&lt;" itself, instead of "<"? – Rohit Kumar Aug 06 '20 at 06:47
18

This question is eight years old and still has no fully correct answer! No, you should not have to import an entire third party API to do this simple task. Bad advice.

The following method will:

  • correctly handle characters outside the basic multilingual plane
  • escape characters required in XML
  • escape any non-ASCII characters, which is optional but common
  • replace illegal characters in XML 1.0 with the Unicode substitution character. There is no best option here - removing them is just as valid.

I've tried to optimise for the most common case, while still ensuring you could pipe /dev/random through this and get a valid string in XML.

public static String encodeXML(String s) {
    StringBuilder sb = new StringBuilder();
    int len = s.length();
    for (int i=0;i<len;) {
        int c = s.codePointAt(i);
        if (c < 0x80) {      // ASCII range: test most common case first
            if (c < 0x20 && (c != '\t' && c != '\r' && c != '\n')) {
                // Illegal XML character, even encoded. Skip or substitute
                sb.append("&#xfffd;");   // Unicode replacement character
            } else {
                switch(c) {
                  case '&':  sb.append("&amp;"); break;
                  case '>':  sb.append("&gt;"); break;
                  case '<':  sb.append("&lt;"); break;
                  // Uncomment next two if encoding for an XML attribute
//                  case '\'': sb.append("&apos;"); break;
//                  case '\"': sb.append("&quot;"); break;
                  // Uncomment next three if you prefer, but not required
//                  case '\n': sb.append("&#10;"); break;
//                  case '\r': sb.append("&#13;"); break;
//                  case '\t': sb.append("&#9;"); break;

                  default:   sb.append((char)c);
                }
            }
        } else if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff) {
            // Illegal XML character, even encoded. Skip or substitute
            sb.append("&#xfffd;");   // Unicode replacement character
        } else {
            sb.append("&#x");
            sb.append(Integer.toHexString(c));
            sb.append(';');
        }
        i += c <= 0xffff ? 1 : 2;
    }
    return sb.toString();
}

Edit: for those who continue to insist it is foolish to write your own code for this when there are perfectly good Java APIs to deal with XML, you might like to know that the StAX API included with Oracle Java 8 (I haven't tested others) fails to encode CDATA content correctly: it doesn't escape ]]> sequences in the content. A third party library, even one that's part of the Java core, is not always the best option.

Mike B
  • 1,600
  • 1
  • 12
  • 8
  • +1 for standalone code. Just comparing your code with [guava implementation](https://google.github.io/guava/releases/26.0-jre/api/docs/src-html/com/google/common/xml/XmlEscapers.html#line.75), I'm wondering what about '\t','\n','\r' ? See also notes at [guava docs](https://google.github.io/guava/releases/26.0-jre/api/docs/com/google/common/xml/XmlEscapers.html#xmlContentEscaper--) – jschnasse Sep 17 '18 at 09:51
  • 2
    There's no need to escape \n, \r and \t, they are valid, although they do make formatting a bit ugly. I've modified the code to show how to escape them if that's what you want. – Mike B Dec 18 '18 at 12:14
  • 2
    There is *no* way to "escape ]]>" in CDATA. – kmkaplan Oct 22 '19 at 06:42
  • 1
    Then it should reject the content by throwing an IllegalArgumentException. Under no circumstances should it claim to succeed but still output invalid XML. – Mike B Oct 23 '19 at 11:20
  • Instead of replacing illegal characters in XML 1.0 with the Unicode substitution character you can use my methods here https://stackoverflow.com/a/59475093/3882565. – stonar96 Dec 25 '19 at 12:41
  • Useful and appreciated! Nevertheless hard to read code. – Max M Jun 09 '22 at 14:58
13

Try this:

String xmlEscapeText(String t) {
   StringBuilder sb = new StringBuilder();
   for(int i = 0; i < t.length(); i++){
      char c = t.charAt(i);
      switch(c){
      case '<': sb.append("&lt;"); break;
      case '>': sb.append("&gt;"); break;
      case '\"': sb.append("&quot;"); break;
      case '&': sb.append("&amp;"); break;
      case '\'': sb.append("&apos;"); break;
      default:
         if(c>0x7e) {
            sb.append("&#"+((int)c)+";");
         }else
            sb.append(c);
      }
   }
   return sb.toString();
}
Pointer Null
  • 39,597
  • 13
  • 90
  • 111
  • 9
    You've got at least two bugs that I can see. One is subtle, the other isn't. I wouldn't have such a bug - because I wouldn't reinvent the wheel in the first place. – Jon Skeet Apr 05 '12 at 20:29
  • 1
    And iterating through Unicode strings is a bit more complicated. See here: http://stackoverflow.com/q/1527856/402322 – ceving Sep 26 '12 at 16:33
  • I assume the non-subtle bug is the "guot" which was corrected - I also get a warning on the appending concatenated strings in the StringBuilder. What is the subtle bug? I honestly like a stand-alone solution like this for my current implementation, embedded where we can't import apache libraries. – Guy Starbuck Dec 12 '12 at 15:48
  • 1
    Not sure it is _subtle_, but it had better handle the case where `t==null`. – Myobis Dec 12 '13 at 23:08
  • As a comparison, org.apache.commons.lang3.StringEscapeUtils.escapeXml supports only the five basic XML entities (gt, lt, quot, amp, apos). Note that Unicode characters greater than 0x7f are no longer escaped. ([source](http://bit.ly/19FpMvH)) – Myobis Dec 12 '13 at 23:17
  • Shouldn't your default case's if condition read "if (c<32 || c>0x7e) {"? Otherwise, you are encoding all characters less than space as just themselves which is invalid XML content, right? – chaotic3quilibrium Dec 19 '13 at 22:05
  • @chaotic3quilibrium: expected chars <32 are only new-lines or tabs, and these are not escaped. – Pointer Null Dec 20 '13 at 08:35
  • @PointerNull Are you sure about that?! I've read MANY other places that those are to be escaped, too. Do you have an official reference you can cite which explicitly states they are not to be escaped? If so, I'd really appreciate it (and perhaps it would begin to stem the vast number of suggestions to encode this space). – chaotic3quilibrium Dec 20 '13 at 15:01
  • @PointerNull Ok, It's an old thing to mention now, but a lot of the non-printable Unicode characters that map to ASCII for compatibility are going to be passed along with this routine. I'm thinking of things like embedded null values `0x00`, embedded "start of text" `0x02`, "end of transmission" `0x04`, and so on. Certainly not to be expected in your typical Java string, but it's funny how such things get slipped in. – Edwin Buck Nov 13 '14 at 18:31
  • This is a terrible solution; anyone reading, do not use this. This will convert "&amp;" into "&amp;amp;" and whatnot. – cowsay Dec 17 '14 at 16:06
  • 1
    @user1003916: XML escaping is designed to convert any & occurrence into &amp; so that's how it has to work. If you escape an already escaped string, that's your fault. – Pointer Null Dec 19 '14 at 09:33
  • 3
    I'm happy with the final version. Java SE is compact, fast, and efficient. Doing just what needs to be done rather than downloading another 100 MB of bloatware is always better in my book. – Roger F. Gay Nov 10 '15 at 16:40
  • 2
    All characters below 0x20 except 0x09, 0x0A and 0x0D are invalid in XML. This applies whether they are escaped or not. The only correct way to handle those is to skip them or throw an Exception. Other than that, this is a good solution and similar to the one we'd typically use. – Mike B Nov 18 '16 at 10:40
  • @ceving You don't have to deal with Unicode here as all characters outside the BMP can be simply copied as they are. The only five codepoints needing processing are in the BMP. – maaartinus Oct 12 '19 at 05:16
  • For a method which also supports invalid XML characters like ```'\u0000'``` see my answer here https://stackoverflow.com/a/59475093/3882565. – stonar96 Dec 25 '19 at 12:35
  • The question was "what is the recommended way". Writing your own method for a common task isn't. – Florian F Feb 20 '20 at 09:56
13

This has worked well for me to provide an escaped version of a text string:

public class XMLHelper {

/**
 * Returns the string where all non-ASCII characters and <, &, > are encoded as numeric entities. I.e. "&lt;A &amp; B &gt;"
 * .... (insert result here). The result is safe to include anywhere in a text field in an XML string. If there were
 * no characters to protect, the original string is returned.
 * 
 * @param originalUnprotectedString
 *            original string which may contain characters either reserved in XML or with different representation
 *            in different encodings (like 8859-1 and UTF-8)
 * @return the protected string, the original string if nothing needed protecting, or null if the input was null
 */
public static String protectSpecialCharacters(String originalUnprotectedString) {
    if (originalUnprotectedString == null) {
        return null;
    }
    boolean anyCharactersProtected = false;

    StringBuffer stringBuffer = new StringBuffer();
    for (int i = 0; i < originalUnprotectedString.length(); i++) {
        char ch = originalUnprotectedString.charAt(i);

        boolean controlCharacter = ch < 32;
        boolean unicodeButNotAscii = ch > 126;
        boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';

        if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
            stringBuffer.append("&#" + (int) ch + ";");
            anyCharactersProtected = true;
        } else {
            stringBuffer.append(ch);
        }
    }
    if (anyCharactersProtected == false) {
        return originalUnprotectedString;
    }

    return stringBuffer.toString();
}

}
Redwood
  • 66,744
  • 41
  • 126
  • 187
Thorbjørn Ravn Andersen
  • 73,784
  • 33
  • 194
  • 347
  • 1
    stringBuffer.append("&#" + (int) ch + ";"); This won't work for multibyte characters. I'm running into this right now with an emoji character, UTF8 sequence F0 9F 98 8D. – Kylar Dec 15 '11 at 16:20
9

StringEscapeUtils.escapeXml() does not escape control characters (< 0x20). XML 1.1 allows control characters; XML 1.0 does not. For example, XStream.toXML() will happily serialize a Java object's control characters into XML, which an XML 1.0 parser will reject.

To escape control characters with Apache commons-lang, use

NumericEntityEscaper.below(0x20).translate(StringEscapeUtils.escapeXml(str))
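
If pulling in a library is not an option, the XML 1.0 character ranges quoted in the comments above (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]) can be checked with the JDK alone. The following is only a sketch under that assumption; the class and method names are invented, and replacing illegal characters with U+FFFD is one possible policy (dropping them or throwing an exception are equally defensible).

```java
// Hypothetical helper, not part of commons-lang or any other library.
final class Xml10Chars {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every illegal code point (including unpaired surrogates)
    // with U+FFFD, the Unicode replacement character.
    static String replaceIllegal(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(isLegal(cp) ? cp : 0xFFFD);
            i += Character.charCount(cp); // 2 for characters outside the BMP
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The NUL character between "a" and "b" is replaced.
        System.out.println(replaceIllegal("a\u0000b"));
    }
}
```

This only sanitizes the characters; the result still has to be escaped (for "&", "<" and so on) before it is placed into a document.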
Steve Mitchell
  • 1,895
  • 1
  • 15
  • 12
9
public String escapeXml(String s) {
    return s.replaceAll("&", "&amp;").replaceAll(">", "&gt;").replaceAll("<", "&lt;").replaceAll("\"", "&quot;").replaceAll("'", "&apos;");
}
iCrazybest
  • 2,935
  • 2
  • 24
  • 24
  • 7
    Chaining `replaceAll` calls is very inefficient, especially for large strings. Every call results in a new String object being created, which will hang around until garbage collected. Also, each call requires looping through the string again. This could be consolidated into one single manual loop with comparisons against each target char in every iteration. – daiscog Jan 27 '15 at 14:56
  • 2
    This should be the accepted answer, even if it is inefficient. It solves the problem in a single line. – Stimpson Cat Feb 13 '18 at 08:15
  • And it has many bugs. See [this comment above](https://stackoverflow.com/questions/439298/best-way-to-encode-text-data-for-xml-in-java#comment68579952_10035382) – David Balažic Nov 12 '18 at 18:08
  • To fix these bugs you can additionally use my method here https://stackoverflow.com/a/59475093/3882565. Note that this is not a replacement but it can be used additionally. – stonar96 Dec 25 '19 at 12:43
8

For those looking for the quickest-to-write solution: use the StringEscapeUtils methods from Apache commons-lang.

Remember to include the dependency:

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-lang3</artifactId>
  <version>3.5</version> <!--check current version! -->
</dependency>
Dariusz
  • 21,561
  • 9
  • 74
  • 114
6

While I agree with Jon Skeet in principle, sometimes I don't have the option to use an external XML library. And I find it peculiar that the two functions to escape/unescape a simple value (attribute or tag, not full document) are not available in the standard XML libraries included with Java.

As a result and based on the different answers I have seen posted here and elsewhere, here is the solution I've ended up creating (nothing worked as a simple copy/paste):

  public final static String ESCAPE_CHARS = "<>&\"\'";
  public final static List<String> ESCAPE_STRINGS = Collections.unmodifiableList(Arrays.asList(new String[] {
      "&lt;"
    , "&gt;"
    , "&amp;"
    , "&quot;"
    , "&apos;"
  }));

  private static String UNICODE_NULL = "" + ((char)0x00); //null
  private static String UNICODE_LOW =  "" + ((char)0x20); //space
  private static String UNICODE_HIGH = "" + ((char)0x7f);

  //should only be used for the content of an attribute or tag      
  public static String toEscaped(String content) {
    String result = content;
    
    if ((content != null) && (content.length() > 0)) {
      boolean modified = false;
      StringBuilder stringBuilder = new StringBuilder(content.length());
      for (int i = 0, count = content.length(); i < count; ++i) {
        String character = content.substring(i, i + 1);
        int pos = ESCAPE_CHARS.indexOf(character);
        if (pos > -1) {
          stringBuilder.append(ESCAPE_STRINGS.get(pos));
          modified = true;
        }
        else {
          if (    (character.compareTo(UNICODE_LOW) > -1)
               && (character.compareTo(UNICODE_HIGH) < 1)
             ) {
            stringBuilder.append(character);
          }
          else {
            //Per URL reference below, Unicode null character is always restricted from XML
            //URL: https://en.wikipedia.org/wiki/Valid_characters_in_XML
            if (character.compareTo(UNICODE_NULL) != 0) {
              stringBuilder.append("&#" + ((int)character.charAt(0)) + ";");
            }
            modified = true;
          }
        }
      }
      if (modified) {
        result = stringBuilder.toString();
      }
    }
    
    return result;
  }

The above accommodates several different things:

  1. avoids using char based logic until it absolutely has to - improves unicode compatibility
  2. attempts to be as efficient as possible, given that the second "if" condition is likely the most-used path
  3. is a pure function; i.e. is thread-safe
  4. optimizes nicely with the garbage collector by only returning the contents of the StringBuilder if something actually changed - otherwise, the original string is returned

At some point, I will write the inversion of this function, toUnescaped(). I just don't have time to do that today. When I do, I will come update this answer with the code. :)
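
Until then, a possible shape for the inverse is sketched below. This is not the author's code: it is a minimal, regex-based sketch that only understands the five predefined entities and numeric character references, and it leaves anything it does not recognise untouched. The class name and the helper `resolve` are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of an unescape routine; not taken from the answer above.
final class XmlUnescapeSketch {

    // Matches &lt; &gt; &amp; &quot; &apos; and numeric references like &#233; or &#xE9;
    private static final Pattern REFERENCE =
            Pattern.compile("&(#x[0-9a-fA-F]+|#[0-9]+|lt|gt|amp|quot|apos);");

    static String toUnescaped(String content) {
        if (content == null || content.indexOf('&') < 0) {
            return content; // nothing to unescape
        }
        Matcher m = REFERENCE.matcher(content);
        StringBuffer sb = new StringBuffer(content.length());
        // A single pass means "&amp;lt;" becomes "&lt;" and is not unescaped twice.
        while (m.find()) {
            m.appendReplacement(sb, Matcher.quoteReplacement(resolve(m.group(1))));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    private static String resolve(String name) {
        switch (name) {
            case "lt":   return "<";
            case "gt":   return ">";
            case "amp":  return "&";
            case "quot": return "\"";
            case "apos": return "'";
            default:
                try {
                    int cp = name.startsWith("#x")
                            ? Integer.parseInt(name.substring(2), 16)
                            : Integer.parseInt(name.substring(1), 10);
                    if (Character.isValidCodePoint(cp)) {
                        return new String(Character.toChars(cp));
                    }
                } catch (NumberFormatException e) {
                    // Number too large to parse: fall through and keep the text as-is.
                }
                return "&" + name + ";";
        }
    }
}
```

Doing the replacement in one pass matters: chaining separate replace calls for each entity would turn "&amp;lt;" into "<" instead of "&lt;".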

chaotic3quilibrium
  • 5,661
  • 8
  • 53
  • 86
  • Looks pretty good to me. I do not wish to add another jar to my project for only one method. If you please grant permission, may I copy paste your code in mine? – RuntimeException Jan 23 '14 at 13:15
  • 1
    @SatishMotwani Of course you can take the above code and do with it as you like. It's my understanding that any code published on StackOverflow is assumed to be copyright free (isn't covered as a work in totality). On the flip side, it would be exceedingly difficult for someone to press any sort of copyright claim and expect any sort of outcome for themselves. – chaotic3quilibrium Jan 23 '14 at 14:46
  • 1
    Thanks for permitting :-) I will use it. – RuntimeException Jan 29 '14 at 15:41
  • 1
    You forgot to handle NUL characters. And maybe other things too. – David Balažic Nov 12 '18 at 18:10
  • @DavidBalažic Okay, please explain in more detail what I might have missed. Please read through the code more closely. I handled EVERY SINGLE Unicode character (of the 1,111,998), including the `null` character. Can you explain the definition of the two values, `UNICODE_LOW` and `UNICODE_HIGH`? Please reread the `if` that uses those two values. Notice `null` (`\u0000` which is `(int)0`) doesn't fall between these two values. Read how it becomes properly "escaped" just like ALL Unicode characters existing outside the `UNICODE_LOW` and `UNICODE_HIGH` range, by using the `&#` technique. – chaotic3quilibrium Aug 29 '20 at 16:18
  • 1
    @chaotic3quilibrium NULL is illegal in XML (and some other characters too). Doesn't matter how you encode it. It is illegal. (also: there is really no need to escape Unicode characters, they are nicely supported in XML, except if the XML document has a non-Unicode encoding itself) – David Balažic Aug 29 '20 at 18:14
  • @DavidBalažic Ah. Tysvm for your explanation. I found a reference document that explicitly identifies what you are asserting about Unicode `null`. However, that is literally the ONLY character not allowed. There are many which are strongly discouraged, but NUL is the only one explicitly unreservedly restricted: https://en.wikipedia.org/wiki/Valid_characters_in_XML – chaotic3quilibrium Aug 29 '20 at 22:36
  • @DavidBalažic I have updated the answer to now incorporate the restriction cited in the reference I shared in my last comment. I added a comment and logic to correctly handle this case. Again, tysvm for your precise feedback. – chaotic3quilibrium Aug 29 '20 at 22:46
6

While idealism says use an XML library, IMHO if you have a basic idea of XML then common sense and performance say template it all the way. It's arguably more readable too. Though using the escaping routines of a library is probably a good idea.

Consider this: XML was meant to be written by humans.

Use libraries for generating XML when having your XML as an "object" better models your problem. For example, if pluggable modules participate in the process of building this XML.

Edit: as for how to actually escape XML in templates, use of CDATA or escapeXml(string) from JSTL are two good solutions. escapeXml(string) can be used like this:

<%@taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions"%>

<item>${fn:escapeXml(value)}</item>
Amr Mostafa
  • 23,147
  • 2
  • 29
  • 24
6

The behavior of StringEscapeUtils.escapeXml() has changed from Commons Lang 2.5 to 3.0. It now no longer escapes Unicode characters greater than 0x7f.

This is a good thing: the old method was a bit too eager to escape characters that could just be inserted into a UTF-8 document.

The new escapers to be included in Google Guava 11.0 also seem promising: http://code.google.com/p/guava-libraries/issues/detail?id=799

Jasper Krijgsman
  • 1,018
  • 11
  • 13
  • 1
    Here's Guava's XML escaper: http://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/xml/XmlEscapers.java?r=8124eb561b979c5d4300f5694f8871d1d7a5619e. In general, I've found Guava to be better architected than Apache Commons. – jhclark Jan 30 '12 at 18:00
  • https://google.github.io/guava/releases/23.0/api/docs/com/google/common/xml/XmlEscapers.html – Vadzim Apr 25 '18 at 21:11
5

Note: Your question is about escaping, not encoding. Escaping is using &lt;, etc. to allow the parser to distinguish between "this is an XML command" and "this is some text". Encoding is the stuff you specify in the XML header (UTF-8, ISO-8859-1, etc).
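
To make the distinction concrete, here is a small sketch (class and method names are made up for illustration): escaping rewrites characters that have a meaning in markup, while encoding decides how the resulting characters are stored as bytes. The escaped text is the same regardless of the encoding; only the bytes differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any answer in this thread.
final class EscapeVsEncodeSketch {

    // Escaping: markup-significant characters become entity references.
    // "&" has to be replaced first so the other references are not re-escaped.
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Encoding: the (already escaped) characters are turned into bytes.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    static int utf8Length(String s) {
        return encodedLength(s, StandardCharsets.UTF_8);
    }

    static int latin1Length(String s) {
        return encodedLength(s, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Müller & Söhne", written with Unicode escapes to be source-encoding safe
        String escaped = escapeText("M\u00fcller & S\u00f6hne");
        System.out.println(escaped);
        System.out.println("UTF-8 bytes:      " + utf8Length(escaped));
        System.out.println("ISO-8859-1 bytes: " + latin1Length(escaped));
    }
}
```

Here the escaped string has 18 characters; it takes 20 bytes in UTF-8 (the two umlauts need two bytes each) and 18 bytes in ISO-8859-1.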

First of all, like everyone else said, use an XML library. XML looks simple but the encoding+escaping stuff is dark voodoo (which you'll notice as soon as you encounter umlauts and Japanese and other weird stuff like "full width digits" (&#xFF11; is １)). Keeping XML human readable is a Sisyphean task.

I suggest never to try to be clever about text encoding and escaping in XML. But don't let that stop you from trying; just remember when it bites you (and it will).

That said, if you use only UTF-8, to make things more readable you can consider this strategy:

  • If the text does contain '<', '>' or '&', wrap it in <![CDATA[ ... ]]>
  • If the text doesn't contain these three characters, don't wrap it.

I'm using this in an SQL editor and it allows the developers to cut&paste SQL from a third party SQL tool into the XML without worrying about escaping. This works because the SQL can't contain umlauts in our case, so I'm safe.

Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
4

If you are looking for a library to get the job done, try:

  1. Guava 26.0 documented here

    return XmlEscapers.xmlContentEscaper().escape(text);

    Note: There is also an xmlAttributeEscaper()

  2. Apache Commons Text 1.4 documented here

    StringEscapeUtils.escapeXml11(text)

    Note: There is also an escapeXml10() method

jschnasse
  • 8,526
  • 6
  • 32
  • 72
3

To escape XML characters, the easiest way is to use the Apache Commons Lang project, JAR downloadable from: http://commons.apache.org/lang/

The class is this: org.apache.commons.lang3.StringEscapeUtils;

It has a method named "escapeXml", that will return an appropriately escaped String.

Greg Burdett
  • 191
  • 2
  • 6
  • Update: escapeXml is now deprecated - use escapeXml10. Ref https://commons.apache.org/proper/commons-lang/javadocs/api-3.3/org/apache/commons/lang3/StringEscapeUtils.html#escapeXml(java.lang.String) – Daniel Aug 01 '17 at 03:19
2

You could use the Enterprise Security API (ESAPI) library, which provides methods like encodeForXML and encodeForXMLAttribute. Take a look at the documentation of the Encoder interface; it also contains examples of how to create an instance of DefaultEncoder.

Software Craftsman
  • 2,999
  • 2
  • 31
  • 47
1

Here's an easy solution and it's great for encoding accented characters too!

String in = "Hi Lârry & Môe!";

StringBuilder out = new StringBuilder();
for(int i = 0; i < in.length(); i++) {
    char c = in.charAt(i);
    if(c < 31 || c > 126 || "<>\"'\\&".indexOf(c) >= 0) {
        out.append("&#" + (int) c + ";");
    } else {
        out.append(c);
    }
}

System.out.printf("%s%n", out);

Outputs

Hi L&#226;rry &#38; M&#244;e!
Mike
  • 1,390
  • 1
  • 12
  • 17
  • Shouldn't the "31" in the first line of the "if" be "32"; i.e. less than the space character? And if "31" must remain, then shouldn't it be corrected to read "if (c <= 31 ||..." (additional equals sign following the less than sign)? – chaotic3quilibrium Dec 19 '13 at 22:03
1

Use JAXP and forget about text handling; it will be done for you automatically.
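
As one possible illustration, here is a minimal sketch using the StAX writer that ships with the JDK alongside the other JAXP APIs. The class name, the element name `item` and the attribute name `id` are assumptions made for this example; the raw strings are passed in as-is and the writer is responsible for escaping them.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of any answer in this thread.
final class StaxEscapingSketch {

    // Writes <item id="...">text</item> and returns it as a string.
    static String writeItem(String id, String text) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("item");
            writer.writeAttribute("id", id);   // attribute value is escaped by the writer
            writer.writeCharacters(text);      // text content is escaped by the writer
            writer.writeEndElement();
            writer.close();                    // flushes remaining output to the StringWriter
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeItem("a\"b", "x < y & z"));
    }
}
```

Exactly which characters get escaped (for example whether ">" is written as "&gt;") can vary between StAX implementations, as one of the comments above points out, but the output should be well-formed either way.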

Fernando Miguélez
  • 11,196
  • 6
  • 36
  • 54
0

Try to encode the XML using Apache XML serializer

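import java.io.StringWriter;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;

// doc is an existing org.w3c.dom.Document
// Serialize DOM
OutputFormat format = new OutputFormat(doc);
// as a String
StringWriter stringOut = new StringWriter();
XMLSerializer serial = new XMLSerializer(stringOut, format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());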
Carbine
  • 7,849
  • 4
  • 30
  • 54
0

Just replace

 & with &amp;

And for other characters:

> with &gt;
< with &lt;
" with &quot;
' with &apos;
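
The order of these replacements matters: "&" has to be handled first, otherwise the ampersands introduced by the other replacements get escaped a second time. A minimal sketch (class and method names are made up for this example):

```java
// Hypothetical demo class, not part of any answer in this thread.
final class ReplaceOrderSketch {

    // Correct order: "&" is replaced before anything that introduces new "&" characters.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Wrong order, shown only for comparison: the "&" in "&lt;" is escaped again.
    static String escapeWrongOrder(String s) {
        return s.replace("<", "&lt;")
                .replace("&", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b&c"));          // a&lt;b&amp;c
        System.out.println(escapeWrongOrder("a<b"));  // a&amp;lt;b
    }
}
```

String.replace works on literal text, so no regular-expression quoting is needed here. Like the list above, this does nothing about characters that are not allowed in XML at all.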
raman rayat
  • 404
  • 5
  • 15
0

Here's what I found after searching everywhere looking for a solution:

Get the Jsoup library:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

Then:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities
import org.jsoup.parser.Parser

String xml = '''<?xml version = "1.0"?>
<SOAP-ENV:Envelope
   xmlns:SOAP-ENV = "http://www.w3.org/2001/12/soap-envelope"
   SOAP-ENV:encodingStyle = "http://www.w3.org/2001/12/soap-encoding">

   <SOAP-ENV:Body xmlns:m = "http://www.example.org/quotations">
      <m:GetQuotation>
         <m:QuotationsName> MiscroSoft@G>>gle.com </m:QuotationsName>
      </m:GetQuotation>
   </SOAP-ENV:Body>
</SOAP-ENV:Envelope>'''



Document doc = Jsoup.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), "UTF-8", "", Parser.xmlParser())
doc.outputSettings().charset("UTF-8")
doc.outputSettings().escapeMode(Entities.EscapeMode.base)

println doc.toString()

Hope this helps someone

wizston
  • 31
  • 9
-1

I have created my own wrapper here; I hope it helps. You can modify it depending on your requirements.