29

Hi, I would like to remove all invalid XML characters from a string, using a regular expression with the string.replace method.

For example:

line.replace(regExp,"");

What is the right regExp to use?

An invalid XML character is anything outside these ranges:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

thanks.

yossi
  • 1
    It depends on what you want to replace. What is "invalid XML character"? – khachik Nov 21 '10 at 11:39
  • You are right, I have added the information. – yossi Nov 21 '10 at 11:48
  • Why do you think that characters in that range are invalid for XML? You can use `[^\u0001-\uD7FF\uE000-\uFFFD]` to match 2-byte unicode chars out of the range (needs to be checked, I'm not sure about the syntax). Don't know anything about 24 bit chars, sorry. – khachik Nov 21 '10 at 12:03
  • 1
    found the valid XML characters here: http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-RestrictedChar – yossi Nov 21 '10 at 12:19
  • Neat solution http://stackoverflow.com/a/9635310/489364 – kommradHomer May 20 '13 at 08:42

9 Answers

90

Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars.

Here is the pattern for removing characters that are illegal in XML 1.0:

// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
                    + "\u0009\r\n"
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]";

Most people will want the XML 1.0 version.

Here is the pattern for removing characters that are illegal in XML 1.1:

// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
                    + "\u0001-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]+";

You will need to use String.replaceAll(...) and not String.replace(...), because replace treats its first argument as literal text rather than as a regular expression.

String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, "");
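
For reference, here is a minimal self-contained sketch of the XML 1.0 pattern in use. The class name `Xml10Cleaner` and the method `strip` are made up for illustration; the character class is the one shown above, compiled once so that it is not recompiled on every call.

```java
import java.util.regex.Pattern;

final class Xml10Cleaner {
    // Same negated character class as xml10pattern, compiled once.
    // The surrogate pairs \ud800\udc00 and \udbff\udfff are read by the
    // regex engine as the code points U+10000 and U+10FFFF.
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]");

    // Removes every code point that is not allowed in XML 1.0.
    static String strip(String text) {
        return ILLEGAL_XML10.matcher(text).replaceAll("");
    }
}
```

For example, `Xml10Cleaner.strip("Hello, World!\0")` drops the trailing NUL, while a supplementary character such as U+1D50A passes through unchanged.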
McDowell
  • The link is broken, the right one seems to be: http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html – evgenyl Apr 22 '13 at 10:00
  • 3
    Maybe I am wrong, but these ranges will NOT remove characters like \b (\u0008), and so on. But these chars will also break XML marshalling. Can you also please explain your comment on the answer quoting Mark McLaren's Weblog? Thank you! – evgenyl Apr 22 '13 at 10:02
  • @evgenyl U+0008 is in the range "\u0001-\uD7FF" and will not be replaced - its use is legal in XML. You will have to modify the regular expression if you want to remove text in the [restricted or discouraged ranges](http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets). The problem with Renaud's answer is that it checks char values and not Unicode code points. Jun's answer shows the conversion of UTF-16 code units to code points – McDowell Apr 22 '13 at 10:28
  • Thanks for the clarification. In my case, I am failing on XML serialization (unmarshalling) of SOAP XML that contains '\b' in the body (as an element value, not in an attribute). So I am not sure about it being a "legal XML character". But maybe I'll need to study more. :) – evgenyl Apr 22 '13 at 10:39
  • @evgenyl Thanks for pointing that out. I was referencing the XML 1.1 specification. I've updated the answer for XML 1.0. Note that the code is largely untested so check it before using it. – McDowell Apr 22 '13 at 11:19
  • Sure! Thank you for the reply! Diving into Supplementary Characters in the Java Platform ... – evgenyl Apr 22 '13 at 11:21
  • 3
    The \ud800\udc00-\udbff\udfff syntax was at first very misleading for me, it's just that Java Regex engine interprets that pair as single character, am I right? – Danubian Sailor Feb 04 '14 at 13:50
  • 2
    @ŁukaszL. Correct. The UTF-16 sequence `D800 DC00` is code point U+10000, `DBFF DFFF` is U+10FFFF, and Java's regex engine respects surrogate pairs. – McDowell Feb 04 '14 at 14:03
  • @McDowell so we could say, Java supports surrogate pairs internally, however externally (syntax) they are poorly supported? I'll have to fix my code supporting Unicode... – Danubian Sailor Feb 04 '14 at 14:06
  • Don't these expressions match any character that's NOT an invalid character due to the negation character (^) at the beginning of the range? Am I missing something? – Redtopia Jan 23 '15 at 16:07
  • @Redtopia - the code points expressed in the patterns are the supported code points. The negation matches anything unsupported so it can be removed. – McDowell Jan 23 '15 at 16:09
  • Doh! I thought they were the illegal chars. I guess it's equivalent, but do you think there's any advantage to matching only the illegal chars? ([#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]) - http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-RestrictedChar – Redtopia Jan 23 '15 at 16:19
  • I found your regex used in the product I'm working on. IntelliJ Idea doesn't like the seemingly negative character range and displays an error. I'm glad to see that it's actually not an error. – Vlasec Feb 02 '15 at 17:16
11

All these answers so far only replace the characters themselves. But sometimes an XML document will contain invalid XML entity sequences that result in errors. For example, if you have a character reference such as &#2; in your XML, a Java XML parser will throw Illegal character entity: expansion character (code 0x2 at ....

Here is a simple Java program that can replace those invalid entity sequences.

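  import java.util.HashSet;
  import java.util.Set;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Matches numeric character references: hex digits in group 1, decimal digits in group 2.
  public static final Pattern XML_ENTITY_PATTERN = Pattern.compile("\\&\\#(?:x([0-9a-fA-F]+)|([0-9]+))\\;");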

  /**
   * Remove problematic xml entities from the xml string so that you can parse it with java DOM / SAX libraries.
   */
  String getCleanedXml(String xmlString) {
    Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
    Set<String> replaceSet = new HashSet<>();
    while (m.find()) {
      String group = m.group(1);
      int val;
      if (group != null) {
        val = Integer.parseInt(group, 16);
        if (isInvalidXmlChar(val)) {
          replaceSet.add("&#x" + group + ";");
        }
      } else if ((group = m.group(2)) != null) {
        val = Integer.parseInt(group);
        if (isInvalidXmlChar(val)) {
          replaceSet.add("&#" + group + ";");
        }
      }
    }
    String cleanedXmlString = xmlString;
    for (String replacer : replaceSet) {
      // replace() treats the reference as literal text, so no regex is needed here
      cleanedXmlString = cleanedXmlString.replace(replacer, "");
    }
    return cleanedXmlString;
  }

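  private boolean isInvalidXmlChar(int val) {
    // Valid XML 1.0 chars: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    boolean valid = val == 0x9 || val == 0xA || val == 0xD
            || (val >= 0x20 && val <= 0xD7FF)
            || (val >= 0xE000 && val <= 0xFFFD)
            || (val >= 0x10000 && val <= 0x10FFFF);
    return !valid;
  }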
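
The same idea can also be sketched as a single pass with `Matcher.replaceAll(Function)`, which needs Java 9 or later and avoids collecting a set and re-scanning the string. This is an alternative sketch rather than the code above: `XmlEntityCleaner`, `isValidXml10` and `removeInvalidEntities` are hypothetical names, the validity check follows the XML 1.0 Char production, and treating an unparseably large number as invalid is an assumption of this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlEntityCleaner {
    // Hex digits land in group 1, decimal digits in group 2.
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops numeric character references that point at invalid code points.
    static String removeInvalidEntities(String xml) {
        return ENTITY.matcher(xml).replaceAll(m -> {
            int cp;
            try {
                cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                cp = -1; // assumption: a number too large for an int is never valid
            }
            // Keep valid references verbatim, remove the rest.
            return isValidXml10(cp) ? Matcher.quoteReplacement(m.group()) : "";
        });
    }
}
```

With this sketch, `XmlEntityCleaner.removeInvalidEntities("a&#2;b&#x41;c")` keeps `&#x41;` and removes `&#2;`.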
Nicholas DiPiazza
  • This was indeed the right answer for me. I was converting a JSONObject to XML, which escaped control chars such as "\u0001" into numeric character references. This code removed them perfectly. – Matze.N Jul 29 '21 at 11:09
10

Should we consider surrogate characters? Otherwise '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true, because a char holds only a single 16-bit UTF-16 code unit.

In my tests, the regex approach also seemed slower than the following loop.

if (null == text || text.isEmpty()) {
    return text;
}
final int len = text.length();
char current = 0;
int codePoint = 0;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < len; i++) {
    current = text.charAt(i);
    boolean surrogate = false;
    if (Character.isHighSurrogate(current)
            && i + 1 < len && Character.isLowSurrogate(text.charAt(i + 1))) {
        surrogate = true;
        codePoint = text.codePointAt(i++);
    } else {
        codePoint = current;
    }
    if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
            || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
            || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
            || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
        sb.append(current);
        if (surrogate) {
            sb.append(text.charAt(i));
        }
    }
}
return sb.toString();
Jun
3

Jun's solution, simplified. Using StringBuilder#appendCodePoint(int), I need neither char current nor String#charAt(int). I can tell a surrogate pair by checking whether codePoint is greater than 0xFFFF.

(Strictly speaking, the i++ is not necessary, since a low surrogate on its own wouldn't pass the filter. But if someone re-used the code with a different set of code points, it would fail. I prefer programming to hacking.)

StringBuilder sb = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
    int codePoint = text.codePointAt(i);
    if (codePoint > 0xFFFF) {
        i++;
    }
    if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
            || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
            || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
            || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
        sb.appendCodePoint(codePoint);
    }
}
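
A variant of the loop above, sketched as a complete method: advancing the index by `Character.charCount(codePoint)` does the same job as the `> 0xFFFF` check. The names `XmlCodePointFilter`, `isValidXml10` and `strip` are invented for this example.

```java
final class XmlCodePointFilter {
    // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies only valid code points; unpaired surrogates are dropped
    // because they fall in the excluded D800-DFFF gap.
    static String strip(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint); // 2 for a surrogate pair, else 1
            if (isValidXml10(codePoint)) {
                sb.appendCodePoint(codePoint);
            }
        }
        return sb.toString();
    }
}
```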
Vlasec
  • 1
    I got downvoted apparently. I would like to know why. It could just be someone trolling me, but if there is something wrong about the algorithm, I'd like to know. – Vlasec Oct 03 '16 at 17:07
  • Do you know how to construct a string that contains an invalid Unicode char above the maximum code point 0x10FFFF? 0x10FFFF should correspond to the Java string "\udbff\udfff". I tried to construct the invalid char 0x110000, which should be the Java string "\udbff\ue000", but Java parses this as 2 code points. Therefore the last check (codePoint <= 0x10FFFF) seemingly can't be tested and is useless in real life, as Java never seems to return such a value from `codePointAt()`. – petrsyn Apr 16 '21 at 15:05
2
// xmlData is the input string; a variable cannot be declared in terms of itself
String cleaned = xmlData.codePoints().filter(c -> isValidXMLChar(c)).collect(StringBuilder::new,
                StringBuilder::appendCodePoint, StringBuilder::append).toString();

private boolean isValidXMLChar(int c) {
    if((c == 0x9) ||
       (c == 0xA) ||
       (c == 0xD) ||
       ((c >= 0x20) && (c <= 0xD7FF)) ||
       ((c >= 0xE000) && (c <= 0xFFFD)) ||
       ((c >= 0x10000) && (c <= 0x10FFFF)))
    {
        return true;
    }
    return false;
}
Hans Schreuder
1

From Mark McLaren's Weblog

  /**
   * This method ensures that the output String has only
   * valid XML unicode characters as specified by the
   * XML 1.0 standard. For reference, please see
   * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
   * standard</a>. This method will return an empty
   * String if the input is null or empty.
   *
   * @param in The String whose non-valid characters we want to remove.
   * @return The in String, stripped of non-valid characters.
   */
  public static String stripNonValidXMLCharacters(String in) {
      StringBuffer out = new StringBuffer(); // Used to hold the output.
      char current; // Used to reference the current character.

      if (in == null || ("".equals(in))) return ""; // vacancy test.
      for (int i = 0; i < in.length(); i++) {
          current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
          if ((current == 0x9) ||
              (current == 0xA) ||
              (current == 0xD) ||
              ((current >= 0x20) && (current <= 0xD7FF)) ||
              ((current >= 0xE000) && (current <= 0xFFFD)) ||
              ((current >= 0x10000) && (current <= 0x10FFFF)))
              out.append(current);
      }
      return out.toString();
  }   
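
One thing to check before using this: `current` is a `char`, a single 16-bit UTF-16 code unit, so the `0x10000`-`0x10FFFF` test can never succeed, and each half of a surrogate pair falls in the excluded D800-DFFF gap. A small sketch of the effect follows; the class and method names are illustrative only.

```java
final class CodeUnitVsCodePoint {
    // Per-char test with the same ranges as the method above.
    static boolean validAsChar(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF); // never true for a char
    }

    // The same ranges applied to a full code point.
    static boolean validAsCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

For the string `"\uD835\uDD0A"` (U+1D50A), `validAsChar` rejects both code units, while `validAsCodePoint("\uD835\uDD0A".codePointAt(0))` accepts the character.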
Renaud
  • @McDowell could you elaborate what is not covered and why? It's basically the same range as in Jun's answer, which was not downvoted by you. – Danubian Sailor Feb 03 '14 at 13:25
  • 3
    @ŁukaszL. This code tests UTF-16 code units. Jun's code converts to and tests 32-bit code points. For example, the code point U+1D50A is in the supported range 0x10000-0x10FFFF. It must be represented as a surrogate pair in UTF-16 - e.g. the literal `"\uD835\uDD0A"`. The above algorithm will incorrectly drop anything represented by surrogate pairs. See the code point methods on the [Character](http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html) type. – McDowell Feb 04 '14 at 12:49
  • @McDowell I was using the code above, so please tell me if I've understood this correctly: I should drop the range 0x10000-0x10FFFF from that code. Instead I should check Character.isHighSurrogate(current). If so, I should check whether the next character satisfies Character.isLowSurrogate() and only then add both. "\uD801\uDC00" is a correct Unicode character, while "\uDC00\uD801" is not? – Danubian Sailor Feb 04 '14 at 13:24
  • @ŁukaszL. That will work. See also [here](http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Surrogates). Also, correct, `\uDC00\uD801` is not meaningful data since the pair is backwards - corrupt data. – McDowell Feb 04 '14 at 13:42
  • @McDowell thanks. I've updated my code and made a JUnit test. However, since the question is actually about regex, it's not proper to post here, and it's already similar to Jun's answer. – Danubian Sailor Feb 04 '14 at 13:46
0

From Best way to encode text data for XML in Java?

String xmlEscapeText(String t) {
   StringBuilder sb = new StringBuilder();
   for(int i = 0; i < t.length(); i++){
      char c = t.charAt(i);
      switch(c){
      case '<': sb.append("&lt;"); break;
      case '>': sb.append("&gt;"); break;
      case '\"': sb.append("&quot;"); break;
      case '&': sb.append("&amp;"); break;
      case '\'': sb.append("&apos;"); break;
      default:
         if(c>0x7e) {
            sb.append("&#"+((int)c)+";");
         }else
            sb.append(c);
      }
   }
   return sb.toString();
}
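
Note that the loop above works on `char` values, so a character outside the Basic Multilingual Plane would be written as two separate surrogate references. A code-point-based variant might look like the sketch below. `XmlEscaper` and `escape` are hypothetical names, and like the original it only escapes: it does not remove characters that are illegal in XML.

```java
final class XmlEscaper {
    // Escapes markup characters and writes everything above 0x7e as a
    // decimal character reference, one reference per code point.
    static String escape(String t) {
        StringBuilder sb = new StringBuilder();
        t.codePoints().forEach(cp -> {
            switch (cp) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '&':  sb.append("&amp;");  break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (cp > 0x7e) {
                        sb.append("&#").append(cp).append(';');
                    } else {
                        sb.appendCodePoint(cp);
                    }
            }
        });
        return sb.toString();
    }
}
```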
Roger F. Gay
  • No. How can one state enumerating chars one by one as the best way I don't get it. – jediz Apr 06 '17 at 10:11
  • There is no alternative to checking them one by one. If you use other methods, then those methods must do it; somebody has to. You risk additional overhead if the other method is less efficient. Writing fewer lines in your application isn't the same thing as having the most efficiently running code. – Roger F. Gay Apr 07 '17 at 12:58
0

If you want to store text elements with the forbidden characters in XML-like form, you can use XPL instead. The dev-kit provides concurrent XPL to XML and XML processing - which means no time cost to the translation from XPL to XML. Or, if you don't need the full power of XML (namespaces), you can just use XPL.

Web Page: HLL XPL

Roger F. Gay
-1

I believe that the following articles may help you.

http://commons.apache.org/lang/api-2.1/org/apache/commons/lang/StringEscapeUtils.html

http://www.javapractices.com/topic/TopicAction.do?Id=96

In short, try using StringEscapeUtils from the Jakarta Commons project.

AlexR
  • 12
    I do not see how this helps the original poster - the problem is that there is a range of characters that just cannot be encoded in XML. These must be handled before you attempt to encode your character data. – McDowell Nov 21 '10 at 13:02