10

The XML spec defines a subset of Unicode characters that are allowed in XML documents: http://www.w3.org/TR/REC-xml/#charsets.

How do I filter out the characters that fall outside this subset from a String in Java?

A simple test case:

  Assert.equals("", filterIllegalXML(""+Character.valueOf((char) 2)))
Grzegorz Oledzki
  • Why are you getting these "illegal" XML characters ? What do you want to do with them once you detect them? delete? replace? – Romain Hippeau May 24 '10 at 13:11
  • @RH: ignoring them would be enough. The best solution would be to delete them and get some kind of report. This way I could log a warning. – Grzegorz Oledzki May 24 '10 at 13:15
  • In case anyone wondered I took advantage of `XMLChar` from Xerces, as suggested by ZZ Coder. You can find the whole method here: http://pastebin.com/6Vbm1zuC – Grzegorz Oledzki May 25 '10 at 06:15

7 Answers

6

It's not trivial to find out all the invalid chars for XML. You need to call or reimplement XMLChar.isInvalid() from Xerces:

http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm
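
For XML 1.0 specifically, the reimplementation route can be fairly small, because the spec's Char production is just a handful of ranges. The following is a minimal sketch of that idea, not the Xerces code: the class name `XmlCharFilter` and the helper `isLegalXml10` are made up for illustration, and `filterIllegalXML` is named after the method in the question's test case. It walks the string by code point so that supplementary characters survive and unpaired surrogates are dropped.

```java
final class XmlCharFilter {

    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml10(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns a copy of the input with every illegal code point removed.
    static String filterIllegalXML(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int codePoint = in.codePointAt(i);
            if (isLegalXml10(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

With this sketch, `XmlCharFilter.filterIllegalXML("" + Character.valueOf((char) 2))` returns the empty string, which is what the test case in the question asks for. Reporting the removed characters (for logging a warning) would need an extra counter or list, which is left out here.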

Yishai
ZZ Coder
  • That class is pretty involved [read: hard to understand--for me anyway thanks to its machine generated section], as well as requiring a 64K CHARS array to be instantiated and pre-propagated... – rogerdpack Dec 09 '14 at 21:16
1

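This page includes a Java method for stripping out invalid XML characters by testing whether each character is within spec, though it doesn't check for highly discouraged characters.

Incidentally, escaping the characters is not a solution: a character reference such as `&#2;` must itself refer to a character the spec allows, so XML 1.0 rejects the invalid characters in escaped form too. (XML 1.1 does permit most control characters, U+0001 to U+001F, as character references, but still not U+0000 or the other invalid code points.)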
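
One way to see this for XML 1.0 is to feed an escaped control character to the parser that ships with the JDK. The sketch below is only an illustration: the class name `EscapedControlCharCheck` and the method `parses` are invented for this example, and it relies on the standard `javax.xml.parsers` API. A `DefaultHandler` is installed as the error handler so that the fatal error surfaces as an exception rather than being printed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class EscapedControlCharCheck {

    // Returns true if the document is well-formed, false if the parser rejects it.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and stays silent otherwise.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here `EscapedControlCharCheck.parses("<a>&#2;</a>")` returns false, while `EscapedControlCharCheck.parses("<a>ok</a>")` returns true, so the character has to be removed (or encoded some other way) before the document is written.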

rogerdpack
Stephen C
  • 1
    Link is dead...it looks like maybe this is the new URL? http://benjchristensen.com/2008/02/07/how-to-strip-invalid-xml-characters/ – Michael Jan 27 '12 at 15:05
0

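Loosely based on a comment in the link from Stephen C's answer, and on Wikipedia's summary of the XML 1.1 spec, here's a Java method that uses regular expressions to check whether a string contains only valid XML characters (the same character classes can be passed to replaceAll to remove the offending ones):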

import java.util.regex.Pattern;

boolean isAllValidXmlChars(String s) {
  // xml 1.1 spec http://en.wikipedia.org/wiki/Valid_characters_in_XML
  // matches() tests the whole string, so the class is repeated with "*"
  if (!s.matches("[\\u0001-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]*")) {
    // not in valid ranges
    return false;
  }
  // find() succeeds as soon as any single character falls in the class
  if (Pattern.compile("[\\u0001-\\u0008\\u000b-\\u000c\\u000E-\\u001F\\u007F-\\u0084\\u0086-\\u009F]").matcher(s).find()) {
    // a control character
    return false;
  }

  // "Characters allowed but discouraged"
  if (Pattern.compile(
    "[\\uFDD0-\\uFDEF\\x{1FFFE}-\\x{1FFFF}\\x{2FFFE}-\\x{2FFFF}\\x{3FFFE}-\\x{3FFFF}\\x{4FFFE}-\\x{4FFFF}\\x{5FFFE}-\\x{5FFFF}\\x{6FFFE}-\\x{6FFFF}\\x{7FFFE}-\\x{7FFFF}\\x{8FFFE}-\\x{8FFFF}\\x{9FFFE}-\\x{9FFFF}\\x{AFFFE}-\\x{AFFFF}\\x{BFFFE}-\\x{BFFFF}\\x{CFFFE}-\\x{CFFFF}\\x{DFFFE}-\\x{DFFFF}\\x{EFFFE}-\\x{EFFFF}\\x{FFFFE}-\\x{FFFFF}\\x{10FFFE}-\\x{10FFFF}]"
  ).matcher(s).find()) {
    return false;
  }

  return true;
}
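
To actually remove characters rather than only detect them, a negated character class can be combined with a replace. The following is a rough sketch for XML 1.0 (not 1.1, so control characters other than tab, line feed and carriage return are removed outright); the class name `XmlRegexFilter` and the method `stripIllegalXml10` are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

final class XmlRegexFilter {

    // Everything that is NOT in the XML 1.0 Char production.
    // \x{...} needs Java 7 or later.
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Returns the input with every illegal character deleted.
    static String stripIllegalXml10(String s) {
        return ILLEGAL_XML10.matcher(s).replaceAll("");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.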
Tombart
rogerdpack
0

Using StringEscapeUtils.escapeXml(xml) from commons-lang will escape, not filter, the characters.

jediz
Bozho
  • 2
    I am already using this method to escape entities (e.g. `<` to `&lt;`), but that's something different. The method doesn't seem to filter any illegal characters. It fails for my 'test case'. – Grzegorz Oledzki May 24 '10 at 13:06
  • As stated in question: `assertEquals("", StringEscapeUtils.escapeXml(""+Character.valueOf((char) 2)));` – Grzegorz Oledzki May 24 '10 at 13:14
  • ah, sorry. well, I'm not sure there is a way for this character to get into the xml :) Perhaps commons-lang misses it. Actually - what is your version of commons-lang? – Bozho May 24 '10 at 13:18
  • My project is currently using 2.4, but I've just checked that in 2.5 too. There is no difference. – Grzegorz Oledzki May 24 '10 at 13:43
  • From documentation: `StringEscapeUtils.escapeXml(xml)` Supports only the five basic XML entities (gt, lt, quot, amp, apos). Does not support DTDs or external entities. – jediz Apr 06 '17 at 09:59
  • This function is deprecated. It is replaced by either [escapeXml10](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringEscapeUtils.html#escapeXml10-java.lang.String-) or [escapeXml11](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringEscapeUtils.html#escapeXml11-java.lang.String-). Note that these functions also filter the invalid characters. – stonar96 Dec 27 '19 at 12:31
0

Use either escapeXml10 or escapeXml11. These functions escape characters like ", &, ', <, > and a few more but also filter invalid characters.

For those who don't want to filter invalid characters but escape them with a different escaping system, look at my answer here https://stackoverflow.com/a/59475093/3882565.

stonar96
0

Here's a solution that takes care of the raw char as well as the escaped char in the stream; it works with StAX or SAX. It needs extending for the other invalid chars, but you get the idea:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.io.Writer;

import org.apache.commons.io.IOUtils;
import org.apache.xerces.util.XMLChar;

public class IgnoreIllegalCharactersXmlReader extends Reader {

    private final BufferedReader underlyingReader;
    private StringBuilder buffer = new StringBuilder(4096);
    private boolean eos = false;

    public IgnoreIllegalCharactersXmlReader(final InputStream is) throws UnsupportedEncodingException {
        underlyingReader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    }

    private void fillBuffer() throws IOException {
        final String line = underlyingReader.readLine();
        if (line == null) {
            eos = true;
            return;
        }
        buffer.append(line);
        buffer.append('\n');
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if(buffer.length() == 0 && eos) {
            return -1;
        }
        int satisfied = 0;
        int currentOffset = off;
        while (false == eos && buffer.length() < len) {
            fillBuffer();
        }
        while (satisfied < len && buffer.length() > 0) {
            char ch = buffer.charAt(0);
            final char nextCh = buffer.length() > 1 ? buffer.charAt(1) : '\0';
            if (ch == '&' && nextCh == '#') {
                final StringBuilder entity = new StringBuilder();
                // Since we're reading lines it's safe to assume entity is all
                // on one line so next char will/could be the hex char
                int index = 0;
                char entityCh = '\0';
                // Read whole entity, stopping at the end of the buffer if no ';' follows
                while (entityCh != ';' && index < buffer.length()) {
                    entityCh = buffer.charAt(index++);
                    entity.append(entityCh);
                }
                // if it's bad get rid of it and clean it from the buffer and point to next valid char
                if (entity.toString().equals("&#2;")) {
                    buffer.delete(0, entity.length());
                    continue;
                }
            }
            if (XMLChar.isValid(ch)) {
                satisfied++;
                cbuf[currentOffset++] = ch;
            }
            buffer.deleteCharAt(0);
        }
        return satisfied;
    }

    @Override
    public void close() throws IOException {
        underlyingReader.close();
    }

    public static void main(final String[] args) {
        final File file = new File(
                <XML>); // placeholder: path of the XML file to clean
        final File outFile = new File(file.getParentFile(), file.getName()
                .replace(".xml", ".cleaned.xml"));
        Reader r = null;
        Writer w = null;
        try {
            r = new IgnoreIllegalCharactersXmlReader(new FileInputStream(file));
            w = new OutputStreamWriter(new FileOutputStream(outFile),"UTF-8");
            IOUtils.copyLarge(r, w);
            w.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(r);
            IOUtils.closeQuietly(w);
        }
    }
}
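
As for extending the escaped-character handling beyond `&#2;`, one possible direction is to match every numeric character reference and drop those whose code point is not legal. The sketch below assumes XML 1.0 rules and works on a whole String rather than a stream; `CharRefFilter`, `stripIllegalCharRefs` and `isLegalXml10` are hypothetical names, and the approach is deliberately naive in that it would also touch text inside CDATA sections, where `&#2;` is literal text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharRefFilter {

    // Decimal (&#2;) and hexadecimal (&#x1F;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // XML 1.0 "Char" production.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes references to illegal code points and keeps all other text.
    static String stripIllegalCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp;
            try {
                cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point, treat as illegal
            }
            String replacement = isLegalXml10(cp) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, `CharRefFilter.stripIllegalCharRefs("<a>x&#2;y&#x41;&#x1F;</a>")` returns `<a>xy&#x41;</a>`: the two references to control characters are removed and the legal reference is left alone.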
gomesla
-1

You can use a regular expression to do the work; see an example in the comments here.

rogerdpack
The Student