29

I have an XML file that's the output from a database. I'm using the Java SAX parser to parse the XML and output it in a different format. The XML contains some invalid characters, and the parser is throwing errors like 'Invalid Unicode character (0x5)'.

Is there a good way to strip all these characters out besides pre-processing the file line by line and replacing them? So far I've run into 3 different invalid characters (0x5, 0x6 and 0x7). It's a ~4 GB database dump and we're going to be processing it a bunch of times, so having to wait an extra 30 minutes each time we get a new dump to run a pre-processor on it is going to be a pain, and this isn't the first time I've run into this issue.

Mason
  • Do the characters have any meaning? Presumably they aren't random corruption, so doesn't stripping them remove information? – Bart Schuller Sep 18 '08 at 17:32
  • If the file contains invalid characters, it isn't an XML file. Ask the creators of it to create only well-formed XML in future. I've had this problem a lot in the past. People don't seem to understand that XML needs to be well-formed and not contain rubbish. – MarkR Sep 18 '08 at 15:39
  • I agree 100%. Unfortunately it's not always possible (incompetent tech people, contract wording, etc.) – Mason Sep 18 '08 at 15:41

6 Answers

23

I used Xalan's org.apache.xml.utils.XMLChar class:

public static String stripInvalidXmlCharacters(String input) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (XMLChar.isValid(c)) {
            sb.append(c);
        }
    }

    return sb.toString();
}
Bozho
  • I think this one will not work for surrogate characters: `XMLChar#isValid()` will return false for the high and low parts separately, but would return true if the pair together would be valid. – ankon Feb 24 '15 at 09:42
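
The surrogate concern in the comment above can be addressed by testing whole code points rather than individual `char` values. The following is only a sketch: the class and method names are made up for illustration, and the validity check is written out by hand from the XML 1.0 `Char` production instead of calling Xalan.

```java
// Hypothetical helper (not part of Xalan): strips characters that are not
// allowed in XML 1.0, working on code points so that valid surrogate pairs
// are kept and unpaired surrogates are dropped.
final class XmlCharStripper {

    // XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
    //                          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String strip(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt combines a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as itself and fails the check.
            int cp = input.codePointAt(i);
            if (isXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Calling `XmlCharStripper.strip(text)` returns a copy of `text` without the control characters from the question (0x5, 0x6, 0x7), while characters outside the BMP survive intact.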
10

I haven't used this personally but Atlassian made a command line XML cleaner that may suit your needs (it was made mainly for JIRA but XML is XML):

Download atlassian-xml-cleaner-0.1.jar

Open a DOS console or shell, and locate the XML or ZIP backup file on your computer, here assumed to be called data.xml

Run: java -jar atlassian-xml-cleaner-0.1.jar data.xml > data-clean.xml

This will write a copy of data.xml to data-clean.xml, with invalid characters removed.

18Rabbit
  • Is the link broken for anyone else? – But I'm Not A Wrapper Class Aug 26 '13 at 20:48
  • @CyberneticTwerkGuruOrc It is. Here's another link I found for it: https://confluence.atlassian.com/download/attachments/12079/atlassian-xml-cleaner-0.1.jar?version=1&modificationDate=1307570821061&api=v2 – cyroxx Mar 14 '14 at 14:02
  • If building an add-on for marketplace, the same class that replaces invalid characters is available on com.atlassian.core.util.xml.XMLCleaningReader – Vitor Pelizza Aug 03 '16 at 17:25
  • Message from the future (2020) - the second link worked for me and this JAR solved a severe problem I had with thousands of XML files that contained various illegal characters. Running them through this utility standardized them and made them parseable by Python's lxml library. The future thanks you. – lonstar Sep 14 '20 at 16:55
8

I use the following regexp, which seems to work as expected on JDK 6:

Pattern INVALID_XML_CHARS = Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]");
...
INVALID_XML_CHARS.matcher(stringToCleanup).replaceAll("");

In JDK7 it might be possible to use the notation \x{10000}-\x{10FFFF} for the last range that lies outside of the BMP instead of the \uD800\uDC00-\uDBFF\uDFFF notation that is not as simple to understand.
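
For what it's worth, here is a sketch of what the JDK 7+ form could look like, using the `\x{h...h}` code point escape that `java.util.regex.Pattern` supports from Java 7 on. The class name and the `strip` helper are illustrative only, and I have written every bound in the `\x{...}` style just for consistency.

```java
import java.util.regex.Pattern;

// Illustrative wrapper; requires Java 7 or later for the \x{h...h} escape.
final class InvalidXmlChars {

    // Matches anything outside the XML 1.0 Char production.
    static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x{9}\\x{A}\\x{D}"
            + "\\x{20}-\\x{D7FF}"
            + "\\x{E000}-\\x{FFFD}"
            + "\\x{10000}-\\x{10FFFF}]");

    static String strip(String stringToCleanup) {
        return INVALID_XML_CHARS.matcher(stringToCleanup).replaceAll("");
    }
}
```

The supplementary range is now spelled as plain code points, so there is no need to reason about surrogate pairs inside the character class.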

ogrisel
3

I have a similar problem when parsing the content of an Australian export tariff file into an XML document. I cannot use the solutions suggested here, such as invoking an external tool (a jar) from the command line or asking Australian Customs to clean up the source file.

The only method to solve this problem at the moment is to iterate through the entire content of the source file, character by character, and test whether each character falls outside the ASCII range 0x00 to 0x1F inclusive. It can be done, but I was wondering if there is a better way using Java methods for type String.

EDIT I found a solution that may be useful to others: use the Java method String#replaceAll to replace or remove any undesirable characters in the XML document.

Example code (I removed some necessary statements to avoid clutter):

BufferedReader reader = null;
...
String line = reader.readLine().replaceAll("[\\x00-\\x1F]", "");

In this example I remove (i.e. replace with an empty string) non-printable characters within the range 0x00 to 0x1F inclusive. You can change the second argument of #replaceAll() to replace the characters with whatever string your application requires.
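
One thing to be aware of: the range 0x00 to 0x1F also contains tab (0x09), line feed (0x0A) and carriage return (0x0D), which are legal in XML. If those need to survive, a narrower character class can be used. This is a sketch under that assumption; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

// Illustrative helper: removes C0 control characters but keeps the three
// that XML 1.0 allows (tab 0x09, line feed 0x0A, carriage return 0x0D).
final class ControlCharStripper {

    private static final Pattern ILLEGAL_CONTROLS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    static String strip(String line) {
        return ILLEGAL_CONTROLS.matcher(line).replaceAll("");
    }
}
```

Precompiling the pattern once also avoids recompiling the regex for every line, which `String#replaceAll` does on each call.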

RealHowTo
jankar
0

Is it possible your invalid characters are present only within the values and not the tags themselves i.e. the XML notionally meets the schema but the values have not been properly sanitized? If so, what about overriding InputStream to create a CleansingInputStream that replaces your invalid characters with their XML equivalents?
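
A rough sketch of that idea, done as a `Reader` wrapper rather than an `InputStream` so that it works on decoded characters. The name `XmlCleansingReader` is made up here. Note that it drops the offending characters instead of replacing them: XML 1.0 has no legal escape for control characters such as 0x5, since character references to them are not allowed either. Surrogates are passed through without checking that they are correctly paired, to keep the example short.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical filtering reader: removes characters that XML 1.0 forbids
// while the parser is reading, so no separate pre-processing pass is needed.
class XmlCleansingReader extends FilterReader {

    XmlCleansingReader(Reader in) {
        super(in);
    }

    // Simplification: surrogate code units are accepted as-is.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the allowed characters to the front of the chunk.
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i];
                }
            }
            if (kept > off) {
                return kept - off;
            }
            // The whole chunk was invalid characters: read another one.
        }
    }
}
```

It could then be handed to the SAX parser through `new InputSource(new XmlCleansingReader(new InputStreamReader(stream, StandardCharsets.UTF_8)))`, assuming the dump really is UTF-8; passing a `Reader` means the parser no longer detects the encoding from the XML declaration itself.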

scotty
0

Your problem does not concern XML: it concerns character encodings. What it comes down to is that every string, be it XML or otherwise, consists of bytes and you cannot know what characters these bytes represent, unless you are told what character encoding the string has. If, for instance, the supplier tells you it's UTF-8 and it's actually something else, you are bound to run into problems. In the best case, everything works, but some bytes are translated into 'wrong' characters. In the worst case you get errors like the one you encountered.

Actually, your problem is even worse: your string contains byte sequences that do not represent characters in any character encoding. There is no text-handling tool, let alone an XML parser, that can help you here. This needs cleaning up at the byte level.

Confusion