An invalid XML character (Unicode: 0xc) was found

Question

Parsing an XML file using the Java DOM parser results in:

[Fatal Error] os__flag_8c.xml:103:135: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

score 49 · Accepted Answer · edited May 23 '17 at 12:17

There are a few characters that are disallowed in XML documents, even when you encapsulate the data in CDATA blocks.

If you generated the document yourself, you will need to strip these characters out (entity-encoding them does not help). If you received an erroneous document, you should strip these characters before trying to parse it.

See dolmen's answer in this thread: Invalid Characters in XML

He links to this section of the specification: http://www.w3.org/TR/xml/#charsets

Basically, all characters below 0x20 are disallowed, except 0x9 (TAB), 0xA (LF) and 0xD (CR).
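
To find out where the offending character sits before deciding how to strip it, the character rule can be turned into a small check. This is only a minimal sketch for XML 1.0; the names `XmlCharCheck`, `isAllowed` and `firstInvalidIndex` are made up for illustration and do not come from any library.

```java
public class XmlCharCheck {

    // True if the code point matches the XML 1.0 "Char" production.
    public static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the char index of the first disallowed code point, or -1 if there is none.
    public static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isAllowed(cp)) return i;
            i += Character.charCount(cp); // step over both halves of a surrogate pair
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample = "<a>page\fbreak</a>"; // \f is the form feed, 0xC
        int idx = firstInvalidIndex(sample);
        if (idx >= 0) {
            System.out.printf("invalid character 0x%X at index %d%n",
                    (int) sample.charAt(idx), idx);
        }
    }
}
```

Iterating by code point rather than by `char` matters here, because characters above 0xFFFF are stored as surrogate pairs and would otherwise be misreported as invalid.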

answered Apr 21 '11 at 10:07

jishi

+1 - basically, the OP's problem is that the XML file he is trying to parse is invalid. – Stephen C Apr 21 '11 at 10:26
entity encoding won't work; the value simply isn't allowed in XML text – Anon Apr 21 '11 at 11:12
On UTF-8, the complete list of unallowed chars are these 5 hexa intervals: `0..8`, `B..C`, `E..1F`, `D800..DFFF`, `FFFE..FFFF` – Topera Nov 29 '21 at 13:44
@Topera are the ranges inclusive? – Tommaso Thea Aug 21 '23 at 20:06

score 21 · Answer 2 · answered Nov 12 '15 at 16:46

public String stripNonValidXMLCharacters(String in) {
    if (in == null || in.isEmpty()) return ""; // vacancy test.

    StringBuffer out = new StringBuffer(in.length()); // Used to hold the output.
    // Walk the string by code point: a char never exceeds 0xFFFF, so the
    // 0x10000-0x10FFFF test below can only match on full code points.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i); // The current code point.
        i += Character.charCount(current); // 2 for a surrogate pair, otherwise 1.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.appendCodePoint(current);
    }
    return out.toString();
}

answered Nov 12 '15 at 16:46

Dima

If you could write a regex-based solution, that would be robust and fast – Mubasher Jul 19 '16 at 13:08
regex is generally slower, the above code would be faster since it only does this one thing – Sarel Botha May 16 '18 at 15:14
Now instead of `StringBuffer` use `StringBuilder` because it is faster (does not require an Object monitor/is unsynchronized). – michaeak Nov 05 '20 at 14:56

score 8 · Answer 3 · edited May 07 '16 at 07:10

This error occurs whenever an invalid XML character appears in the document. If you open the file in Notepad++, such characters show up as VT, SOH, FF and the like. I am using XML version 1.0, and I validate text data before storing it in the database with this pattern:

Pattern p = Pattern.compile("[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u10000-\u10FFF]+"); 
retunContent = p.matcher(retunContent).replaceAll("");

This ensures that no invalid characters end up in the XML.

edited May 07 '16 at 07:10

SkyWalker

answered Dec 31 '14 at 10:33

Komal

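The pattern you provide has the right idea, but it does not compile as written: the backslashes need to be doubled in Java source. The supplementary range also has to be written with `\\x{...}`, since `\\u` takes exactly four hex digits. The corrected call is `Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]+")` – k.liakos May 28 '17 at 15:34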
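
A regex-based filter along the lines discussed in this answer could be sketched as follows. The class and method names (`XmlRegexSanitizer`, `sanitize`) are made up for illustration. The point to watch is the supplementary range: a four-digit unicode escape cannot express code points above 0xFFFF, so the sketch assumes Java 7 or later and uses the `\x{...}` form instead.

```java
import java.util.regex.Pattern;

public class XmlRegexSanitizer {

    // Negated class of everything XML 1.0 allows. The x{...} form is used for
    // code points above 0xFFFF, which a four-digit unicode escape cannot express.
    private static final Pattern INVALID_XML_10 = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]+");

    // Removes every run of characters that XML 1.0 does not allow.
    public static String sanitize(String text) {
        return INVALID_XML_10.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // A form feed (0xC) followed by a supplementary character (a surrogate pair).
        String dirty = "form\ffeed \uD83D\uDE00";
        System.out.println(sanitize(dirty));
    }
}
```

Compiling the pattern once into a constant avoids recompiling it on every call, which is where much of the cost of a regex solution usually comes from.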

score 6 · Answer 4 · answered Apr 21 '11 at 11:09

The character 0x0C is invalid in XML 1.0, but XML 1.1 permits it when it is written as a character reference (`&#xC;`). So unless the XML file specifies the version as 1.1 in the prolog and escapes the character that way, it is simply invalid and you should complain to the producer of this file.

answered Apr 21 '11 at 11:09

Jörn Horstmann

score 3 · Answer 5 · answered Jun 15 '17 at 13:25

You can filter all 'invalid' chars with a custom FilterReader class:

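import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
// JDK-internal class (found in Java 1.8); it may be inaccessible on newer JDKs.
import com.sun.org.apache.xml.internal.utils.XMLChar;

public class InvalidXmlCharacterFilter extends FilterReader {

    public InvalidXmlCharacterFilter(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        // Route single-character reads through the filtering bulk read below.
        char[] one = new char[1];
        return read(one, 0, 1) == -1 ? -1 : one[0];
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int read = super.read(cbuf, off, len);
        if (read == -1) return read;

        for (int i = off; i < off + read; i++) {
            // Surrogates are left alone so that valid pairs (characters above 0xFFFF) survive.
            if (!Character.isSurrogate(cbuf[i]) && !XMLChar.isValid(cbuf[i])) cbuf[i] = '?';
        }
        return read;
    }
}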

And run it like this:

InputStream fileStream = new FileInputStream(xmlFile);
Reader reader = new BufferedReader(new InputStreamReader(fileStream, charset));
InvalidXmlCharacterFilter filter = new InvalidXmlCharacterFilter(reader);
InputSource is = new InputSource(filter);
xmlReader.parse(is);

Hi Vadim, your idea is great. What is the source of XMLChar? — Iliya Kuznetsov, Feb 20 '20 at 19:02
I found XMLChar on com.sun.org.apache.xml.internal.utils.XMLChar (inside Java 1.8) — Topera, Nov 26 '21 at 20:20
java poi how to ignore these invalid characters? `Workbook workbook = new XSSFWorkbook(fileLocation);` — zhuguowei, Jan 26 '22 at 07:50

Topera · Answer 6 · 2021-11-29T16:47:54.997

For XML 1.0, all code points in these ranges (hexadecimal, inclusive) are not allowed:

0..8
B..C
E..1F
D800..DFFF
FFFE..FFFF

A regex that can remove them is:

text.replaceAll("[\\x{0}-\\x{8}]|[\\x{B}-\\x{C}]|[\\x{E}-\\x{1F}]|[\\x{D800}-\\x{DFFF}]|[\\x{FFFE}-\\x{FFFF}]", "")

Note: if you are working with XML 1.1, these intervals must not appear literally either (there they are permitted only as character references):

7F..84
86..9F

Refs:

XML 1.0 chars: https://www.w3.org/TR/xml/#charsets
XML 1.1 chars: https://www.w3.org/TR/xml11/#charsets

Martin Husted Hartvig · Answer 7 · 2023-06-28T17:30:55.597

I just used this project, and found it very handy: https://github.com/rwitzel/streamflyer

Using the InvalidXmlCharacterModifier, as the documentation says.

Like this example:

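public String stripNonValidXMLCharacters(final String in) throws IOException {

  // An empty replacement string removes each invalid character.
  final Modifier modifier = new InvalidXmlCharacterModifier("",
    InvalidXmlCharacterModifier.XML_10_VERSION);

  final ModifyingReader modifyingReader = 
         new ModifyingReader(new StringReader(in), modifier);

  // IOUtils is presumably Apache Commons IO; it drains the reader into a String.
  return IOUtils.toString(modifyingReader);
}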

edited Jun 28 '23 at 17:30

answered Jun 23 '23 at 12:02

Martin Husted Hartvig

[Link only answers](https://meta.stackexchange.com/questions/8231/are-answers-that-just-contain-links-elsewhere-really-good-answers/8259#8259) are considered very low quality and [can get deleted](https://stackoverflow.com/help/deleted-answers), please put the important parts from the linked resource into the answer body. – helvete Jun 27 '23 at 16:39
Hmm... just added an example @helvete – Martin Husted Hartvig Jun 28 '23 at 17:32

score 0 · Answer 8 · answered Feb 14 '14 at 08:46

I faced a similar issue where the XML contained control characters. After looking into the code, I found that a deprecated class, StringBufferInputStream, was used for reading string content.

http://docs.oracle.com/javase/7/docs/api/java/io/StringBufferInputStream.html

This class does not properly convert characters into bytes. As of JDK 1.1, the preferred way to create a stream from a string is via the StringReader class.

I changed it to ByteArrayInputStream and it worked fine.
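
A minimal sketch of that replacement, assuming the XML is held in a String and should be handed to the DOM parser as UTF-8 bytes. The class and method names (`ParseFromString`, `parse`) are made up for illustration, and the original code base may have looked quite different.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class ParseFromString {

    // Encodes the string explicitly as UTF-8 and parses the resulting bytes.
    public static Document parse(String xml) {
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?><greeting>h\u00e9llo</greeting>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Naming the charset explicitly keeps the bytes consistent with the encoding declared in the prolog. `StringBufferInputStream` instead keeps only the low eight bits of each character.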

score 0 · Answer 9 · answered Dec 13 '17 at 16:29

For people who read a byte array into a String and then try to convert it to an object with JAXB: you can decode the byte array as "iso-8859-1" when creating the String, like this:

String JAXBallowedString = new String(input, "iso-8859-1"); // input is the byte[] that was read

This maps each byte to a single character, which JAXB can handle. Obviously this solution is only suitable for parsing the XML.

score 0 · Answer 10 · answered Jun 05 '19 at 15:26

All of these answers seem to assume that the user is generating the bad XML, rather than receiving it from gSOAP, which should know better!

answered Jun 05 '19 at 15:26

Jerry Miller

Then again, it could be a memory access issue that corrupts the content. – Jerry Miller Jun 05 '19 at 15:28

score 0 · Answer 11 · answered Jan 29 '20 at 10:55

Today, I've got a similar error:

Servlet.service() for servlet [remoting] in context with path [/***] threw exception [Request processing failed; nested exception is java.lang.RuntimeException: buildDocument failed.] with root cause org.xml.sax.SAXParseException; lineNumber: 19; columnNumber: 91; An invalid XML character (Unicode: 0xc) was found in the value of attribute "text" and element is "label".

After my first encounter with the error, I re-typed the entire line by hand, so that there was no way for a special character to creep in, and Notepad++ didn't show any non-printable characters (black on white). Nevertheless, I got the same error over and over.

When I looked at what I had done differently from my predecessors, it turned out to be one additional space just before the closing /> (which I had heard was recommended for older parsers, but which shouldn't make any difference anyway according to the XML standard):

<label text="this label's text" layout="cell 0 0, align left" />

When I removed the space:

<label text="this label's text" layout="cell 0 0, align left"/>

everything worked just fine.

So it's definitely a misleading error message.
