
I have this XML:

<?xml version="1.0" encoding="UTF-8" ?>
            <rss xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="http://wordpress.org/export/1.2/" version="2.0">
                <channel>
                    <wp:wxr_version>1.2</wp:wxr_version>
            <item>
                        <title type="html">
                        <![CDATA[ <h1 class="title">“Title with special character”</h1> ]]>
                        </title>
                        <content:encoded type="html">
                        <![CDATA[ <div class="content clearfix">
            <p>Content Example Text</p>
        </div> ]]>
                        </content:encoded>
                        <wp:post_id>0</wp:post_id>
                        <wp:post_date>2000-09-30T10:22:00.001Z</wp:post_date>           
                    </item>
                </channel>
            </rss>

Inside the HTML title tag there is the Unicode character U+0007.

Why is the XML invalid?

I'm using CDATA; isn't that supposed to make it valid?

How can I detect which characters are invalid and remove them before constructing the XML?

Laiacy
  • Has your question been answered? If so, please [**accept**](https://meta.stackexchange.com/q/5234/234215) the answer that you've found to be most helpful. Thank you. – kjhughes Jul 24 '20 at 02:44

1 Answer


Let's be clear that we're talking about whether the XML is well-formed rather than invalid.

U+0007 is a control character (BEL), used in the past to cause a terminal to beep. It's not allowed in XML, even within CDATA. If it's in the data, then the data is not XML. Your options are to remove it or encode it so that it's not directly in the data (and so that recipients will understand how to decode it); one encoding option would be Base64 for the contents of any element that has to be able to represent such illegal characters.
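As a minimal sketch of the "remove it" option, the following Python snippet finds and strips characters that fall outside the XML 1.0 `Char` production before the text is placed into a document. The function names (`find_invalid_xml_chars`, `strip_invalid_xml_chars`) and the sample `title` string are hypothetical, chosen only for illustration; the character ranges are the ones given in the XML 1.0 specification.

```python
import re

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything matched by this pattern may not appear literally in XML 1.0,
# not even inside a CDATA section.
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (index, 'U+XXXX') pairs for each character XML 1.0 disallows."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in _INVALID_XML10.finditer(text)
    ]


def strip_invalid_xml_chars(text):
    """Return text with every XML 1.0-disallowed character removed."""
    return _INVALID_XML10.sub("", text)


# Hypothetical title containing a BEL (U+0007) character.
title = "\u201cTitle with\x07 special character\u201d"
print(find_invalid_xml_chars(title))   # reports where the BEL sits
print(strip_invalid_xml_chars(title))  # same title without the BEL
```

Stripping loses information. If the control characters have to survive the round trip, encoding the element content (for example with Base64, as mentioned above) is the alternative, provided the recipient knows to decode it.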


XML 1.0 vs 1.1

Michael Kay helpfully commented that XML 1.1 allows additional characters, including U+0007 (&#x07;), beyond those allowed in XML 1.0.

For example, consider the following document¹:

<?xml version="1.0" encoding="UTF-8" ?>
<r>
  <e1></e1>  <!-- e1 contains a literal U+0007 char -->
  <e2>&#x07;</e2>  <!-- &#x07; becomes a U+0007 char -->
  <e3><![CDATA[]]></e3>  <!-- e3 CDATA contains a literal U+0007 char -->
  <e4><![CDATA[&#x07;]]></e4>  <!-- &#x07; remains an uninterpreted string -->
</r>

With an XML 1.0 version setting in the XML declaration:

  • U+0007 characters within e1, e2, and e3 prevent the XML from being well-formed.

With an XML 1.1 version setting in the XML declaration:

  • U+0007 characters within only e1 and e3 prevent the XML from being well-formed.
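The XML 1.0 case can be checked with a short Python sketch. It assumes the standard library's `xml.etree.ElementTree` parser, which is built on Expat and handles XML 1.0 only, so it does not demonstrate the XML 1.1 behavior. The helper name `is_well_formed` and the `cases` dictionary are hypothetical, and each element is parsed as its own small document rather than as part of `r`.

```python
import xml.etree.ElementTree as ET

BEL = "\x07"  # U+0007

# One tiny document per element from the example above.
cases = {
    "e1": f"<e1>{BEL}</e1>",             # literal U+0007 in element content
    "e2": "<e2>&#x07;</e2>",             # character reference to U+0007
    "e3": f"<e3><![CDATA[{BEL}]]></e3>",  # literal U+0007 inside CDATA
    "e4": "<e4><![CDATA[&#x07;]]></e4>",  # uninterpreted 6-character string
}


def is_well_formed(xml_text):
    """Return True if the XML 1.0 parser accepts xml_text, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


results = {name: is_well_formed(xml) for name, xml in cases.items()}
for name, ok in results.items():
    print(name, "well-formed" if ok else "not well-formed")
```

Only `e4` is accepted, because the text inside its CDATA section is the literal string `&#x07;` and contains no control character.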

¹ Note that the question source (viewable via the edit link on the question) does indeed contain literal U+0007 characters where noted even though the formatted XML does not.
kjhughes
  • A caveat: this character is allowed in XML 1.1, but it must be escaped, e.g. as `&#x07;`. I don't recall whether it is allowed unescaped in a CDATA section. To use this, however, the XML declaration must change to say version="1.1", and if you do that then many XML parsers will reject the document, especially those from companies like Microsoft that haven't tracked the W3C standards. – Michael Kay Jul 19 '20 at 17:19
  • @MichaelKay: Good point! Answer updated. Thank you. – kjhughes Jul 19 '20 at 18:10
  • In `<![CDATA[&#x07;]]>`, I think the string value of the text node is the 6-character string `&#x07;` rather than a single-character string containing BEL. – Michael Kay Jul 20 '20 at 08:53
  • @MichaelKay: Yes, per the semantics of [CDATA sections](https://www.w3.org/TR/xml/#sec-cdata-sect), entities are not interpreted, regardless of XML 1.0 vs 1.1. Therefore, `e4` should not have been listed as preventing XML 1.0 from being well-formed. Corrected. Thank you. – kjhughes Jul 20 '20 at 13:28