
I have XML (generated elsewhere, so I have no control over it) that contains nasty nested CDATA, for example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE prc SYSTEM "prc.dtd">
<body>
  <![CDATA[Towards Automatic Generation blabla
<definition> 
   <query><![CDATA[ <root[AByS]> <sc methodName="get_NYT.ARTICLES" serviceURL="http://www.nytimes.com/srv/"> 
  <params> <param name="subjectP" value="{ subjectP }"> </> </> </> <sc methodName="get_WP.ARTICLES" 
   serviceURL="http://www.wpost.com/srv/"> <params> <param name="subjectP" value="{ subjectP }"> </> </> 
   </> </>; ]]></query> </definition> </serviceDefinition> (b) Figure 7. (a) The query for Web service 
]]>
</body>

lxml (Python) bombs with

XMLSyntaxError: Opening and ending tag mismatch: body line 3 and query, line 9, column 28

because it thinks the first ]]> ends the CDATA, where in reality it only ends the inner CDATA and the following tag, </query>, is still within the outer CDATA and shouldn't be parsed.

What is a good way to parse such XML? Meaning I want everything inside CDATA to remain as unparsed data, even if it contains more CDATA inside. Write my own parser? Ideas?

jasso
user124114

2 Answers


Since nesting CDATA sections makes the document not well-formed XML, you cannot use any XML tools on it.

You need to use a text parser that can handle nested structures, so it needs a counter or stack support. This rules out simple regex solutions. If the CDATA sections are balanced, the task is somewhat comparable to handling nested parentheses.

A way to unfold nested CDATA sections is to make them sequential CDATA sections.

Some pseudocode:

counter = 0 or stack is empty
when found "<![CDATA[" string
    if counter != 0 or stack not empty
        replace "<![CDATA[" with "]]><![CDATA["
    increase counter or push to stack
when found "]]>" string
    decrease counter or pop stack
    if counter != 0 or stack not empty
        replace "]]>" with "]]><![CDATA["

Ideally you could use this as an input stream reader that could pipe the output to your XML parser.
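
For what it's worth, the pseudocode above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the CDATA markers are balanced, the whole document fits in memory as a string, and the names `flatten_nested_cdata` and `_TOKEN` are made up for illustration. Note that the inner `<![CDATA[` and `]]>` markers themselves are dropped from the data; only the text between them survives. The demo uses the standard library's `xml.etree.ElementTree` rather than lxml, just to keep it dependency-free.

```python
import re
import xml.etree.ElementTree as ET

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"
# Matches either marker; the regex only tokenizes, the nesting is tracked by a counter.
_TOKEN = re.compile(r"<!\[CDATA\[|\]\]>")


def flatten_nested_cdata(text: str) -> str:
    """Turn nested CDATA sections into sequential ones (hypothetical helper)."""
    depth = 0
    out = []
    pos = 0
    for m in _TOKEN.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        if m.group() == CDATA_OPEN:
            # A nested opener closes the current section and starts a new one.
            out.append(CDATA_OPEN if depth == 0 else CDATA_CLOSE + CDATA_OPEN)
            depth += 1
        elif depth == 0:
            # Stray "]]>" outside any CDATA section: leave it untouched.
            out.append(CDATA_CLOSE)
        else:
            depth -= 1
            # An inner closer ends one section and reopens the enclosing one.
            out.append(CDATA_CLOSE if depth == 0 else CDATA_CLOSE + CDATA_OPEN)
    out.append(text[pos:])
    return "".join(out)


sample = "<body><![CDATA[a <query><![CDATA[ <x> ]]></query> b]]></body>"
fixed = flatten_nested_cdata(sample)
print(fixed)
print(ET.fromstring(fixed).text)
```

The regex here only finds the two marker strings; the nesting itself is handled by the `depth` counter, which is the part a plain regex substitution cannot do.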

jasso

Nested CDATA is not legal, so this is not well-formed XML.

CDATA sections may not contain "]]>". The proper way to escape it in XML is to split it across two CDATA sections, like this: "]]]]><![CDATA[>"

See this question for more detail
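
To illustrate how that escaping works on the producing side, here is a small Python sketch. It assumes the usual technique of replacing every "]]>" in the payload with "]]]]><![CDATA[>", so that the "]]" ends up at the end of one CDATA section and the ">" at the start of the next. The helper name `to_cdata` is hypothetical, and the round trip is checked with the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" across two sections (hypothetical helper)."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# The payload itself contains CDATA markers, including a "]]>".
payload = "<query><![CDATA[ <root> ]]></query>"
doc = "<body>" + to_cdata(payload) + "</body>"

print(doc)
# The parser joins the adjacent CDATA sections back into the original text.
print(ET.fromstring(doc).text == payload)
```

An inner "<![CDATA[" needs no escaping at all, since it is ordinary text inside a CDATA section; only "]]>" has to be split. This does not by itself repair a document that was already written with nested CDATA, it only shows what the generator should have emitted.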

Community
Stanley De Boer
  • Hmm, I don't see how such escaping could work... nor how that information would help me parse the malformed XML, even if it worked. – user124114 Feb 19 '13 at 21:38