I have XML (generated elsewhere, so I have no control over it) that contains nasty nested CDATA sections, for example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE prc SYSTEM "prc.dtd">
<body>
<![CDATA[Towards Automatic Generation blabla
<definition>
<query><![CDATA[ <root[AByS]> <sc methodName="get_NYT.ARTICLES" serviceURL="http://www.nytimes.com/srv/">
<params> <param name="subjectP" value="{ subjectP }"> </> </> </> <sc methodName="get_WP.ARTICLES"
serviceURL="http://www.wpost.com/srv/"> <params> <param name="subjectP" value="{ subjectP }"> </> </>
</> </>; ]]></query> </definition> </serviceDefinition> (b) Figure 7. (a) The query for Web service
]]>
</body>
lxml (Python) bombs with:
XMLSyntaxError: Opening and ending tag mismatch: body line 3 and query, line 9, column 28
because it thinks the first ]]> ends the CDATA, whereas in reality it only ends the inner CDATA; the following tag, </query>, is still within the outer CDATA and shouldn't be parsed.
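The behaviour can be reproduced without lxml. The following is a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree` in place of lxml and a made-up one-line document (`doc` and `parsed` are illustrative names only). It shows that the first `]]>` closes the section, so the parser then trips over the stray closing tag:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the real file: an outer CDATA
# section whose payload itself contains "<![CDATA[ ... ]]>".
doc = "<body><![CDATA[outer <query><![CDATA[inner]]></query> tail]]></body>"

try:
    ET.fromstring(doc)
    parsed = True
except ET.ParseError as exc:
    # The section ended at the first "]]>", so "</query>" is treated as
    # real markup and does not match the open <body> element.
    parsed = False
    print("parse failed:", exc)
```

The exact wording of the error differs between parsers, but the cause is the same: CDATA sections do not nest.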
What is a good way to parse such XML? I want everything inside a CDATA section to remain unparsed data, even if it contains more CDATA inside. Should I write my own parser? Any ideas?
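One possible direction, offered only as a rough sketch and not a tested solution for the real file: pre-process the text before handing it to the parser, counting `<![CDATA[` and `]]>` markers to find each outermost section and replacing it with escaped character data. The function name `flatten_nested_cdata` is hypothetical, the example uses the standard-library `xml.etree.ElementTree` instead of lxml, and it assumes the markers are always balanced.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Matches either a CDATA opener or a CDATA terminator.
_TOKEN = re.compile(r"<!\[CDATA\[|\]\]>")

def flatten_nested_cdata(text):
    """Replace each outermost CDATA section with escaped character data.

    Assumption: every "<![CDATA[" has a matching "]]>" (balanced nesting).
    """
    out = []
    depth = 0
    pos = 0  # start of the text not yet copied to `out`
    for m in _TOKEN.finditer(text):
        if m.group() == "<![CDATA[":
            if depth == 0:
                out.append(text[pos:m.start()])  # markup before the section
                pos = m.end()                    # payload starts after the opener
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                # Outermost section closed: emit its payload with &, <, > escaped.
                out.append(escape(text[pos:m.start()]))
                pos = m.end()
    out.append(text[pos:])  # whatever follows the last section
    return "".join(out)

# Hypothetical, shortened stand-in for the real file.
doc = "<body><![CDATA[outer <query><![CDATA[inner]]></query> tail]]></body>"
root = ET.fromstring(flatten_nested_cdata(doc))
print(root.text)
```

Because the payload is escaped rather than re-wrapped in CDATA, the inner markers survive as literal text in `root.text`. If the markers are not balanced in the real data, this approach would need a different rule for deciding where the outer section ends.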