
I am trying to use xmltodict to manipulate XML content as a Python object, but I am having trouble handling CDATA properly. I think I am missing something. This is my code:

import xmltodict

data = """<node1>
    <node2 id='test'><![CDATA[test]]></node2>
    <node3 id='test'>test</node3>
</node1>"""

data = xmltodict.parse(data, force_cdata=True, encoding='utf-8')
print(data)

print(xmltodict.unparse(data, pretty=True))

And this is the output:

OrderedDict([(u'node1', OrderedDict([(u'node2', OrderedDict([(u'@id', u'test'), ('#text', u'test')])), (u'node3', OrderedDict([(u'@id', u'test'), ('#text', u'test')]))]))])
<?xml version="1.0" encoding="utf-8"?>
<node1>
        <node2 id="test">test</node2>
        <node3 id="test">test</node3>
</node1>

The CDATA section is missing from the generated node2, so node2 and node3 come out identical even though they differ in the input.

Regards

hzrari

2 Answers


To be clear, there is no officially supported way to keep the CDATA section.

You can check the issue here.

Given that, you have to do it yourself. There are two approaches.

First, let's create some helper functions:

def cdata(s):
    return '<![CDATA[' + s + ']]>'

def preprocessor(key, value):
    '''Unnecessary if you've manually wrapped the values. For example,

    xmltodict.unparse({
        'node1': {'node2': '<![CDATA[test]]>', 'node3': 'test'}
    })
    '''

    if key in KEEP_CDATA_SECTION:
        if isinstance(value, dict) and '#text' in value:
            value['#text'] = cdata(value['#text'])
        else:
            value = cdata(value)
    return key, value

  1. Unescaping the escaped XML

import xmltodict
from xml.sax.saxutils import unescape

KEEP_CDATA_SECTION = ['node2']

out_xml = xmltodict.unparse(data, preprocessor=preprocessor)
out_xml = unescape(out_xml) # not safe !

Do not try this on untrusted data, because this approach unescapes not only the character data but also the nodes' attributes.
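To make the warning concrete, here is a minimal sketch using only the standard library (no xmltodict). The attribute value 'a<b' and the helper is_well_formed are made up for illustration. It assumes Python 3 and mimics what a serializer emits by escaping both the attribute and the text, then unescapes the whole document.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import escape, quoteattr, unescape

# Mimic serializer output: the attribute and the text are both escaped.
attr_value = 'a<b'            # hypothetical attribute value containing '<'
text = '<![CDATA[test]]>'     # text already wrapped by a cdata() helper
escaped_doc = '<node2 id=%s>%s</node2>' % (quoteattr(attr_value), escape(text))

# Unescaping the whole document restores the CDATA markers,
# but it also puts a raw '<' back into the attribute value.
restored_doc = unescape(escaped_doc)


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts the document."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


print(escaped_doc)
print(restored_doc)
print(is_well_formed(escaped_doc), is_well_formed(restored_doc))
```

The escaped document parses, while the unescaped one is rejected because of the raw '<' inside the id attribute.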

  2. Subclassing XMLGenerator

To alleviate the safety problem of unescape(), we can remove the escape() call in XMLGenerator, so there is no need to unescape the XML afterwards.

class XMLGenerator(xmltodict.XMLGenerator):
    def characters(self, content):
        if content:
            self._finish_pending_start_element()
            self._write(content) # also not safe, but better !

xmltodict.XMLGenerator = XMLGenerator

This is not a hack: it changes no behavior of xmltodict other than unparse(). More importantly, it does not pollute the standard library module xml.

For fans of one-liners:

xmltodict.XMLGenerator.characters = xmltodict.XMLGenerator.ignorableWhitespace # now, it is a hack !

Going further, you can wrap the character data directly in XMLGenerator, as follows.

class XMLGenerator(xmltodict.XMLGenerator):
    def characters(self, content):
        if content:
            self._finish_pending_start_element()
            self._write(cdata(content))

From now on, every node that has character data will be wrapped in a CDATA section.
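If only some nodes should get a CDATA section, one possible variation is to track the current element inside the generator. The sketch below is not part of xmltodict. The class name SelectiveCDATAGenerator and the cdata_elements set are hypothetical, and it assumes Python 3, where XMLGenerator has the private _finish_pending_start_element() and _write() helpers used above. This cdata() also splits any literal "]]>" in the text, which would otherwise end the section early. The generator is driven by hand here so the example runs with the standard library alone.

```python
import io
from xml.sax.saxutils import XMLGenerator


def cdata(s):
    # A literal "]]>" would close the section early, so split it across two sections.
    return '<![CDATA[' + s.replace(']]>', ']]]]><![CDATA[>') + ']]>'


class SelectiveCDATAGenerator(XMLGenerator):
    # Hypothetical setting: element names whose text is written as CDATA.
    cdata_elements = {'node2'}

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._element_stack = []  # names of the currently open elements

    def startElement(self, name, attrs):
        self._element_stack.append(name)
        super().startElement(name, attrs)

    def endElement(self, name):
        super().endElement(name)
        self._element_stack.pop()

    def characters(self, content):
        inside = bool(self._element_stack) and self._element_stack[-1] in self.cdata_elements
        if content and inside:
            self._finish_pending_start_element()  # private CPython helper
            self._write(cdata(content))
        else:
            super().characters(content)  # default path still escapes


# Drive the generator by hand, mirroring the document from the question.
out = io.StringIO()
gen = SelectiveCDATAGenerator(out, 'utf-8')
gen.startElement('node1', {})
gen.startElement('node2', {'id': 'test'})
gen.characters('a < b')
gen.endElement('node2')
gen.startElement('node3', {'id': 'test'})
gen.characters('a < b')
gen.endElement('node3')
gen.endElement('node1')
result = out.getvalue()
print(result)
```

Here node2 keeps its text inside a CDATA section while node3 is escaped as usual. Presumably the class could be installed with xmltodict.XMLGenerator = SelectiveCDATAGenerator, in the same way as the subclass above, but I have not verified that against every xmltodict version.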

hyouka

I finally got it working with the monkey-patch below. I am still not very happy with it: it really is a hack, and this feature should be supported properly somewhere:

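import xml.sax.saxutils
import xmltodict

# Keep a reference to the original function before patching it.
escape_orig = xml.sax.saxutils.escape

def escape_hacked(data, entities={}):
    # Wrap text that looks like markup in CDATA instead of escaping it.
    if data[0] == '<' and  data.strip()[-1] == '>':
        return '<![CDATA[%s]]>' % data

    return escape_orig(data, entities)


xml.sax.saxutils.escape = escape_hacked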

Then run your Python code as usual:

data = """<node1>
    <node2 id='test'><![CDATA[test]]></node2>
    <node3 id='test'>test</node3>
</node1>"""

data = xmltodict.parse(data, force_cdata=True, encoding='utf-8')
print(data)

print(xmltodict.unparse(data, pretty=True))

To explain: the following lines check whether the text looks like markup (it starts with < and ends with >) and, if so, wrap it in a CDATA section:

    if data[0] == '<' and  data.strip()[-1] == '>':
        return '<![CDATA[%s]]>' % data
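The effect of this check can be tried in isolation. Below is a minimal standalone sketch, assuming Python 3. escape_or_cdata is a hypothetical name for the same logic without the monkey-patch, and the sample strings are made up for illustration.

```python
from xml.sax.saxutils import escape


def escape_or_cdata(data):
    # Same check as above: text that starts with '<' and ends with '>'
    # is wrapped in CDATA, and everything else is escaped as usual.
    stripped = data.strip()
    if stripped.startswith('<') and stripped.endswith('>'):
        return '<![CDATA[%s]]>' % data
    return escape(data)


samples = ['<b>bold</b>', 'test', 'a < b']
results = [escape_or_cdata(s) for s in samples]
for sample, result in zip(samples, results):
    print(repr(sample), '->', repr(result))
```

Note that plain text such as 'test' is passed through unchanged, so only values that look like markup end up in a CDATA section.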

Regards

hzrari