How do you use Python's xml library to parse the character &?

Question

I am running the following code, but the result hides the strings following the & character. Is there a way to walk through the children in the XML and return the proper text?

import xml.etree.ElementTree as ET
file="/home/pi/bin/test/test_xml3.xml"
parser = ET.XMLParser(encoding="ascii")

root = ET.parse(file)

for elements in root.iter('kiddy'): #iterate through each element
    print(elements.text)

This is the example file causing the issue. Specifically, the result strips out the quot; and amp; strings:

<root>
<kiddy> shghsgdh &amp; sdjhgsjhsjdh &amp; sjhsjhdsjdh </kiddy>
<kiddy> xxxx &amp; xxxxx &amp; xxxxx </kiddy>
</root>

The output, as you can see, is missing the amp; string:

shghsgdh & sdjhgsjhsjdh & sjhsjhdsjdh
xxxx & xxxxx & xxxxx

Hi, I have added the output. As you can see, the output is missing the 'amp'. I also tried using elements.text.encode('ascii'), but no change. Do you have any ideas? – Chris Atkinson, Oct 06 '14 at 22:21
Take out the `encoding=ascii` from your code. It doesn't help with the escaped entities, and it will give you problems if there are accented characters or any other non-ascii symbols in your text. – alexis, Oct 07 '14 at 11:19

Yoel · Accepted Answer · 2014-10-07T09:42:52.860

amp; is missing from your output because of the following rule from the XML specification:

The ampersand character (&) and the left angle bracket (<) may appear in their literal form only when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.

Therefore, when the parser encounters &amp;, it parses it as a single &.
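
As a quick check of the decoding described above, here is a minimal sketch, assuming Python 3 and an inline string instead of the file from the question. The names XML_SOURCE and kiddy_texts are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline document standing in for the question's XML file.
XML_SOURCE = "<root><kiddy> xxxx &amp; xxxxx &amp; xxxxx </kiddy></root>"


def kiddy_texts(xml_source):
    """Return the text of every <kiddy> element as the parser reports it."""
    root = ET.fromstring(xml_source)
    return [element.text for element in root.iter("kiddy")]


# Each &amp; in the source comes back as a single & character.
print(kiddy_texts(XML_SOURCE))  # [' xxxx & xxxxx & xxxxx ']
```

The parser is behaving as the specification requires here; nothing is lost, the entity has simply been decoded.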

If you're truly interested in the original string, I suggest you escape the relevant section via a CDATA section (a CDATA section starts with <![CDATA[ and ends with ]]>), as follows:

<root>
<kiddy> shghsgdh ; sdjhgsjhsjdh ;  sjhsjhdsjdh </kiddy>
<kiddy name="All Shows" thumb="special://home/addons/plugin.video.plexbmc/resources/plex.png"><![CDATA[ActivateWindow(10025,&quot;plugin://plugin.video.plexbmc/?mode=0&amp;url=http%3a%2f%2f192.168.0.1%3a32400%2flibrary%2fsections%2f2%2fall&quot;,return)]]></kiddy>
</root>

This is a link to a concise read on the subject.

To illustrate this better, I'll show you how this should look with your updated example (for completeness' sake I've added another line that includes the string &quot;):

<root>
<kiddy><![CDATA[ shghsgdh &amp; sdjhgsjhsjdh &amp; sjhsjhdsjdh ]]></kiddy>
<kiddy><![CDATA[ xxxx &amp; xxxxx &amp; xxxxx ]]></kiddy>
<kiddy><![CDATA[ xxxx &quot; xxxxx &quot; xxxxx ]]></kiddy>
</root>
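
To see what the parser does with CDATA sections like these, here is a small sketch, assuming Python 3 and ElementTree. CDATA_SOURCE and cdata_texts are hypothetical names, and the document is a shortened inline version of the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline document; the entity-like strings sit inside CDATA.
CDATA_SOURCE = (
    "<root>"
    "<kiddy><![CDATA[ xxxx &amp; xxxxx ]]></kiddy>"
    "<kiddy><![CDATA[ xxxx &quot; xxxxx ]]></kiddy>"
    "</root>"
)


def cdata_texts(xml_source):
    """Return the text of every <kiddy> element."""
    root = ET.fromstring(xml_source)
    return [element.text for element in root.iter("kiddy")]


# Inside CDATA nothing is decoded, so the strings arrive verbatim.
print(cdata_texts(CDATA_SOURCE))  # [' xxxx &amp; xxxxx ', ' xxxx &quot; xxxxx ']
```

Note that the CDATA markers themselves do not appear in element.text; only their content does.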

Alternatively, you can escape just the particular & characters you want to by adding the string amp; after each one, thus creating the escaped string &amp;, which is parsed to &. This can safely be followed by your original string (amp; or quot;) without fear of it being parsed as an entity, as it is not prefixed by a literal & character. I hope an example clarifies this (imagine how each &amp; is parsed to the character &):

<root>
<kiddy> shghsgdh &amp;amp; sdjhgsjhsjdh &amp;amp; sjhsjhdsjdh </kiddy>
<kiddy> xxxx &amp;amp; xxxxx &amp;amp; xxxxx </kiddy>
<kiddy> xxxx &amp;quot; xxxxx &amp;quot; xxxxx </kiddy>
</root>
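
The same check for the double-escaped form, again only a sketch under the assumption of Python 3 and ElementTree. DOUBLE_ESCAPED and double_escaped_texts are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline document in which every & has been escaped once more.
DOUBLE_ESCAPED = (
    "<root>"
    "<kiddy> xxxx &amp;amp; xxxxx </kiddy>"
    "<kiddy> xxxx &amp;quot; xxxxx </kiddy>"
    "</root>"
)


def double_escaped_texts(xml_source):
    """Return the text of every <kiddy> element."""
    root = ET.fromstring(xml_source)
    return [element.text for element in root.iter("kiddy")]


# Only the leading &amp; is decoded; the trailing amp; or quot; is plain text.
print(double_escaped_texts(DOUBLE_ESCAPED))  # [' xxxx &amp; xxxxx ', ' xxxx &quot; xxxxx ']
```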

score 1 · Answer 2 · edited May 23 '17 at 11:57

It's not that amp; is missing; it's that &amp; is the XML representation of &, and it's being decoded for you. If you generate XML with ElementTree, the reverse will happen, so there's nothing to worry about: just work with the decoded text.
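
A small sketch of that reverse direction, assuming Python 3 (where ET.tostring accepts encoding="unicode" and returns a str). The element and its text are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Build an element whose text holds a plain, already decoded ampersand.
element = ET.Element("kiddy")
element.text = "xxxx & xxxxx"

# Serializing escapes the ampersand again, with no manual work.
serialized = ET.tostring(element, encoding="unicode")
print(serialized)  # <kiddy>xxxx &amp; xxxxx</kiddy>
```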

But if you really need to see the XML entities in your strings for some reason, you can always edit them back in:

import re
# Escape & first; otherwise the & in each new &quot; would be escaped again.
text = re.sub(r"&", r"&amp;", text)
text = re.sub(r'"', r"&quot;", text)

Edit: If you really want to re-escape the XML entities, it would be better form to use a library function, perhaps xml.dom.minidom as described here. But I can't think of any good reason you'd need to do this; you can't even use the escaped strings if you use a library to generate XML, because the library will escape the escapes. What ElementTree gives you is ASCII (or unicode, but this has nothing to do with the entity escaping), and you should work with that.
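
One possible library-based sketch, assuming Python 3. It uses xml.sax.saxutils.escape from the standard library, which is not the xml.dom.minidom route referred to above but serves the same purpose. The function name reescape and the sample string are hypothetical.

```python
from xml.sax.saxutils import escape


def reescape(text):
    """Re-encode decoded text into its XML-escaped form."""
    # escape() always handles &, < and >; the extra mapping adds the
    # double quote, which it leaves alone by default.
    return escape(text, {'"': "&quot;"})


print(reescape('a & "b"'))  # a &amp; &quot;b&quot;
```

Because escape() replaces & before applying the extra mapping, the & inside each generated &quot; is not escaped a second time.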

answered Oct 06 '14 at 23:11 by alexis; edited May 23 '17 at 11:57 by Community

Thank you so much, this makes perfect sense, I was thinking there was a method of adjusting the output into ASCII which would solve the problem. That would have been graceful, but I see now that simply performing a replace as per your code above will work equally well. Thanks – Chris Atkinson Oct 07 '14 at 06:51

To reiterate, what you get _is_ ascii. What you want is the original XML-encoded text. If you really do want to restore the escaped entities, search your document to see what other entities it might encode and use a python library function to re-encode them. – alexis Oct 07 '14 at 09:36
Note that while this works in most cases, it might perform some redundant replaces if `&` is used as a markup delimiter, or within a comment, a processing instruction, or a `CDATA` section, as [`&` doesn't have any special meaning in such cases](http://www.w3.org/TR/2000/REC-xml-20001006#syntax), so a more sophisticated replace function might be in order. Anyway, this answer earned my up-vote nonetheless, since that doesn't seem to be an issue in this case, but mainly because of the first paragraph, which is missing from [my answer](http://stackoverflow.com/a/26226043/3903832). – Yoel Oct 07 '14 at 09:37
@Yoel, thanks. Adding a library-based solution when I get a chance. I'm still not sure what the use case _really_ is here, though. – alexis Oct 07 '14 at 09:40
