retrieve data from xml tag Python

Question

I am trying to retrieve the slide number inside the `a:t` element whose parent `a:fld` has type="slidenum", using the code below, but something is not working. The expected result is 1.

Here's the XML:

<a:p><a:fld id="{55FBEE69-CA5C-45C8-BA74-481781281731}" type="slidenum">
<a:rPr lang="en-US" sz="1300" i="0"><a:solidFill><a:srgbClr val="000000"/>
</a:solidFill></a:rPr><a:pPr/><a:t>1</a:t></a:fld><a:endParaRPr lang="en-US" 
sz="1300" i="0"><a:solidFill><a:srgbClr val="000000"/></a:solidFill>
</a:endParaRPr></a:p></p:txBody></p:sp>

Here's my code

    z = zipfile.ZipFile(pptx_filename)
    for name in z.namelist():
        m = re.match(r'ppt/notesSlides/notesSlide\d+\.xml', name)
        if m is not None:
            f = z.open(name)
            tree = ET.parse(f)
            f.close()
            root = tree.getroot()
            # Find the slide number.
            slide_num = None
            for fld in root.findall('/'.join(['.', '', p.txBody, a.p, a.fld])):
                if fld.get('type', '') == 'slidenum':
                    slide_num = int(fld.find(a.t).text)
                    print(slide_num)

Could you edit the question to include the XML? I think that would help us a lot :) It's hard to read in the comments. — Jerfov2, Jun 30 '15 at 01:59
`a:` implies that these elements are in an XML namespace. You probably need to include the namespace when searching for these tags. If you're unsure how to do that, you should check out this answer: http://stackoverflow.com/a/14853417/849425 — , Jun 30 '15 at 02:20
Following up on my previous comment: the XML shown above is actually invalid, as it does not define the `a` namespace. Also, your opening and closing tags do not match. — , Jun 30 '15 at 02:24

score 0 · Answer 1 · answered Jun 30 '15 at 02:30

I would remove the namespace prefixes from your XML before parsing. Then use the XPath expression fld[@type='slidenum']/t to find every fld element whose type attribute is 'slidenum' and select its child t element. Here's an example to show how this might work:

from lxml import etree

xml = """
<a:p><a:fld id="{55FBEE69-CA5C-45C8-BA74-481781281731}" type="slidenum">
<a:rPr lang="en-US" sz="1300" i="0"><a:solidFill><a:srgbClr val="000000"/>
</a:solidFill></a:rPr><a:pPr/><a:t>1</a:t></a:fld><a:endParaRPr lang="en-US" 
sz="1300" i="0"><a:solidFill><a:srgbClr val="000000"/></a:solidFill>
</a:endParaRPr></a:p>
"""

tree = etree.fromstring(xml.replace('a:',''))
slidenum = tree.find("fld[@type='slidenum']/t").text
print(slidenum)  # prints: 1

XML namespaces are typically defined to remove ambiguity in element names. Removing them could have unintended consequences depending on the structure of the document. I'm assuming the XML shown by the OP is a snippet of a larger document - in part because it's malformed (which to me implies it was copy-and-pasted incorrectly) and also because it appears to be a PowerPoint slide deck in XML format. (Microsoft Office's XML formats are notoriously verbose.) — , Jun 30 '15 at 02:33
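
To make the concern about stripping prefixes concrete, here is a minimal sketch. It assumes a made-up snippet whose text happens to contain the characters `a:`, and `strip_namespaces` is a hypothetical helper, not part of any library. A blanket `str.replace` rewrites the text content as well as the tag names, whereas dropping the namespace after parsing touches only the tags. Even the second approach can still merge elements from different namespaces that share a local name.

```python
from xml.etree import ElementTree as ET

# Made-up snippet: the text "data: 1" contains the characters "a:".
xml = (
    '<a:p xmlns:a="http://example.com/a">'
    '<a:fld type="slidenum"><a:t>data: 1</a:t></a:fld>'
    '</a:p>'
)

# Approach 1: blanket text replacement before parsing.
# This also rewrites "a:" inside the element text.
naive = xml.replace('a:', '')
naive_text = ET.fromstring(naive).find("fld[@type='slidenum']/t").text


def strip_namespaces(root):
    """Hypothetical helper: drop the '{uri}' part of every tag in place."""
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith('{'):
            el.tag = el.tag.split('}', 1)[1]
    return root


# Approach 2: parse first, then strip namespaces from tag names only.
root = strip_namespaces(ET.fromstring(xml))
safe_text = root.find("fld[@type='slidenum']/t").text

print(repr(naive_text))  # text was altered by the replacement
print(repr(safe_text))   # text is preserved
```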

score 0 · Accepted Answer · 2015-06-30T04:07:14.907

Modified from Moxymoo's answer to use namespaces instead of removing them:

# ElementTree picks up its C accelerator automatically on Python 3.3+;
# the separate cElementTree module was removed in Python 3.9.
from xml.etree import ElementTree as etree

# Our test XML
xml = '''
<a:p xmlns:a="http://example.com"><a:fld id="{55FBEE69-CA5C-45C8-BA74-481781281731}" type="slidenum">
<a:rPr lang="en-US" sz="1300" i="0"><a:solidFill><a:srgbClr val="000000"/>
</a:solidFill></a:rPr><a:pPr/><a:t>1</a:t></a:fld><a:endParaRPr lang="en-US" 
sz="1300" i="0"><a:solidFill><a:srgbClr val="000000"/></a:solidFill>
</a:endParaRPr></a:p>
'''

# Manually specify the namespace. The prefix letter ("a") is arbitrary.
namespaces = {"a":"http://example.com"}

# Parse the XML string
tree = etree.fromstring(xml)

"""
Breaking down the search expression below
  a:fld - Find the fld element prefixed with namespace identifier a:
  [@type='slidenum'] - Match on an attribute type with a value of 'slidenum'
  /a:t - Find the child element t prefixed with namespace identifier a:
"""
slidenums = tree.findall("a:fld[@type='slidenum']/a:t", namespaces)
for slidenum in slidenums:
    print(slidenum.text)
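
As a small illustration of why the prefix is arbitrary, the sketch below (using the same placeholder URI `http://example.com`, which is an assumption and not a real Office namespace) runs the search three ways: with the `a` prefix, with a different prefix `d` mapped to the same URI, and with the fully qualified `{uri}tag` form that ElementTree uses internally. All three should match the same element, because only the URI identifies the namespace.

```python
from xml.etree import ElementTree as ET

# Placeholder namespace URI (an assumption for this sketch).
URI = "http://example.com"

xml = (
    '<a:p xmlns:a="http://example.com">'
    '<a:fld type="slidenum"><a:t>1</a:t></a:fld>'
    '</a:p>'
)
root = ET.fromstring(xml)

# Internally the tag is stored with the URI, not the prefix.
print(root.tag)

# 1. Prefix "a" mapped to the URI.
with_a = root.findall("a:fld[@type='slidenum']/a:t", {"a": URI})

# 2. A different prefix mapped to the same URI finds the same element.
with_d = root.findall("d:fld[@type='slidenum']/d:t", {"d": URI})

# 3. No mapping at all: spell out the URI in braces.
with_uri = root.findall("{%s}fld[@type='slidenum']/{%s}t" % (URI, URI))

for matches in (with_a, with_d, with_uri):
    print([t.text for t in matches])
```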

Here's the same example reading from an external file, with the namespace URI the OP supplied in the comments below:

from xml.etree import ElementTree as etree

tree = etree.parse("my_xml_file.xml")
namespaces = {"a":"http://schemas.openxmlformats.org/presentationml/2006/main"}
slidenums = tree.findall("a:fld[@type='slidenum']/a:t", namespaces)
for slidenum in slidenums:
    print(slidenum.text)
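
If a search like this comes back empty on a complete notes slide file, two things are worth checking. This is a sketch under assumptions: the sample document is a hand-written stand-in for a real notes slide, not a file taken from the question. First, a path without a leading `.//` only matches direct children of the root, and in a full file `a:fld` is nested several levels down. Second, the prefix must map to the URI declared for `xmlns:a` in that file. In Office Open XML the `a` prefix conventionally refers to the DrawingML namespace, while the `presentationml` URI belongs to `p`.

```python
from xml.etree import ElementTree as ET

# Namespace URIs used by Office Open XML for the "a" and "p" prefixes.
NS = {
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
}

# Hand-written stand-in for a notes slide (an assumption, heavily trimmed).
sample = """<p:notes
    xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
    xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
  <p:cSld><p:spTree><p:sp><p:txBody>
    <a:p><a:fld type="slidenum"><a:t>1</a:t></a:fld></a:p>
  </p:txBody></p:sp></p:spTree></p:cSld>
</p:notes>"""

root = ET.fromstring(sample)

# Only looks at direct children of the root, so nothing is found.
direct = root.findall("a:fld[@type='slidenum']/a:t", NS)

# Correct depth, but "a" mapped to the presentationml URI: nothing is found.
wrong_ns = root.findall(".//a:fld[@type='slidenum']/a:t", {"a": NS["p"]})

# ".//" searches all descendants, and "a" maps to the DrawingML URI.
deep = root.findall(".//a:fld[@type='slidenum']/a:t", NS)
slide_numbers = [int(t.text) for t in deep]

print(len(direct), len(wrong_ns), slide_numbers)
```

With a parsed file, the same path can be passed to `tree.findall(...)` or `tree.getroot().findall(...)`.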

edited Jun 30 '15 at 04:07

answered Jun 30 '15 at 02:47

Hey Mike! Thank you for your reply! The XML I posted is just a snippet, and the code doesn't work when I use the whole file (`tree = parse(file)`). How do I use your code after parsing the file? – eleanor massy Jun 30 '15 at 03:19
@eleanormassy I put in a fake namespace URL because it's not obvious from the XML example you gave what the real namespace URL is. You probably need to change that URL to the one in your XML file. (You'll see it defined as an attribute `xmlns:a=""`.) – Jun 30 '15 at 03:20
Yes, I got that part and I changed the URL to the one in my file! How do I use your code after `tree = parse(file)`? Thanks – eleanor massy Jun 30 '15 at 03:23
The correct statement to parse a file with cElementTree in my example is `tree = etree.parse(file)`. Is that what you were asking? – Jun 30 '15 at 03:28
Not exactly. This is my new code. It returns None. It works with the snippet but not the whole XML file: `tree = ET.parse(f)` `root = tree.getroot()` `str = ET.tostring(root)` `namespaces = {"a":"http://schemas.openxmlformats.org/presentationml/2006/main"}` `tree = etree.fromstring(str)` `slidenums = tree.findall("a:fld[@type='slidenum']/a:t", namespaces)` `for slidenum in slidenums:` `print(slidenum.text)` – eleanor massy Jun 30 '15 at 03:40
@eleanormassy - Updated the answer to provide an example of loading from an external file. – Jun 30 '15 at 04:08
Still not working :/ I don't understand why. It accesses the file and everything. – eleanor massy Jun 30 '15 at 04:27
Here's a good trick for troubleshooting: under the line that starts `slidenums...` put `import pdb; pdb.set_trace()`. The next time you run the program Python will pause at that line and let you issue commands to inspect variables and execute functions. A `print slidenums` should tell you if there are any items that were found in the `slidenums` variable. – Jun 30 '15 at 04:50
Thanks for your help ! :) – eleanor massy Jun 30 '15 at 18:43
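
Another troubleshooting aid, alongside the debugger, is to list the namespaces a file actually declares before writing the search. The sketch below assumes a tiny archive built in memory with placeholder URIs (`http://example.com/p` and `http://example.com/a` are made up for illustration), and uses the `start-ns` event of `ElementTree.iterparse` to collect each prefix and URI from the matching zip members.

```python
import io
import re
import zipfile
from xml.etree import ElementTree as ET

# Build a tiny stand-in archive in memory (hypothetical content, not a real deck).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr(
        "ppt/notesSlides/notesSlide1.xml",
        '<p:notes xmlns:p="http://example.com/p" xmlns:a="http://example.com/a">'
        '<a:p><a:fld type="slidenum"><a:t>1</a:t></a:fld></a:p></p:notes>',
    )

# Collect every prefix -> URI declaration found in the notes slide members.
declared = {}
with zipfile.ZipFile(buf) as z:
    for name in z.namelist():
        if re.match(r"ppt/notesSlides/notesSlide\d+\.xml$", name):
            with z.open(name) as f:
                # "start-ns" events yield (prefix, uri) pairs as they are declared.
                for _event, (prefix, uri) in ET.iterparse(f, events=("start-ns",)):
                    declared[prefix] = uri

print(declared)
```

The resulting dictionary can be passed directly as the `namespaces` argument of `findall`, which avoids typing the URIs by hand.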
