Python parse XML has multiple root

Question

I failed to parse an XML file (it is a GC history log). A sample of the XML is shown below.

<?xml version="1.0" ?>

<verbosegc xmlns="http://www.ibm.com/j9/verbosegc" version="R28_jvm.28_20150612_0201_B252774_CMPRSS">

<initialized id="1" timestamp="2015-12-04T20:17:07.219">
  <attribute name="gcPolicy" value="-Xgcpolicy:gencon" />
  <attribute name="maxHeapSize" value="0x20000000" />
  <attribute name="initialHeapSize" value="0x400000" />
</initialized>

<cycle-start id="4" type="scavenge" contextid="0" timestamp="2015-12-04T20:17:10.677" intervalms="3457.977" />
<gc-start id="5" type="scavenge" contextid="4" timestamp="2015-12-04T20:17:10.677">
  <mem-info id="6" free="3037768" total="4194304" percent="72">
  </mem-info>
</gc-start>
<gc-end id="8" type="scavenge" contextid="4" durationms="0.807" usertimems="0.000" systemtimems="0.000" timestamp="2015-12-04T20:17:10.678" activeThreads="2">
  <mem-info id="9" free="3163968" total="4194304" percent="75">
  </mem-info>
</gc-end>
<cycle-end id="10" type="scavenge" contextid="4" timestamp="2015-12-04T20:17:10.678" />
<cycle-start id="16" type="scavenge" contextid="0" timestamp="2015-12-04T20:17:10.742" intervalms="64.838" />
<gc-start id="17" type="scavenge" contextid="16" timestamp="2015-12-04T20:17:10.742">
  <mem-info id="18" free="3037664" total="4194304" percent="72">
  </mem-info>
</gc-start>
 <gc-end id="20" type="scavenge" contextid="16" durationms="0.649" usertimems="0.000" systemtimems="0.000" timestamp="2015-12-04T20:17:10.743" activeThreads="2">
  <mem-info id="21" free="3110592" total="4194304" percent="74">
  </mem-info>
</gc-end>
<cycle-end id="22" type="scavenge" contextid="16" timestamp="2015-12-04T20:17:10.743" />
<allocation-satisfied id="23" threadId="0000000002E10500" bytesRequested="416" />


</verbosegc>

I want to extract the free attribute of the mem-info elements inside gc-start and gc-end, both of which are enclosed by cycle-start and cycle-end tags and share the same contextid. For example, the first two mem-info values are 3037768 and 3163968; the corresponding contextid is 4, which equals the cycle-start id. With this data, I can plot the memory footprint.

The main problem is that I could not parse the XML successfully with the method in XML parse python. getroot() works, but every other find/findall call returns nothing. Is there another solution for this? Thanks.

Here is what I tried:

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('gc.trace')
>>> tree
<xml.etree.ElementTree.ElementTree object at 0x7fdfaddc19d0>
>>> root = tree.getroot()
>>> root
<Element '{http://www.ibm.com/j9/verbosegc}verbosegc' at 0x7fdfaddc1a90>
>>> cycle_start = root.findall('cycle-start')
>>> cycle_start
[]                # Empty???
>>> cycle_start = root.findall('mem-info')
>>> print(cycle_start)
[]                # Empty???
>>>
>>> cycle_start = root.find('mem-info')
>>> cycle_start
>>> print(cycle_start)
None

>>> from lxml import etree
>>> tree = etree.parse("gc.log")
>>> root = tree.getroot()
>>> root.findall('mem-info', root.nsmap)

>>> root.nsmap
{None: 'http://www.ibm.com/j9/verbosegc'}

1. your XML as posted has only one root, `verbosegc` 2. to get element(s) other than the root, try [using `find()` or `findall()`](https://docs.python.org/2/library/xml.etree.elementtree.html#finding-interesting-elements) 3. show what you have done so far and we may be able to help build solution on top of that — har07, Dec 06 '15 at 02:43
Thanks @har07, you are right; the XML has a single root. See my update for the commands I used. — shijie xu, Dec 06 '15 at 14:08
Possible duplicate of [Parsing XML with namespace in Python via 'ElementTree'](http://stackoverflow.com/questions/14853243/parsing-xml-with-namespace-in-python-via-elementtree) — har07, Dec 06 '15 at 14:23
Thanks @har07. Regarding the namespace, how can I use lxml here so that I do not need to hard-code the namespace? See the code I added at the end of my post. Thanks — shijie xu, Dec 06 '15 at 14:54

har07 · Accepted Answer · 2015-12-07T04:34:35.703

That's because your XML declares a default namespace here:

xmlns="http://www.ibm.com/j9/verbosegc"

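Note that descendant elements implicitly inherit the default namespace of their ancestor, so every element in this document is in that namespace and an unqualified name such as 'cycle-start' matches nothing. You can use a prefix-to-namespace mapping to select elements in the namespace, for example: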

ns = {'d': 'http://www.ibm.com/j9/verbosegc'}
cycle_starts = root.findall('d:cycle-start', namespaces=ns)
print(cycle_starts)

mem_infos = root.findall('d:gc-start/d:mem-info', namespaces=ns)
print(mem_infos)

Output:

[<Element '{http://www.ibm.com/j9/verbosegc}cycle-start' at 0x29ae6a0>, <Element '{http://www.ibm.com/j9/verbosegc}cycle-start' at 0x29ae8d0>]
[<Element '{http://www.ibm.com/j9/verbosegc}mem-info' at 0x29ae780>, <Element '{http://www.ibm.com/j9/verbosegc}mem-info' at 0x29ae9b0>]
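
Building on the same prefix mapping, the pairing described in the question (free memory before and after each GC cycle, matched on contextid) could be sketched roughly as follows. This is only an illustration using the standard library: the function name free_by_cycle and the trimmed SAMPLE string are made up for the example, and it assumes each cycle has at most one gc-start and one gc-end.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the log from the question (illustrative only).
SAMPLE = """<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
<cycle-start id="4" type="scavenge" contextid="0" />
<gc-start id="5" type="scavenge" contextid="4">
  <mem-info id="6" free="3037768" total="4194304" percent="72"></mem-info>
</gc-start>
<gc-end id="8" type="scavenge" contextid="4">
  <mem-info id="9" free="3163968" total="4194304" percent="75"></mem-info>
</gc-end>
<cycle-end id="10" type="scavenge" contextid="4" />
<cycle-start id="16" type="scavenge" contextid="0" />
<gc-start id="17" type="scavenge" contextid="16">
  <mem-info id="18" free="3037664" total="4194304" percent="72"></mem-info>
</gc-start>
<gc-end id="20" type="scavenge" contextid="16">
  <mem-info id="21" free="3110592" total="4194304" percent="74"></mem-info>
</gc-end>
<cycle-end id="22" type="scavenge" contextid="16" />
</verbosegc>"""

NS = {"d": "http://www.ibm.com/j9/verbosegc"}


def free_by_cycle(root):
    """Map each cycle-start id to (free before GC, free after GC)."""
    result = {}
    for cycle in root.findall("d:cycle-start", NS):
        cid = cycle.get("id")
        # gc-start / gc-end refer back to their cycle through contextid.
        before = root.find(f"d:gc-start[@contextid='{cid}']/d:mem-info", NS)
        after = root.find(f"d:gc-end[@contextid='{cid}']/d:mem-info", NS)
        if before is not None and after is not None:
            result[int(cid)] = (int(before.get("free")), int(after.get("free")))
    return result


pairs = free_by_cycle(ET.fromstring(SAMPLE))
for cid, (before, after) in sorted(pairs.items()):
    print(cid, before, after)
```

The resulting dictionary can be fed directly to a plotting library; for a real log file, ET.parse('gc.trace').getroot() would replace ET.fromstring(SAMPLE).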

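Update:

In response to your comment, here is one possible way to avoid hard-coding the namespace. It relies on root.nsmap, which lxml provides and xml.etree.ElementTree does not:

# map the default namespace URI to the prefix d without hard-coding it (lxml only):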
ns = {'d': root.nsmap[None]}
result = root.findall('.//d:mem-info', namespaces=ns)
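
Note that root.nsmap is an lxml attribute. If only the standard library's xml.etree.ElementTree is available, a comparable result can be obtained by reading the URI out of root.tag, which has the form {uri}name. The sketch below assumes that approach; the helper name default_namespace is made up for the example, and the last lookup uses the {*} wildcard, which requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def default_namespace(root):
    """Return the namespace URI embedded in root.tag, or '' if there is none."""
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


root = ET.fromstring(
    '<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">'
    '<gc-start id="5" contextid="4">'
    '<mem-info id="6" free="3037768" total="4194304" percent="72" />'
    '</gc-start>'
    '</verbosegc>'
)

# Build the prefix mapping from the document itself instead of a literal URI.
ns = {"d": default_namespace(root)}
frees = [m.get("free") for m in root.findall(".//d:mem-info", ns)]

# Python 3.8+: match mem-info in any namespace, with no mapping at all.
frees_wild = [m.get("free") for m in root.findall(".//{*}mem-info")]

print(ns, frees, frees_wild)
```

The wildcard form is the shortest, but it also matches same-named elements from other namespaces, so the explicit mapping is the safer choice when a document mixes vocabularies.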

As an aside, I'd suggest using lxml's xpath() method instead of findall(), since the former provides better support for standard XPath 1.0 expressions, which will be useful in more complex situations.
