I failed to parse an XML file (it is a GC history log). A sample of the XML is shown below.

<?xml version="1.0" ?>

<verbosegc xmlns="http://www.ibm.com/j9/verbosegc" version="R28_jvm.28_20150612_0201_B252774_CMPRSS">

<initialized id="1" timestamp="2015-12-04T20:17:07.219">
  <attribute name="gcPolicy" value="-Xgcpolicy:gencon" />
  <attribute name="maxHeapSize" value="0x20000000" />
  <attribute name="initialHeapSize" value="0x400000" />
</initialized>

<cycle-start id="4" type="scavenge" contextid="0" timestamp="2015-12-04T20:17:10.677" intervalms="3457.977" />
<gc-start id="5" type="scavenge" contextid="4" timestamp="2015-12-04T20:17:10.677">
  <mem-info id="6" free="3037768" total="4194304" percent="72">
  </mem-info>
</gc-start>
<gc-end id="8" type="scavenge" contextid="4" durationms="0.807" usertimems="0.000" systemtimems="0.000" timestamp="2015-12-04T20:17:10.678" activeThreads="2">
  <mem-info id="9" free="3163968" total="4194304" percent="75">
  </mem-info>
</gc-end>
<cycle-end id="10" type="scavenge" contextid="4" timestamp="2015-12-04T20:17:10.678" />
<cycle-start id="16" type="scavenge" contextid="0" timestamp="2015-12-04T20:17:10.742" intervalms="64.838" />
<gc-start id="17" type="scavenge" contextid="16" timestamp="2015-12-04T20:17:10.742">
  <mem-info id="18" free="3037664" total="4194304" percent="72">
  </mem-info>
</gc-start>
 <gc-end id="20" type="scavenge" contextid="16" durationms="0.649" usertimems="0.000" systemtimems="0.000" timestamp="2015-12-04T20:17:10.743" activeThreads="2">
  <mem-info id="21" free="3110592" total="4194304" percent="74">
  </mem-info>
</gc-end>
<cycle-end id="22" type="scavenge" contextid="16" timestamp="2015-12-04T20:17:10.743" />
<allocation-satisfied id="23" threadId="0000000002E10500" bytesRequested="416" />


</verbosegc>

I want to extract the free attribute of the mem-info elements in gc-start and gc-end, both of which are enclosed by cycle-start and cycle-end tags and share the same contextid. For example, the first two mem-info values are 3037768 and 3163968; the corresponding contextid is 4, which equals the cycle-start id. With this data I can plot the memory footprint.

My main problem is that I could not parse the XML successfully with the approach from "XML parse python". getroot() works, but every other find/findall call returns an empty result. Is there another way to do this? Thanks.

Here are my tries:

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('gc.trace')
>>> tree
<xml.etree.ElementTree.ElementTree object at 0x7fdfaddc19d0>
>>> root = tree.getroot()
>>> root
<Element '{http://www.ibm.com/j9/verbosegc}verbosegc' at 0x7fdfaddc1a90>
>>> cycle_start = root.findall('cycle-start')
>>> cycle_start  # empty?
[]
>>> cycle_start = root.findall('mem-info')
>>> print cycle_start  # empty as well?
[]
>>> cycle_start = root.find('mem-info')
>>> cycle_start
>>> print cycle_start
None

>>> from lxml import etree
>>> tree = etree.parse("gc.log")
>>> root = tree.getroot()
>>> root.findall('mem-info', root.nsmap)

>>> root.nsmap
{None: 'http://www.ibm.com/j9/verbosegc'}
shijie xu
  • 1. Your XML as posted has only one root, `verbosegc`. 2. To get element(s) other than the root, try [using `find()` or `findall()`](https://docs.python.org/2/library/xml.etree.elementtree.html#finding-interesting-elements). 3. Show what you have done so far and we may be able to help build a solution on top of that. – har07 Dec 06 '15 at 02:43
  • Thanks @har07, you are right; it is single-root XML. See my update for the commands I used. – shijie xu Dec 06 '15 at 14:08
  • Now I can see what the actual problem is.. – har07 Dec 06 '15 at 14:22
  • Possible duplicate of [Parsing XML with namespace in Python via 'ElementTree'](http://stackoverflow.com/questions/14853243/parsing-xml-with-namespace-in-python-via-elementtree) – har07 Dec 06 '15 at 14:23
  • Thanks @har07. Regarding the namespace, how can I use lxml here so that I do not need to hard-code the namespace? See the code I added at the end of my post. Thanks – shijie xu Dec 06 '15 at 14:54
  • see **update** section of my answer.. – har07 Dec 07 '15 at 04:35

1 Answer

That's because your XML has a default namespace, declared here:

xmlns="http://www.ibm.com/j9/verbosegc"

Note that descendant elements implicitly inherit the default namespace of their ancestor. You can use a prefix-to-namespace mapping to select elements in that namespace, for example:

ns = {'d': 'http://www.ibm.com/j9/verbosegc'}
cycle_starts = root.findall('d:cycle-start', namespaces=ns)
print(cycle_starts)

mem_infos = root.findall('d:gc-start/d:mem-info', namespaces=ns)
print(mem_infos)

Output:

[<Element '{http://www.ibm.com/j9/verbosegc}cycle-start' at 0x29ae6a0>, <Element '{http://www.ibm.com/j9/verbosegc}cycle-start' at 0x29ae8d0>]
[<Element '{http://www.ibm.com/j9/verbosegc}mem-info' at 0x29ae780>, <Element '{http://www.ibm.com/j9/verbosegc}mem-info' at 0x29ae9b0>]
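Building on the same mapping, here is a minimal sketch of how the `free` values could be collected per cycle, which is what the question is ultimately after. The helper name `free_by_context` and the embedded `SAMPLE` string (a trimmed copy of the posted log) are assumptions made for illustration; a real log would be loaded with `ET.parse()` instead.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample log from the question (two scavenge cycles).
SAMPLE = """<?xml version="1.0" ?>
<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
<cycle-start id="4" type="scavenge" contextid="0" />
<gc-start id="5" type="scavenge" contextid="4">
  <mem-info id="6" free="3037768" total="4194304" percent="72"></mem-info>
</gc-start>
<gc-end id="8" type="scavenge" contextid="4">
  <mem-info id="9" free="3163968" total="4194304" percent="75"></mem-info>
</gc-end>
<cycle-end id="10" type="scavenge" contextid="4" />
<cycle-start id="16" type="scavenge" contextid="0" />
<gc-start id="17" type="scavenge" contextid="16">
  <mem-info id="18" free="3037664" total="4194304" percent="72"></mem-info>
</gc-start>
<gc-end id="20" type="scavenge" contextid="16">
  <mem-info id="21" free="3110592" total="4194304" percent="74"></mem-info>
</gc-end>
<cycle-end id="22" type="scavenge" contextid="16" />
</verbosegc>"""

NS = {'d': 'http://www.ibm.com/j9/verbosegc'}


def free_by_context(root, ns=NS):
    """Hypothetical helper: map each cycle id to its gc-start/gc-end free bytes."""
    # Ids of all cycle-start elements; gc-start/gc-end refer to them via contextid.
    cycle_ids = {c.get('id') for c in root.findall('d:cycle-start', namespaces=ns)}
    result = {}
    for phase in ('gc-start', 'gc-end'):
        for gc in root.findall('d:' + phase, namespaces=ns):
            ctx = gc.get('contextid')
            mem = gc.find('d:mem-info', namespaces=ns)
            # Skip entries that belong to no known cycle or carry no mem-info.
            if ctx not in cycle_ids or mem is None:
                continue
            result.setdefault(ctx, {})[phase] = int(mem.get('free'))
    return result


root = ET.fromstring(SAMPLE)
for ctx, values in free_by_context(root).items():
    print(ctx, values['gc-start'], values['gc-end'])
```

Each entry pairs the free memory before and after a collection, so the two numbers per cycle can be fed straight into a plot.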

Update:

Responding to your comment, this is one possible way to avoid hard-coding the namespace with lxml, whose elements expose an nsmap attribute:

# Map the default namespace URI to the prefix "d" without hard-coding it (lxml only):
ns = {'d': root.nsmap[None]}
result = root.findall('.//d:mem-info', namespaces=ns)
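For the standard-library `xml.etree.ElementTree` used in the first snippets, whose elements have no `nsmap`, a similar effect can be sketched by reading the URI out of the root tag. The helper name `default_namespace` and the shortened `SAMPLE` string are assumptions for illustration, and the sketch assumes the root element itself is in the default namespace.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the log file from the question.
SAMPLE = """<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
<gc-start id="5" type="scavenge" contextid="4">
  <mem-info id="6" free="3037768" total="4194304" percent="72"></mem-info>
</gc-start>
</verbosegc>"""


def default_namespace(element):
    """Hypothetical helper: pull the URI out of a '{uri}name' style tag."""
    tag = element.tag
    if not tag.startswith('{'):
        raise ValueError('element is not in a namespace: %r' % tag)
    return tag[1:tag.index('}')]


root = ET.fromstring(SAMPLE)
ns = {'d': default_namespace(root)}  # URI taken from the document, not hard-coded
mem_infos = root.findall('.//d:mem-info', namespaces=ns)
print([m.get('free') for m in mem_infos])
```

On Python 3.8 or later, ElementTree also accepts a `{*}` wildcard in place of the namespace, so `root.findall('.//{*}mem-info')` should match the same elements without any mapping.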

As an aside, I'd suggest using lxml's xpath() method instead of findall(), since the former offers fuller support for standard XPath 1.0 expressions, which is useful in more complex situations.

har07