
I have the following XML:

<Content>
<article title="I Compute, Therefore I am" id="a1">
        <authors>
            <author>Philbert von Cookie</author>
            <author>Alice Brockman</author>
            <author>Pedro Smith</author>
        </authors>
        <journal>
            <name>Journal of Computational Metaphysics</name>
            <volume>3</volume>
            <issue>7</issue>
            <published>04/11/2006</published>
            <pages start="42" end="49"/>
        </journal>
</article>
...
</Content>

The root element, Content, contains many similar article nodes.

I have parsed my XML in Python and want to get the maximum date value. Here is my Python code:

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='data.xml')
root = tree.getroot()
root.tag, root.attrib

I am trying to get it using iterfind(), but this does not work so far:

for elem in tree.iterfind('(/*/*/journal/published/value[not(text() < preceding-sibling::value/text()) and not(text() < following-sibling::value/text())])[1]'):
 print (elem.text)

Could you please help me work out how to write the XPath for iterfind(), or suggest another way to do this? Thank you.

Don Korleone

1 Answer


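xml.etree.ElementTree provides only limited XPath support: axes such as preceding-sibling and following-sibling, and comparisons on text(), are not part of the supported subset, so the expression in the question cannot work with iterfind().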
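
To make "limited" more concrete, here is a minimal sketch of path expressions that ElementTree does accept: relative paths, descendant search with `//`, and simple attribute or child-text predicates. The inline `SAMPLE` document and the second article in it are made up for illustration; only the element names come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data mirroring the structure shown in the question.
SAMPLE = """
<Content>
  <article title="I Compute, Therefore I am" id="a1">
    <journal>
      <name>Journal of Computational Metaphysics</name>
      <volume>3</volume>
      <published>04/11/2006</published>
    </journal>
  </article>
  <article title="A Second Article" id="a2">
    <journal>
      <name>Another Journal</name>
      <volume>5</volume>
      <published>05/25/2002</published>
    </journal>
  </article>
</Content>
"""

root = ET.fromstring(SAMPLE)

# Descendant search with a relative path.
published = [e.text for e in root.iterfind('.//article/journal/published')]

# Attribute predicate on a direct child of the root.
by_id = root.find("article[@id='a2']")

# Child-text predicate: journals whose <volume> text equals '3'.
by_volume = root.findall(".//journal[volume='3']")

print(published)
print(by_id.get('title'))
print(len(by_volume))
```

Anything beyond this subset, such as finding a maximum, has to be done in Python after the elements have been collected.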

One alternative option would be to parse all dates into a list and get the maximum value:

from datetime import datetime

dates = [published.text for published in root.iterfind('.//article/journal/published')]
print(max(dates, key=lambda x: datetime.strptime(x, '%d/%m/%Y')))

Note that to find the maximum here, you have to compare datetime values, not strings; this is what the key function is for.
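
A small sketch of why the key function matters. The two date strings are made-up examples in the `%d/%m/%Y` format used above; comparing them as plain strings picks the wrong one because strings are compared character by character.

```python
from datetime import datetime

# Hypothetical dates in day/month/year order.
dates = ['31/12/2005', '02/01/2006']

# Plain string comparison: '3' sorts after '0', so the 2005 date wins.
latest_as_string = max(dates)

# Comparison on parsed datetime values: the 2006 date wins.
latest_as_date = max(dates, key=lambda d: datetime.strptime(d, '%d/%m/%Y'))

print(latest_as_string)  # 31/12/2005
print(latest_as_date)    # 02/01/2006
```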


Also, if you want the journal record that corresponds to the latest date, you can build a dictionary mapping "date -> journal" and then look up the appropriate journal record:

from datetime import datetime
import operator

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='data.xml')
root = tree.getroot()

mapping = {datetime.strptime(journal.findtext('published'), '%d/%m/%Y'): journal 
           for journal in root.iterfind('.//article/journal')}

# items() works on both Python 2 and 3; iteritems() exists only on Python 2
journal_latest = max(mapping.items(), key=operator.itemgetter(0))[1]
print(journal_latest.findtext('name'))
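
As a possible variation, not part of the code above: a dictionary keyed by date keeps only one journal per date, so a sketch that avoids the dictionary is to call max() on the article elements directly. This also keeps the article at hand, so its title attribute is available. The inline `SAMPLE` data, the `published_date` helper, and the `%m/%d/%Y` format are assumptions for illustration; use whichever format matches your data.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Hypothetical sample data mirroring the structure shown in the question.
SAMPLE = """
<Content>
  <article title="I Compute, Therefore I am" id="a1">
    <journal>
      <name>Journal of Computational Metaphysics</name>
      <published>04/11/2006</published>
    </journal>
  </article>
  <article title="A Second Article" id="a2">
    <journal>
      <name>Another Journal</name>
      <published>05/25/2002</published>
    </journal>
  </article>
</Content>
"""

DATE_FORMAT = '%m/%d/%Y'  # assumption: month/day/year


def published_date(article):
    """Hypothetical helper: parse an article's journal/published text."""
    return datetime.strptime(article.findtext('journal/published'), DATE_FORMAT)


root = ET.fromstring(SAMPLE)

# max() compares the parsed dates but returns the article element itself.
latest_article = max(root.iterfind('article'), key=published_date)

print(latest_article.get('title'))
print(latest_article.findtext('journal/name'))
```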
alecxe
  • I can't run the last part of the code. Python raises an error:
        Traceback (most recent call last):
          File "find.py", line 13, in <module>
            for journal in root.iterfind('.//article/journal')}
          File "find.py", line 13, in <dictcomp>
            for journal in root.iterfind('.//article/journal')}
          File "C:\Python34\lib\_strptime.py", line 500, in _strptime_datetime
            tt, fraction = _strptime(data_string, format)
          File "C:\Python34\lib\_strptime.py", line 337, in _strptime
            (data_string, format))
        ValueError: time data '05/25/2002' does not match format '%d/%m/%Y'
    – Don Korleone Oct 18 '14 at 02:15
  • @DonKorleone the format is not what I was expecting, swap month and day: `%m/%d/%Y` instead of `%d/%m/%Y`. – alecxe Oct 18 '14 at 02:17
  • And how do you use journal.findtext('published')? What is journal? If it is the node's name, where was it defined? Sorry, this may be a simple question, but I am new to Python, so I would appreciate it if you could explain step by step. – Don Korleone Oct 18 '14 at 02:25
  • If I change the date format, it does not match. So initially it was right. – Don Korleone Oct 18 '14 at 02:28
  • @DonKorleone well, `05/25/2002` is clearly `%m/%d/%Y`. `journal` is every resulting node found by `.//article/journal` xpath expression. Not sure how to help more - it works for me on the input you've provided. – alecxe Oct 18 '14 at 03:20
  • Finally it worked. Thank you so much. One last question: what should I do to get the article's title attribute value? – Don Korleone Oct 19 '14 at 04:01
  • @DonKorleone good, glad it helped. Well, in the current state of the code, we don't have a corresponding `article` for each `journal`, and since ElementTree doesn't support getting the parent of a node (http://stackoverflow.com/questions/2170610/access-elementtree-node-parent-node), we should change the way we iterate over the nodes. Could you show the current code you said is working? Thanks. – alecxe Oct 19 '14 at 05:01
  • Hi. This code is working now:
        from datetime import datetime
        import operator

        try:
            import xml.etree.cElementTree as ET
        except ImportError:
            import xml.etree.ElementTree as ET

        tree = ET.ElementTree(file='data.xml')
        root = tree.getroot()

        mapping = {datetime.strptime(journal.findtext('published'), '%m/%d/%Y'): journal
                   for journal in root.iterfind('.//article/journal')}

        journal_latest = max(mapping.items(), key=operator.itemgetter(0))[1]
        print(journal_latest.findtext('name'))
    – Don Korleone Oct 19 '14 at 05:28
  • OK, I was able to get the parent node's attribute and print it. Thank you so much for your help, alecxe. I appreciate it. – Don Korleone Oct 20 '14 at 03:26
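
The thread does not show how the parent's attribute was finally retrieved. One common workaround for ElementTree's missing parent access, sketched here as an assumption rather than as the code actually used, is to build a child-to-parent dictionary once and look the journal's article up in it. The `SAMPLE` data, `parent_of`, and `published_date` names are hypothetical.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Hypothetical sample data mirroring the structure shown in the question.
SAMPLE = """
<Content>
  <article title="I Compute, Therefore I am" id="a1">
    <journal>
      <name>Journal of Computational Metaphysics</name>
      <published>04/11/2006</published>
    </journal>
  </article>
  <article title="A Second Article" id="a2">
    <journal>
      <name>Another Journal</name>
      <published>05/25/2002</published>
    </journal>
  </article>
</Content>
"""

DATE_FORMAT = '%m/%d/%Y'  # the format that matched the data in the comments

root = ET.fromstring(SAMPLE)

# Map every element to its parent by walking the tree once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def published_date(journal):
    """Hypothetical helper: parse a journal's <published> text."""
    return datetime.strptime(journal.findtext('published'), DATE_FORMAT)


journal_latest = max(root.iterfind('.//article/journal'), key=published_date)

# The parent of the latest <journal> is its <article>.
article_title = parent_of[journal_latest].get('title')
print(article_title)
```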