How to obtain the root of a tree without parsing the entire file?

Question

I'm making an xml parser to parse xml reports from different tools, and each tool generates different reports with different tags.

For example:

Arachni generates an xml report with <arachni_report></arachni_report> as tree root tag.

nmap generates an xml report with <nmaprun></nmaprun> as tree root tag.

I'm trying not to parse the entire file unless it's a valid report from any of the tools I want.

First thing I thought to use was ElementTree, parse the entire xml file (supposing it contains valid xml), and then check based on the tree root if the report belongs to Arachni or nmap.

I'm currently using cElementTree, and as far as I know getroot() is not an option here, because it requires parsing the whole file first. My goal is for this parser to operate on recognized files only, without parsing unnecessary files.

By the way, I'm still learning about XML parsing. Thanks in advance.

@Matt. Well, if you don't want to parse an XML file, it seems to me that you must do a simple treatment of text with simpler tools than a XML parser: built-in string methods / regexes. I don't know very well XML, can you confirm me that the last line is always the end-tag of the root ? It could be sufficient to read this line to get the tree root tag. — eyquem, Mar 01 '11 at 22:33
@eyquem I do want to parse xml files if they are reports from nmap or Arachni. And yes, in this cases the last line is always the end-tag of the root, but in the future I may find reports with comments at the bottom, like: stuff — matta, Mar 01 '11 at 22:48

John Machin · Accepted Answer · 2011-03-02T22:33:53.237

"simple string methods" are the root [pun intended] of all evil -- see examples below.

Update 2: Code and output now show that the proposed regexes also don't work very well.

Use ElementTree. The function that you are looking for is iterparse. Enable "start" events. Bale out on the first iteration.

Code:

# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re

xml_text_1 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo>bar</foo></root
>
"""

xml_text_2 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo>bar</foo></root
>
<!--
That's all, folks! 
-->
"""

xml_text_3 = '''<?xml version="1.0" ?>
<!-- <mole1> -->
<root><foo /></root>
<!-- </mole2> -->'''

xml_text_4 = '''<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'''

for xml_text in (xml_text_1, xml_text_2, xml_text_3, xml_text_4):
    print
    chrstr = xml_text.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x:]
    print "*** eyquem 1:", repr(lastline.strip())

    chrstr = xml_text.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x+1:]
    if lastline[0:5]=='<!-- ':
        chrstr = xml_text[0:x].rstrip()
        x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
        print "*** eyquem 2:", repr(chrstr[x+1:])
    else:
        print "*** eyquem 2:", repr(lastline)

    m = None
    for m in re.finditer('^</[^>]+>', xml_text, re.MULTILINE):
        pass
    if m: print "*** eyquem 3:", repr(m.group())
    else: print "*** eyquem 3:", "FAIL"

    m = None
    for m in re.finditer('</[^>]+>', xml_text):
        pass
    if m: print "*** eyquem 4:", repr(m.group())
    else: print "*** eyquem 4:", "FAIL"

    m = re.search('^<(?![?!])[^>]+>', xml_text, re.MULTILINE)
    if m: print "*** eyquem 5:", repr(m.group())
    else: print "*** eyquem 5:", "FAIL"

    m = re.search('<(?![?!])[^>]+>', xml_text)
    if m: print "*** eyquem 6:", repr(m.group())
    else: print "*** eyquem 6:", "FAIL"

    filelike_obj = StringIO(xml_text)
    tree = et.parse(filelike_obj)
    print "*** parse:", tree.getroot().tag

    filelike_obj = StringIO(xml_text)
    for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
        print "*** iterparse:", elem.tag
        break

The ElementTree-related code above works with Python 2.5 to 2.7. It will also work with Python 2.2 to 2.4; you just need to get ElementTree and cElementTree from effbot.org and do some conditional importing. It should work with any lxml version.

Output:

*** eyquem 1: '>'
*** eyquem 2: '>'
*** eyquem 3: FAIL
*** eyquem 4: '</root\n>'
*** eyquem 5: '<root\n>'
*** eyquem 6: '<root\n>'
*** parse: root
*** iterparse: root

*** eyquem 1: '-->'
*** eyquem 2: '-->'
*** eyquem 3: FAIL
*** eyquem 4: '</root\n>'
*** eyquem 5: '<root\n>'
*** eyquem 6: '<root\n>'
*** parse: root
*** iterparse: root

*** eyquem 1: '<!-- </mole2> -->'
*** eyquem 2: '<root><foo /></root>'
*** eyquem 3: FAIL
*** eyquem 4: '</mole2>'
*** eyquem 5: '<root>'
*** eyquem 6: '<mole1>'
*** parse: root
*** iterparse: root

*** eyquem 1: '>'
*** eyquem 2: '<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'
*** eyquem 3: FAIL
*** eyquem 4: '</mole2>'
*** eyquem 5: FAIL
*** eyquem 6: '<mole1>'
*** parse: root
*** iterparse: root

Update 1: The above was demonstration code. Below is more like implementation code... just add exception handling. Tested with Python 2.7 and 2.2.

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import cElementTree as ET

def get_root_tag_from_xml_file(xml_file_path):
    result = f = None
    try:
        f = open(xml_file_path, 'rb')
        for event, elem in ET.iterparse(f, ('start', )):
            result = elem.tag
            break
    finally:
        if f: f.close()
    return result

if __name__ == "__main__":
    import sys, glob
    for pattern in sys.argv[1:]:
        for filename in glob.glob(pattern):
            print filename, get_root_tag_from_xml_file(filename)
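
For the exception handling mentioned above, here is one possible sketch. It assumes Python 3, and both the function name `get_root_tag` and the choice to return `None` on any failure are illustrative assumptions rather than part of the code above.

```python
import xml.etree.ElementTree as ET


def get_root_tag(xml_file_path):
    """Return the root tag of an XML file, or None if it cannot be determined.

    Hypothetical helper: returning None on failure is an assumption made
    for this sketch, not something the original code prescribes.
    """
    try:
        with open(xml_file_path, "rb") as f:
            # Only 'start' events are requested, so the first event
            # belongs to the root element; stop right there.
            for _event, elem in ET.iterparse(f, events=("start",)):
                return elem.tag
    except (OSError, ET.ParseError):
        # The file is unreadable, or is not well-formed XML
        # up to and including the root start-tag.
        return None
    return None  # no events were produced at all
```

Because the loop returns on the first event, only the beginning of the file is read, however large the report is.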

Thanks John!, I implemented your recommendation editing on my own answer, check it if you want! — matta, Mar 02 '11 at 05:14

eyquem · Answer 2 · 2011-03-02T00:36:25.200


Does this seem interesting to a connoisseur of XML ? :

ch = """\
<?xml version="1.0" encoding="ISO-8859-1" ?> 
<!--  Edited by XMLSpy® --> 
<CATALOG>
 <CD>
  <TITLE>Empire Burlesque</TITLE> 
  <ARTIST>Bob Dylan</ARTIST> 
  <COUNTRY>USA</COUNTRY> 
  <COMPANY>Columbia</COMPANY> 
  <PRICE>10.90</PRICE> 
  <YEAR>1985</YEAR> 
 </CD>
 <CD>
  <TITLE>Hide your heart</TITLE> 
  <ARTIST>Bonnie Tyler</ARTIST> 
  <COUNTRY>UK</COUNTRY> 
  <COMPANY>CBS Records</COMPANY> 
  <PRICE>9.90</PRICE> 
  <YEAR>1988</YEAR> 
 </CD>
</CATALOG>
<!-- This is the end of arachni report --> 

"""

chrstr = ch.strip()
x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
lastline = chrstr[x+1:]
if lastline[0:5]=='<!-- ':
    chrstr = ch[0:x].rstrip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    print chrstr[x+1:]
else:
    print lastline

result, still

</CATALOG>

If necessary, one could also verify that the start-tag of the tree root appears near the beginning of the file.

If the file is big, we can speed things up by moving the file pointer near the end of the file (say 200 or 600 characters before the end) and searching only that 200- or 600-character string (the end-tag of the tree root of an XML document isn't longer than that, is it?)

from os.path import getsize

with open('I:\\uuu.txt') as f:

    L = getsize('I:\\uuu.txt')
    print 'L==',L

    f.seek( -min(600,L) , 2)
    ch = f.read()
    if '\r' not in ch and '\n' not in ch:
        f.seek(0,0)
        ch = f.read()        

    chrstr = ch.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x+1:]
    if lastline[0:5]=='<!-- ':
        chrstr = ch[0:x].rstrip()
        x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
        print chrstr[x+1:]
    else:
        print lastline

edited Mar 02 '11 at 00:36

answered Mar 01 '11 at 22:46


@eyquem I tuned your code a little so I can load any file I want, and it's working like a charm :). Parsing duration: real 0m0.031s, user 0m0.020s, sys 0m0.013s – matta Mar 01 '11 at 23:15
My answer doesn't deserve to be the accepted answer. I was even tempted to delete it. Its acceptance has been removed, and I find that OK. But I don't think this answer deserves to be downvoted; it may give ideas for a similar problem where it could be more convenient. Well, my sleep won't be troubled by that, anyway – eyquem Mar 02 '11 at 00:39
@eyquem: Words almost fail me ... all I can say is "Whac-A-Mole" :-) – John Machin Mar 02 '11 at 07:07
@John Machin I read the Wikipedia article on the colloquial usage of "Whac-a-mole". I don't know whether you use it in the sense of _"repetitious and futile task"_ or _"phenomenon of fending off recurring spammers, vandals or miscreants"_. And I don't understand why you use it: because of my comment(s)? Because of the complexity of the above code? Because this code analyzes an XML file without a parser, despite your remark yesterday on another answer to the _Adding sibling element in ElementTree with Python_ question? – eyquem Mar 02 '11 at 10:24
@eyquem: in the sense that you thought that you had the moles suppressed and then the OP found another problem (comment at the end) and you had to whack that and then I found a couple of more problems that needed whacking etc etc. You are NOT analysing an XML file -- don't kid yourself. You are kludging about with a mole hammer. – John Machin Mar 02 '11 at 12:37
@John Machin Note that English isn't my native language, hence I don't understand overly specialized words such as 'kludge' and 'mole'. But I understand that you reproach me for proposing overly complicated code for what seems a very simple need. I now think so too, but how else could it be done? I mean: how to catch the root's end-tag specifically? I agree with you that trying to catch this end-tag isn't really a good idea: too complicated, with no real advantage. – eyquem Mar 02 '11 at 15:27
@John Machin I came to this solution because I felt a little reluctant to catch the root's start-tag in _«the noise before the root element»_ (as termed by Ned Batchelder): I don't know XML well, and I doubted that a line near the beginning could be identified with confidence as the root's start-tag. Trying to catch the end-tag seemed safer to me. Well, you can criticise this idea and the chosen option, and in the end I agree with you on this point. – eyquem Mar 02 '11 at 15:31
@John Machin But you can't say that, the option having been chosen, the answer given doesn't work. The choice to catch the end-tag isn't so good, but my code for doing it isn't completely bad. In fact, my shortcoming is often trying to fulfill more requirements than were asked for. – eyquem Mar 02 '11 at 15:32
@John Machin Here, I tried 1) to catch end-tags, even ones not known in advance, and 2) to do it without a parser (though the OP didn't really require that: he only wanted not to parse the text entirely), and even without a regex (it would be less complicated with one). So it's easy to say that my code is bad, but it isn't true: it's code that walks in clumsy shoes, that's all. In fact, in the present case, I share your opinion: he needs to use a parser, and that's it! Next time, I won't try to fulfill unjustified demands, and I won't try to answer on a subject I don't know well – eyquem Mar 02 '11 at 15:33
@eyquem, your code is fine if what I'm trying to parse is a structured, well-formatted XML file – matta Mar 03 '11 at 04:23

score 0 · Answer 3 · answered Mar 01 '11 at 23:00


My understanding of your problem is this: You want to examine a file to determine if it is one of your recognized formats, and only parse it as XML if you know that it is one of the recognized formats. @eyquem is right: you should use simple string methods.

The simplest thing to do is to read some small amount from the beginning of the file, and see if it has a root element you recognize:

f = open(the_file)
head = f.read(200)
f.close()
if "<arachni_report" in head:
    pass  # .. re-open and parse as arachni ..
elif "<nmaprun" in head:
    pass  # .. re-open and parse as nmaprun ..

This method has the advantage that only a small amount of the file is read before determining whether it's an interesting file or not.
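
As a slightly more reusable sketch of the same idea, assuming Python 3: the names `KNOWN_ROOTS` and `sniff_report_type` and the default `chunk_size` are hypothetical, and a plain substring test like this can be fooled if a marker happens to appear inside a comment before the real root.

```python
# Hypothetical mapping from a root start-tag prefix to a report name.
KNOWN_ROOTS = {
    b"<arachni_report": "arachni",
    b"<nmaprun": "nmap",
}


def sniff_report_type(path, chunk_size=4096):
    """Guess the report type from the first chunk_size bytes of the file.

    Returns a name from KNOWN_ROOTS, or None if no marker is found.
    chunk_size is an assumed bound on how far into the file the root
    start-tag can be; raise it if your reports have a long preamble.
    """
    with open(path, "rb") as f:
        head = f.read(chunk_size)
    for marker, name in KNOWN_ROOTS.items():
        if marker in head:
            return name
    return None
```

Reading in binary mode sidesteps any guess about the file's encoding, since the markers are plain ASCII.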

answered Mar 01 '11 at 23:00

Ned Batchelder


Good solution if the "names" of the tree root tag are known in advance. But it's a restriction. The start-tag of tree root can still be detected as the first tag, though ... – eyquem Mar 01 '11 at 23:24
This is much easier, both answers are great, but which one is the best in this case? I mean, as far as I tested them with 'time' command in my Linux they are quite the same on speed. – matta Mar 01 '11 at 23:33
@eyquem Yeah, the start-tag and the end-tag from the reports are both the root tag. – matta Mar 01 '11 at 23:35
I moderately like the principle of finding the start-tag of the tree root on the basis that it will be the first line that is not the declaration line, not a comment line, not a processing instruction, not a CDATA section (if one can occur at the beginning of an XML file) or whatever else. But I wonder whether all my messy business of processing the end of the file is not in fact a drop hammer to crack a nut. And it is long. In the end, I don't find it very good. – eyquem Mar 01 '11 at 23:55
You're wrong eyquem, because scans reports from nmap may sometimes have lots of irrelevant data to parse on the beginning of the file (irrelevant to me), for example: – matta Mar 02 '11 at 00:11

matta · Answer 4 · 2011-03-03T05:06:45.233

Final edit: Thanks to John Machin I'll be using this following code (this is a draft) based on his answer (which is the one I selected as correct).

I'd also like to thank eyquem for his responses and his persistence in defending his code. I really learned a lot :)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from cStringIO import StringIO

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

class reportRecognizer(object):

    def __init__(self, xml_report_path):
        self.report_type = ""

        root_tag = self.get_root_tag_from_xml_file(xml_report_path)

        if root_tag:
            self.report_type = self.rType(root_tag)


    def get_root_tag_from_xml_file(self, xml_file_path):
        result = f = None
        try:
            f = open(xml_file_path, 'rb')
            for event, elem in ET.iterparse(f, ('start', )):
                result = elem.tag
                break
        except IOError, err:
            print "Error while opening file.\n%s. %s" % (err, xml_file_path)
        finally:
            if f: f.close()
        return result

    def rType(self, tag):
        if "arachni_report" == tag:
            return "Arachni XML Report"
        elif "nmaprun" == tag:
            return "Nmap XML Report"
        else:
            return "Unrecognized XML file, sorry."

report = reportRecognizer('report.xml')
print report.report_type

The simpler solution is to just read a larger chunk from the head. I'm sure there's a bound you can place on how far into the file the root open tag can be. Alternately, the noise before the root element is also indicative of what type of report you have in hand. But it sounds like you've found a solution, so it's win-win. — Ned Batchelder, Mar 02 '11 at 01:31
@Matt.: You don't want to do both `iterparse` and `parse`! You don't want to read the whole of your file and wrap its contents in a StringIO object! Using StringIO was just a device so that the xml examples could reside in my demonstration code. See my updated answer. — John Machin, Mar 02 '11 at 06:37

eyquem · Answer 5 · 2011-03-02T16:35:45.333


Are you serious, John Machin , when you show that my code wouldn't work correctly ?

Since I don't know the XML format well, I went here:

W3C's XML validator

Conclusion is that your text samples are not well-formed. Hence :

« The definition of an XML document excludes texts which contain violations of well-formedness rules; they are simply not XML. »

http://en.wikipedia.org/wiki/XML
The real evil is here.

Did you mean that I was supposed to write code able to detect the tree root's tag in non-XML files? I didn't know I had this extra requirement to fulfill.

Here's some code that is a little less frightening than the version using only string methods. I didn't give it before because I would have received notifications that "..whisp...you MUST not employ regexes to analyse an XML text... whisp whisp"

import re
from os.path import getsize

with open('I:\\uuu.txt') as f:

    L = getsize('I:\\uuu.txt')
    print 'L==',L

    f.seek( -min(600,L) , 2)
    mat = None
    for mat in re.finditer('^</[^>]+>',f.read(),re.MULTILINE):
        pass  # keep only the last match
    print mat.group() if mat else 'FAIL'

The same could be done in the noisier beginning of the XML.

In fact, I prefer the solution given by John Machin, with the iterparse() function of ElementTree, and that's it !

.

EDIT

After all, I wonder why this wouldn't be enough....

import re

with open('I:\\uuu.txt') as f:
    print re.search('^<(?![?!])[^>]+>',f.read(),re.MULTILINE).group()

edited Mar 02 '11 at 16:35

answered Mar 02 '11 at 16:18


@eyquem: I beg to differ with your conclusions. I have pasted each of my sample texts into that validator and its conclusion was **This document was successfully checked as well-formed XML!** together with 3 warnings (1 about no DOCTYPE, 1 about no encoding declared (but the default is UTF-8) and 1 about the fact that pasting ensures UTF-8). Your sample text gets the 1st and 3rd warnings. What did you paste, and what error messages did you get? – John Machin Mar 02 '11 at 19:10
@eyquem: If you are sure that my xml sample texts are not well formed, please raise bug reports on (1) Python (2) lxml -- as shown by my code (which performed a full parse, as well as an iterparse), those parsers emit no error messages when parsing my samples. Also write directly to the effbot telling him that ElementTree is aiding and abetting the propagation of evil. Don't forget to publish his response. – John Machin Mar 02 '11 at 19:57
@John Machin I took a long time to answer because I was occupied a lot trying to get in touch with Barack Obama too, in order to warn him even before Python, lxml team, effbot, and the Supreme Courts of all countries. – eyquem Mar 02 '11 at 22:24
@John Machin But it finally turned out that you and your texts are right. I tried to reproduce what I believed I saw, but of course I couldn't, since they are OK. I understand absolutely nothing about what drove me to this bewildering conclusion. – eyquem Mar 02 '11 at 22:25
@John Machin I have a little program that extracts the code blocks present in a Stack Overflow page, because when I copy-paste them by hand, the newlines are not caught and I get each block on one line. I tested again like this: I extracted the code with my program, copy-pasted each text into a .txt file, changed the extensions, and tested them using validation by File Upload. They came out OK and I was astounded. – eyquem Mar 02 '11 at 22:26
@John Machin The most probable explanation is that I had in fact tested other files among the various ones in the directory. For I had already understood that the warnings are not sufficient to declare a text not well-formed, and I didn't misinterpret them, I guess. – eyquem Mar 02 '11 at 22:27
@John Machin Or it could be a team of scientific aliens who chose my house to perform one of the tests of their Mental Devalidator Laser evaluation campaign, which they have planned in preparation for their great future invasion through experiments in temporary brain softening and blackout. – eyquem Mar 02 '11 at 22:28
@John Machin Or more plausibly, it is the Almighty Who is behind this, because He wants me to stop giving so much time to the Goddess of Programming and He sees I am so close to being converted by the seductive sermons of the XML Parser's priests. I had noticed several weird phenomena since my rep counter passed 666. Indeed this problem was my doom. – eyquem Mar 02 '11 at 22:28
@John Machin By the way, excuse me for not having found it strange to believe that you had posted an inexact statement. I didn't know that my brain could go out on little refreshing trips and then come back. – eyquem Mar 02 '11 at 22:29
@eyquem: You should change your handle to `Verbositor` :) I look forward to your comments on the latest update to my answer. – John Machin Mar 02 '11 at 22:38
@John Machin And you could take `Picador` – eyquem Mar 02 '11 at 22:50
You guys made my day, my birthday was last week but I'll take your comments as my gift! Thanks to both, it was a really interesting debate, I learned a lot – matta Mar 03 '11 at 06:16

eyquem · Answer 6 · 2011-03-04T00:48:33.310

Here is what you were looking for, John Machin: the sequel to our serial. I verified that this time my brain was in its correct place, and I continued to think about the problem.

So you have extended the demonstration code. Now, with your several example texts, it is clear to me that string methods are far from sufficient, and I UNDERSTAND why. I am very interested in knowing what lies underneath processes and in understanding the concrete reasons behind assertions.

So I studied the XML specification more than I ever had, and ran tests with the W3C's validator to improve my understanding of the details of the structure of an XML text. It's a rather austere occupation, but interesting all the same. I saw that the XML format is a mix of very strict rules and easygoing liberties.

From the tricks you used in your examples to tear my code to pieces, I conclude that the XML format doesn't require the text to be divided into lines. In fact, as the W3C's validator showed me, the characters \n, \r and \t can appear at many positions in an XML text, provided that they don't break a structural rule.

For example, they are allowed without any restriction between tags: as a consequence, an element may occupy several lines. Even tags can be split across several lines, or padded with tabs \t, provided that the whitespace occurs after the tag's name. Nor is there any requirement for the lines of an XML text to be indented as I had always seen them: I understand now that indentation is only a convenience chosen for ease of reading and logical comprehension.
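
A small check of these whitespace rules, as a sketch assuming Python 3 and the standard `xml.etree.ElementTree` module (the sample strings are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Whitespace (\n, \r, \t, space) after the tag name is accepted,
# in start-tags and end-tags alike.
doc = "<root\t\n><foo\n\n>bar</foo\t \r></root\n>"
root = ET.fromstring(doc)
print(root.tag)                    # root
print(root[0].tag, root[0].text)   # foo bar

# Whitespace between '<' and the tag name is NOT accepted.
try:
    ET.fromstring("< root/>")
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)                    # True
```

This is why a line-based search cannot assume that a tag sits on a single line.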

Well, you know all that better than I do, John Machin. Thanks to you, I am now alert to the complexity of the XML format, and I better understand the reasons that make parsing unrealistic by means other than specialized parsers. I wonder, incidentally, whether ordinary coders are aware of this awkwardness of the XML format: the possibility of \n characters appearing here and there in an XML text.

Anyway, since I have been in this conceptual melting pot for a while now, I continued to search for a solution to your whac_moles, John Machin, as an instructive game.

String methods being out of the game, I polished my regex.

I know, I know: you'll tell me that analyzing an XML text can't be done even with a regex. Now that I know better why, I agree. But I don't claim to parse an XML text: my regex won't extract any part of an XML tree, it will only search a small chunk of text. For the problem posed by the OP, I consider the use of a regex non-heretical.

From the beginning, I have thought that searching for the end-tag of the root is easier and more natural, because an end-tag has no attributes and there is less "noise" around it than around the root's start-tag.

So my solution is now:

~~ open the XML file

~~ move the file's pointer to the position -200 from the end

~~ read the 200 last characters of the file

~~ here, two strategies:

either remove only the comments and then search for the tag with a regex that takes the characters \n, \r, \t into account

or remove the comments and all the characters \n, \r, \t before searching for the tag with a simpler regex

The bigger the file, the faster this algorithm is compared with parse or iterparse. I wrote the following code and examined all its results. The first strategy is the faster one.
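
The steps above, with the first strategy, could be condensed into one function as follows. This is only a sketch assuming Python 3, where a text-mode file refuses a nonzero seek relative to the end, hence the binary mode. The name `last_end_tag`, the `tail_size` default and the UTF-8 decoding are assumptions, and the result is the root's end-tag only if the file is well-formed and no comment straddles the start of the tail window.

```python
import os
import re

# Comments are removed first so that tags inside them are ignored.
COMMENT_RE = re.compile(rb"<!--.*?-->", re.DOTALL)
# An end-tag: '</', the name, optional whitespace (\n, \r, \t, space), '>'.
END_TAG_RE = re.compile(rb"</([^\s>]+)\s*>")


def last_end_tag(path, tail_size=200):
    """Return the name of the last end-tag found in the file's tail, or None."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Seek from the start so files shorter than tail_size also work.
        f.seek(max(0, size - tail_size))
        tail = f.read()
    tail = COMMENT_RE.sub(b"", tail)
    names = END_TAG_RE.findall(tail)
    # UTF-8 is an assumption; the real encoding is declared in the XML prolog.
    return names[-1].decode("utf-8", "replace") if names else None
```

Unlike a parser, this returns the bare name as written in the file, including any namespace prefix.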

# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re,urllib

xml5 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root\t
\r\t\r \r
><foo

>bar</foo\t \r></root
>
"""

xml6 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo
>bar</foo\n\t   \t></root \t
\r>
<!--  \r   \t
That's all, folks!

\t-->
"""

xml7 = '''<?xml version="1.0" ?>
<!-- <mole1> -->  
<root><foo

\t\t\r\r\t/></root  \t
>  
<!-- </mole2>\t \r
 \r-->
<!---->
'''

xml8 = '''<?xml version="1.0" ?><!-- \r<mole1> --><root>  \t\t<foo \t\r\r/></root>\t<!-- </mole2> -->'''


sock = urllib.urlopen('http://www.cafeconleche.org/books/bible/examples/18/18-4.xsl')
xml9 = sock.read()
sock.close()


def rp(x):
    return  '\\r' if x.group()=='\r' else '\\t'

for xml_text in (xml5, xml6, xml7, xml8, xml9):

    print '\\n\n'.join(re.sub('\r|\t',rp,xml_text).split('\n'))
    print '-----------------------------'

    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE11 = '(?<=</)[^ >]+(?= *>)(?!.*</[^>]+>)' # with assertions   # ^
    m  = re.search(RE11, xml_text_noc,re.DOTALL)
    print "***  eyquem 11: " + repr(m.group() if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE12 = '</([^ >]+) *>(?!.*</[^>]+>)'  # with group(1)   # ^
    m  = re.search(RE12, xml_text_noc,re.DOTALL)
    print "***  eyquem 12: " + repr(m.group(1) if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE13 = '</[^ >]+ *>(?!.*</[^>]+>)' # without group(1)   # ^
    m  = re.search(RE13, xml_text_noc,re.DOTALL)
    print "***  eyquem 13: " + repr(m.group()[2:-1].rstrip() if m else "FAIL")



    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE14 = '(?<=</)[^ \n\r\t>]+(?=[ \n\r\t]*>)(?!.*</[^>]+>)' # with assertions  # ^
    m  = re.search(RE14, xml_text_noc,re.DOTALL)
    print "***  eyquem 14: " + repr(m.group() if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE15 = '</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)'  # with group(1)   # <
    m  = re.search(RE15, xml_text_noc,re.DOTALL)
    print "***  eyquem 15: " + repr(m.group(1).rstrip() if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE16 = '</[^ \n\r\t>]+[ \n\r\t]*>(?!.*</[^>]+>)' # without group(1)   # <
    m  = re.search(RE16, xml_text_noc,re.DOTALL)
    print "***  eyquem 16: " + repr(m.group()[2:-1].rstrip() if m else "FAIL")



    print
    filelike_obj = StringIO(xml_text)
    tree = et.parse(filelike_obj)
    print "***      parse:  " + tree.getroot().tag

    filelike_obj = StringIO(xml_text)
    for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
        print "***  iterparse:  " + elem.tag
        break


    print '\n============================================='

Result

<?xml version="1.0" ?> \n
<!--  this is a comment --> \n
<root\t\n
\r\t\r \r\n
><foo\n
\n
>bar</foo\t \r></root\n
>\n

-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0" ?> \n
<!--  this is a comment --> \n
<root\n
><foo\n
>bar</foo\n
\t   \t></root \t\n
\r>\n
<!--  \r   \t\n
That's all, folks!\n
\n
\t-->\n

-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0" ?>\n
<!-- <mole1> -->  \n
<root><foo\n
\n
\t\t\r\r\t/></root  \t\n
>  \n
<!-- </mole2>\t\n
-->\n
<!---->\n

-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0" ?><!-- \r<mole1> --><root>  \t\t<foo \t\r\r/></root>\t<!-- </mole2> -->
-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0"?>\r\n
<stylesheet\r\n
  xmlns="http://www.w3.org/XSL/Transform/1.0"\r\n
  xmlns:fo="http://www.w3.org/XSL/Format/1.0"\r\n
  result-ns="fo">\r\n
\r\n
  <template match="/">\r\n
    <fo:root xmlns:fo="http://www.w3.org/XSL/Format/1.0">\r\n
\r\n
      <fo:layout-master-set>\r\n
        <fo:simple-page-master page-master-name="only">\r\n
          <fo:region-body/>\r\n
        </fo:simple-page-master>\r\n
      </fo:layout-master-set>\r\n
\r\n
      <fo:page-sequence>\r\n
\r\n
       <fo:sequence-specification>\r\n
        <fo:sequence-specifier-single page-master-name="only"/>\r\n
       </fo:sequence-specification>\r\n
        \r\n
        <fo:flow>\r\n
          <apply-templates select="//ATOM"/>\r\n
        </fo:flow>\r\n
        \r\n
      </fo:page-sequence>\r\n
\r\n
    </fo:root>\r\n
  </template>\r\n
\r\n
  <template match="ATOM">\r\n
    <fo:block font-size="20pt" font-family="serif">\r\n
      <value-of select="NAME"/>\r\n
    </fo:block>\r\n
  </template>\r\n
\r\n
</stylesheet>\r\n

-----------------------------
***  eyquem 11: 'stylesheet'
***  eyquem 12: 'stylesheet'
***  eyquem 13: 'stylesheet'
***  eyquem 14: 'stylesheet'
***  eyquem 15: 'stylesheet'
***  eyquem 16: 'stylesheet'

***      parse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet
***  iterparse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet

=============================================
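
Note the difference in the last result: parse and iterparse report the tag in ElementTree's `{namespace}name` form, while the regexes return the bare name. To compare the two, the tag can be split, for instance with this small sketch (Python 3; the helper name `split_tag` is hypothetical):

```python
def split_tag(tag):
    """Split an ElementTree tag into (namespace_uri, local_name).

    Tags without a namespace yield (None, tag).
    """
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return None, tag


print(split_tag("{http://www.w3.org/XSL/Transform/1.0}stylesheet"))
# ('http://www.w3.org/XSL/Transform/1.0', 'stylesheet')
print(split_tag("nmaprun"))
# (None, 'nmaprun')
```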

The following code measures the execution times:

# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re
import urllib
from time import clock

sock = urllib.urlopen('http://www.cafeconleche.org/books/bible/examples/18/18-4.xsl')
ch = sock.read()
sock.close()

# the following lines are intended to insert additional lines
# into the XML text before its recording in a file, in order to
# obtain a real file to use, containing an XML text 
# long enough to observe easily the timing's differences

li = ch.splitlines(True)[0:6] + 30*ch.splitlines(True)[6:-2] + ch.splitlines(True)[-2:]

with open('xml_example.xml','w') as f:
    f.write(''.join(li))

print 'length of XML text in a file : ',len(''.join(li)),'\n'



# timings

P,I,A,B,C,D,E,F = [],[],[],[],[],[],[],[],


n = 50

for cnt in xrange(50):

    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as filelike_obj:
            tree = et.parse(filelike_obj)
            res_parse = tree.getroot().tag
    P.append( clock()-te)

    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as filelike_obj:
            for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
                res_iterparse = elem.tag
                break
    I.append(  clock()-te)


    RE11 = '(?<=</)[^ >]+(?= *>)(?!.*</[^>]+>)' # with assertions   # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m  = re.search(RE11, xml_text_noc,re.DOTALL)
            res_eyq11 = m.group() if m else "FAIL"
    A.append(  clock()-te)


    RE12 = '</([^ >]+) *>(?!.*</[^>]+>)'  # with group(1)   # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m  = re.search(RE12, xml_text_noc,re.DOTALL)
            res_eyq12 = m.group(1) if m else "FAIL"
    B.append(  clock()-te)


    RE13 = '</[^ >]+ *>(?!.*</[^>]+>)' # without group(1)   # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m  = re.search(RE13, xml_text_noc,re.DOTALL)
            res_eyq13 = m.group()[2:-1] if m else "FAIL"
    C.append(  clock()-te)



    RE14 = '(?<=</)[^ \n\r\t>]+(?=[ \n\r\t]*>)(?!.*</[^>]+>)' # with assertions  # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m  = re.search(RE14, xml_text_noc,re.DOTALL)
            res_eyq14 = m.group() if m else "FAIL"
    D.append(  clock()-te)


    RE15 = '</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)'  # with group(1)   # <
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m  = re.search(RE15, xml_text_noc,re.DOTALL)
            res_eyq15 = m.group(1) if m else "FAIL"
    E.append(  clock()-te)


    RE16 = '</[^ \n\r\t>]+[ \n\r\t]*>(?!.*</[^>]+>)' # without group(1)   # <
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m  = re.search(RE16, xml_text_noc,re.DOTALL)
            res_eyq16 = m.group()[2:-1].rstrip() if m else "FAIL"
    F.append(  clock()-te)


print "***      parse:  " + res_parse, '  parse'
print "***  iterparse:  " + res_iterparse, '  iterparse'
print
print "***  eyquem 11:  " + repr(res_eyq11)
print "***  eyquem 12:  " + repr(res_eyq12)
print "***  eyquem 13:  " + repr(res_eyq13)
print "***  eyquem 14:  " + repr(res_eyq14)
print "***  eyquem 15:  " + repr(res_eyq15)
print "***  eyquem 16:  " + repr(res_eyq16)

print
print str(min(P))
print str(min(I))
print
print '\n'.join(str(u) for u in map(min,(A,B,C)))
print
print '\n'.join(str(u) for u in map(min,(D,E,F)))

Result:

length of XML text in a file :  22548 

***      parse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet   parse
***  iterparse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet   iterparse

***  eyquem 11:  'stylesheet'
***  eyquem 12:  'stylesheet'
***  eyquem 13:  'stylesheet'
***  eyquem 14:  'stylesheet'
***  eyquem 15:  'stylesheet'
***  eyquem 16:  'stylesheet'

0.220554691169
0.172240771802

0.0273236743636
0.0266525536625
0.0265308269626

0.0246300539733
0.0241203758299
0.0238024015203
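
For readers on Python 3, where `time.clock`, `xrange` and `cStringIO` are no longer available, the min-of-repeats pattern used above can be sketched with `timeit`. This is only a rough sketch: the document built here is a made-up stand-in, not the XSL file behind the figures above, and the helper names are hypothetical, so absolute numbers will differ.

```python
import io
import timeit
import xml.etree.ElementTree as et

# Hypothetical stand-in document: one root element with many children.
XML_BYTES = (
    b'<?xml version="1.0"?>\n<report>\n'
    + b'  <item>value</item>\n' * 5000
    + b'</report>\n'
)

def root_via_parse(data):
    """Build the whole tree, then read the root tag."""
    return et.parse(io.BytesIO(data)).getroot().tag

def root_via_iterparse(data):
    """Stop at the first 'start' event, which is the root element."""
    for _event, elem in et.iterparse(io.BytesIO(data), events=('start',)):
        return elem.tag

def best_time(func, data, number=20, repeat=5):
    """Minimum over several repeats, as min(P) and min(I) do above."""
    return min(timeit.repeat(lambda: func(data), number=number, repeat=repeat))

if __name__ == '__main__':
    print('parse    :', best_time(root_via_parse, XML_BYTES))
    print('iterparse:', best_time(root_via_iterparse, XML_BYTES))
```

Taking the minimum rather than the mean follows the listing above: it discards runs that were slowed down by unrelated activity on the machine.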

.

Considering your simple need, Aereal, I think you won't mind getting the root's end-tag with possible \r, \n or \t characters in it, rather than its name alone. So the best solution for you is, in my opinion:

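import re

def get_root_tag_from_xml_file(xml_file_path):
    with open(xml_file_path) as f:
        try:
            f.seek(-200, 2)   # jump to the last 200 bytes of the file
        except IOError:
            f.seek(0, 0)      # file shorter than 200 bytes: read it all
        # strip the comments from the text that was read
        xml_text_noc = re.sub('<!--.*?-->', '', f.read(), flags=re.DOTALL)
    # last end-tag that is not followed by any other end-tag
    m = re.search('</[^>]+>(?!.*</[^>]+>)', xml_text_noc, re.DOTALL)
    return m.group() if m else 'FAIL'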

Thanks to the expertise of John Machin, this solution does a more reliable job than my previous one. In addition, it answers exactly what was asked: no parsing, hence a faster method, which was the implicit aim.
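
One cheap way to probe how far the tail-reading approach holds is to run it next to the parser on a few small inputs. The sketch below is written for Python 3 and rests on assumptions: `root_from_tail` is a hypothetical in-memory variant of the function above (it takes bytes rather than a path and returns None rather than 'FAIL'), and the sample documents are invented for illustration.

```python
import io
import re
import xml.etree.ElementTree as et

END_TAG = re.compile(rb'</[^>]+>(?!.*</[^>]+>)', re.DOTALL)
COMMENT = re.compile(rb'<!--.*?-->', re.DOTALL)

def root_from_tail(data, window=200):
    """Last end-tag found in the final `window` bytes, comments removed."""
    tail = COMMENT.sub(b'', data[-window:])
    m = END_TAG.search(tail)
    return m.group().decode('ascii') if m else None

def root_from_parser(data):
    """Tag of the first element the parser opens, i.e. the root."""
    for _event, elem in et.iterparse(io.BytesIO(data), events=('start',)):
        return elem.tag

plain = b'<nmaprun><host/></nmaprun>\n'
# A trailing comment longer than the window pushes the end-tag out of view.
long_comment = b'<nmaprun><host/></nmaprun>\n<!-- ' + b'x' * 300 + b' -->\n'

for doc in (plain, long_comment):
    print(root_from_tail(doc), root_from_parser(doc))
```

On these two inputs the tail reader returns `</nmaprun>` for the first document and None for the second, while the parser reports `nmaprun` for both. Whether that matters depends on how long the trailing comments in real reports can get.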

.

John Machin, will you find another tricky feature of the XML format that invalidates this solution?

"I incidentally wonder if common coders are aware of this awkwardness of XML format: the possibility of \n characters present here and there in an XML text." I hope they are, because the very first thing that comes to my mind when thinking about parsing is the worst case for an xml file. — matta, Mar 05 '11 at 01:48
