Here is what you were waiting for, John Machin: the next episode of our serial. I checked that my brain was in the right place this time, and I kept thinking about the problem.
So you have extended the demonstration code.
Now, with your several example texts, it is clear to me that string methods are far from sufficient, and I UNDERSTAND why. I like to know what goes on under the hood and to understand the concrete reasons behind assertions.
So I studied the XML specification more than I ever had, and ran tests with the W3C validator to improve my understanding of the details of the structure of an XML text. It is rather austere work, but interesting. I saw that the XML format is a mix of very strict rules and easygoing liberties.
From the tricks you used in your examples to tear my code to pieces, I conclude that the XML format doesn't require the text to be divided into lines. In fact, as the W3C validator showed me, the characters \n, \r and \t can appear at many positions in an XML text, provided that they don't break a structural rule.
For example, they are allowed without any restriction between tags: as a consequence, an element may span several lines. Tags themselves can also be split across several lines, or contain several tabs \t, provided that these characters occur after the tag's name. Nor is there any requirement for the lines of an XML text to be indented as I had always seen them: I understand now that it's only a convention chosen for ease of reading and logical comprehension.
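The observation above can be checked with a real parser. The following is only a minimal sketch: the sample strings are made up for illustration, and it uses the standard library's xml.etree.ElementTree rather than the W3C validator mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: whitespace (\t, \r, \n, spaces) placed AFTER the
# tag names, inside the tags themselves, and between tags.
well_formed = '<?xml version="1.0" ?>\n<root\t\r\n \n><foo\n>bar</foo\t \n></root\n>\n'
root = ET.fromstring(well_formed)
print(root.tag)       # name of the root element
print(root[0].text)   # text of the <foo> child

# Whitespace BEFORE the tag name is a different story: the parser rejects it.
try:
    ET.fromstring('< root></root>')
    accepted = True
except ET.ParseError:
    accepted = False
print(accepted)
```

The first text parses although no tag sits on a single line; the second one is refused, which matches the condition that the whitespace must come after the tag's name.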
Well, you know all that better than I do, John Machin. Thanks to you, I am now alert to the complexity of the XML format and I better understand why parsing it by anything other than a specialized parser is unrealistic. Incidentally, I wonder whether ordinary coders are aware of this awkwardness of the XML format: the possibility of \n characters appearing here and there in an XML text.
.
Anyway, since I have been in this conceptual melting pot for a while now, I kept searching for a solution to your whac_moles, John Machin, as an instructive game.
String methods being out of the game, I polished my regex.
I know, I know: you'll tell me that analyzing an XML text can't be done even with a regex. Now that I understand why, I agree. But I don't claim to parse an XML text: my regex won't extract any part of an XML tree, it will only search for a small chunk of text. For the problem the OP asked about, I consider the use of a regex non-heretical.
.
From the beginning, I have thought that searching for the end-tag of the root is easier and more natural, because an end-tag has no attributes and there is less "noise" around it than around the root's start-tag.
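To make the "less noise" argument concrete, here is a small sketch on a made-up text. The sample string is an assumption for illustration; as far as I know, the XML grammar permits a literal > inside an attribute value, which is exactly the kind of noise an end-tag cannot contain.

```python
import re

# Hypothetical sample: the start-tag carries attributes (one value even
# contains '>'), while the end-tag holds only a name and optional whitespace.
text = '<root a="1" note="x>y"><foo>bar</foo></root \n>'

# An end-tag name, optional trailing whitespace, '>', and then the
# negative lookahead: no other end-tag may follow anywhere in the text.
LAST_END_TAG = re.compile(r'</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)', re.DOTALL)

m = LAST_END_TAG.search(text)
root_name = m.group(1) if m else None
print(root_name)
```

The lookahead is what rejects `</foo>`: another end-tag follows it, so only the last end-tag of the text can match.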
So my solution is now:
~~ open the XML file
~~ move the file pointer to position -200 from the end
~~ read the last 200 characters of the file
~~ then, two strategies:
- either remove only the comments, then search for the tag with a regex that takes the characters \n, \r, \t into account
- or remove the comments and all the characters \n, \r, \t before searching for the tag with a simpler regex
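The steps above can be sketched as follows. This is only an illustration under stated assumptions: the helper names (read_tail, root_name_keep_whitespace, root_name_strip_whitespace), the sample text and the temporary file are hypothetical, the file is read in binary mode, and the tag name is assumed to be ASCII.

```python
import os
import re
import tempfile

COMMENT = re.compile(br'<!--.*?-->', re.DOTALL)
COMMENT_OR_WS = re.compile(br'<!--.*?-->|[\n\r\t]', re.DOTALL)
# Strategy 1 regex: tolerates \n, \r, \t and spaces before the closing '>'.
END_TAG_WS = re.compile(br'</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)', re.DOTALL)
# Strategy 2 regex: simpler, because \n, \r, \t were removed beforehand.
END_TAG_SIMPLE = re.compile(br'</([^ >]+) *>(?!.*</[^>]+>)', re.DOTALL)

def read_tail(path, size=200):
    """Return at most the last `size` bytes of the file."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        f.seek(max(f.tell() - size, 0))
        return f.read()

def root_name_keep_whitespace(tail):
    """Strategy 1: remove only the comments."""
    m = END_TAG_WS.search(COMMENT.sub(b'', tail))
    return m.group(1).decode('ascii') if m else None

def root_name_strip_whitespace(tail):
    """Strategy 2: remove the comments and every \\n, \\r, \\t."""
    m = END_TAG_SIMPLE.search(COMMENT_OR_WS.sub(b'', tail))
    return m.group(1).decode('ascii') if m else None

# Hypothetical sample written to a temporary file for the demonstration.
sample = (b'<?xml version="1.0" ?>\n<root\n><foo\n>bar</foo\n\t \t>'
          b'</root \t\n\r>\n<!-- </mole2> \r\n-->\n')
with tempfile.NamedTemporaryFile(suffix='.xml', delete=False) as tmp:
    tmp.write(sample)
try:
    tail = read_tail(tmp.name)
    first = root_name_keep_whitespace(tail)
    second = root_name_strip_whitespace(tail)
finally:
    os.remove(tmp.name)
print(first, second)
```

Both helpers only ever look at the tail of the file, so their cost does not grow with the size of the document.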
The bigger the file, the faster this algorithm is compared to using parse or iterparse. I wrote the following code and examined all its results. The first strategy is the faster one.
# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re,urllib
xml5 = """\
<?xml version="1.0" ?>
<!-- this is a comment -->
<root\t
\r\t\r \r
><foo
>bar</foo\t \r></root
>
"""
xml6 = """\
<?xml version="1.0" ?>
<!-- this is a comment -->
<root
><foo
>bar</foo\n\t \t></root \t
\r>
<!-- \r \t
That's all, folks!
\t-->
"""
xml7 = '''<?xml version="1.0" ?>
<!-- <mole1> -->
<root><foo
\t\t\r\r\t/></root \t
>
<!-- </mole2>\t \r
\r-->
<!---->
'''
xml8 = '''<?xml version="1.0" ?><!-- \r<mole1> --><root> \t\t<foo \t\r\r/></root>\t<!-- </mole2> -->'''
sock = urllib.urlopen('http://www.cafeconleche.org/books/bible/examples/18/18-4.xsl')
xml9 = sock.read()
sock.close()
def rp(x):
    return '\\r' if x.group()=='\r' else '\\t'
for xml_text in (xml5, xml6, xml7, xml8, xml9):
    print '\\n\n'.join(re.sub('\r|\t',rp,xml_text).split('\n'))
    print '-----------------------------'
    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE11 = '(?<=</)[^ >]+(?= *>)(?!.*</[^>]+>)' # with assertions # ^
    m = re.search(RE11, xml_text_noc,re.DOTALL)
    print "*** eyquem 11: " + repr(m.group() if m else "FAIL")
    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE12 = '</([^ >]+) *>(?!.*</[^>]+>)' # with group(1) # ^
    m = re.search(RE12, xml_text_noc,re.DOTALL)
    print "*** eyquem 12: " + repr(m.group(1) if m else "FAIL")
    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE13 = '</[^ >]+ *>(?!.*</[^>]+>)' # without group(1) # ^
    m = re.search(RE13, xml_text_noc,re.DOTALL)
    print "*** eyquem 13: " + repr(m.group()[2:-1].rstrip() if m else "FAIL")
    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE14 = '(?<=</)[^ \n\r\t>]+(?=[ \n\r\t]*>)(?!.*</[^>]+>)' # with assertions # ^
    m = re.search(RE14, xml_text_noc,re.DOTALL)
    print "*** eyquem 14: " + repr(m.group() if m else "FAIL")
    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE15 = '</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)' # with group(1) # <
    m = re.search(RE15, xml_text_noc,re.DOTALL)
    print "*** eyquem 15: " + repr(m.group(1).rstrip() if m else "FAIL")
    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE16 = '</[^ \n\r\t>]+[ \n\r\t]*>(?!.*</[^>]+>)' # without group(1) # <
    m = re.search(RE16, xml_text_noc,re.DOTALL)
    print "*** eyquem 16: " + repr(m.group()[2:-1].rstrip() if m else "FAIL")
    print
    filelike_obj = StringIO(xml_text)
    tree = et.parse(filelike_obj)
    print "*** parse: " + tree.getroot().tag
    filelike_obj = StringIO(xml_text)
    for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
        print "*** iterparse: " + elem.tag
        break
    print '\n============================================='
Result
<?xml version="1.0" ?> \n
<!-- this is a comment --> \n
<root\t\n
\r\t\r \r\n
><foo\n
\n
>bar</foo\t \r></root\n
>\n
-----------------------------
*** eyquem 11: 'root'
*** eyquem 12: 'root'
*** eyquem 13: 'root'
*** eyquem 14: 'root'
*** eyquem 15: 'root'
*** eyquem 16: 'root'
*** parse: root
*** iterparse: root
=============================================
<?xml version="1.0" ?> \n
<!-- this is a comment --> \n
<root\n
><foo\n
>bar</foo\n
\t \t></root \t\n
\r>\n
<!-- \r \t\n
That's all, folks!\n
\n
\t-->\n
-----------------------------
*** eyquem 11: 'root'
*** eyquem 12: 'root'
*** eyquem 13: 'root'
*** eyquem 14: 'root'
*** eyquem 15: 'root'
*** eyquem 16: 'root'
*** parse: root
*** iterparse: root
=============================================
<?xml version="1.0" ?>\n
<!-- <mole1> --> \n
<root><foo\n
\n
\t\t\r\r\t/></root \t\n
> \n
<!-- </mole2>\t\n
-->\n
<!---->\n
-----------------------------
*** eyquem 11: 'root'
*** eyquem 12: 'root'
*** eyquem 13: 'root'
*** eyquem 14: 'root'
*** eyquem 15: 'root'
*** eyquem 16: 'root'
*** parse: root
*** iterparse: root
=============================================
<?xml version="1.0" ?><!-- \r<mole1> --><root> \t\t<foo \t\r\r/></root>\t<!-- </mole2> -->
-----------------------------
*** eyquem 11: 'root'
*** eyquem 12: 'root'
*** eyquem 13: 'root'
*** eyquem 14: 'root'
*** eyquem 15: 'root'
*** eyquem 16: 'root'
*** parse: root
*** iterparse: root
=============================================
<?xml version="1.0"?>\r\n
<stylesheet\r\n
xmlns="http://www.w3.org/XSL/Transform/1.0"\r\n
xmlns:fo="http://www.w3.org/XSL/Format/1.0"\r\n
result-ns="fo">\r\n
\r\n
<template match="/">\r\n
<fo:root xmlns:fo="http://www.w3.org/XSL/Format/1.0">\r\n
\r\n
<fo:layout-master-set>\r\n
<fo:simple-page-master page-master-name="only">\r\n
<fo:region-body/>\r\n
</fo:simple-page-master>\r\n
</fo:layout-master-set>\r\n
\r\n
<fo:page-sequence>\r\n
\r\n
<fo:sequence-specification>\r\n
<fo:sequence-specifier-single page-master-name="only"/>\r\n
</fo:sequence-specification>\r\n
\r\n
<fo:flow>\r\n
<apply-templates select="//ATOM"/>\r\n
</fo:flow>\r\n
\r\n
</fo:page-sequence>\r\n
\r\n
</fo:root>\r\n
</template>\r\n
\r\n
<template match="ATOM">\r\n
<fo:block font-size="20pt" font-family="serif">\r\n
<value-of select="NAME"/>\r\n
</fo:block>\r\n
</template>\r\n
\r\n
</stylesheet>\r\n
-----------------------------
*** eyquem 11: 'stylesheet'
*** eyquem 12: 'stylesheet'
*** eyquem 13: 'stylesheet'
*** eyquem 14: 'stylesheet'
*** eyquem 15: 'stylesheet'
*** eyquem 16: 'stylesheet'
*** parse: {http://www.w3.org/XSL/Transform/1.0}stylesheet
*** iterparse: {http://www.w3.org/XSL/Transform/1.0}stylesheet
=============================================
The following code measures the execution times:
# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re
import urllib
from time import clock
sock = urllib.urlopen('http://www.cafeconleche.org/books/bible/examples/18/18-4.xsl')
ch = sock.read()
sock.close()
# the following lines are intended to insert additional lines
# into the XML text before its recording in a file, in order to
# obtain a real file to use, containing an XML text
# long enough to observe easily the timing's differences
li = ch.splitlines(True)[0:6] + 30*ch.splitlines(True)[6:-2] + ch.splitlines(True)[-2:]
with open('xml_example.xml','w') as f:
f.write(''.join(li))
print 'length of XML text in a file : ',len(''.join(li)),'\n'
# timings
P,I,A,B,C,D,E,F = [],[],[],[],[],[],[],[],
n = 50
for cnt in xrange(50):
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as filelike_obj:
            tree = et.parse(filelike_obj)
            res_parse = tree.getroot().tag
    P.append( clock()-te)
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as filelike_obj:
            for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
                res_iterparse = elem.tag
                break
    I.append( clock()-te)
    RE11 = '(?<=</)[^ >]+(?= *>)(?!.*</[^>]+>)' # with assertions # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m = re.search(RE11, xml_text_noc,re.DOTALL)
            res_eyq11 = m.group() if m else "FAIL"
    A.append( clock()-te)
    RE12 = '</([^ >]+) *>(?!.*</[^>]+>)' # with group(1) # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m = re.search(RE12, xml_text_noc,re.DOTALL)
            res_eyq12 = m.group(1) if m else "FAIL"
    B.append( clock()-te)
    RE13 = '</[^ >]+ *>(?!.*</[^>]+>)' # without group(1) # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m = re.search(RE13, xml_text_noc,re.DOTALL)
            res_eyq13 = m.group()[2:-1] if m else "FAIL"
    C.append( clock()-te)
    RE14 = '(?<=</)[^ \n\r\t>]+(?=[ \n\r\t]*>)(?!.*</[^>]+>)' # with assertions # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m = re.search(RE14, xml_text_noc,re.DOTALL)
            res_eyq14 = m.group() if m else "FAIL"
    D.append( clock()-te)
    RE15 = '</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)' # with group(1) # <
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m = re.search(RE15, xml_text_noc,re.DOTALL)
            res_eyq15 = m.group(1) if m else "FAIL"
    E.append( clock()-te)
    RE16 = '</[^ \n\r\t>]+[ \n\r\t]*>(?!.*</[^>]+>)' # without group(1) # <
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m = re.search(RE16, xml_text_noc,re.DOTALL)
            res_eyq16 = m.group()[2:-1].rstrip() if m else "FAIL"
    F.append( clock()-te)
print "*** parse: " + res_parse, ' parse'
print "*** iterparse: " + res_iterparse, ' iterparse'
print
print "*** eyquem 11: " + repr(res_eyq11)
print "*** eyquem 12: " + repr(res_eyq12)
print "*** eyquem 13: " + repr(res_eyq13)
print "*** eyquem 14: " + repr(res_eyq14)
print "*** eyquem 15: " + repr(res_eyq15)
print "*** eyquem 16: " + repr(res_eyq16)
print
print str(min(P))
print str(min(I))
print
print '\n'.join(str(u) for u in map(min,(A,B,C)))
print
print '\n'.join(str(u) for u in map(min,(D,E,F)))
Result:
length of XML text in a file : 22548
*** parse: {http://www.w3.org/XSL/Transform/1.0}stylesheet parse
*** iterparse: {http://www.w3.org/XSL/Transform/1.0}stylesheet iterparse
*** eyquem 11: 'stylesheet'
*** eyquem 12: 'stylesheet'
*** eyquem 13: 'stylesheet'
*** eyquem 14: 'stylesheet'
*** eyquem 15: 'stylesheet'
*** eyquem 16: 'stylesheet'
0.220554691169
0.172240771802
0.0273236743636
0.0266525536625
0.0265308269626
0.0246300539733
0.0241203758299
0.0238024015203
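A note for readers on a recent Python: time.clock, used in the timing code above, was removed in Python 3.8. The same "minimum over several repeats" measurement can be sketched with the standard timeit module. This is only an illustration under assumptions: the sample text and the function name find_root_name are made up, and it times the comment removal and regex search on an in-memory string, not the file reading.

```python
import re
import timeit

# Hypothetical tail of an XML text, with a decoy end-tag inside a comment.
tail = '<a><b>x</b>\r\n</a \t\r\n>\r\n<!-- </mole> -->\r\n'

COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
END_TAG_WS = re.compile(r'</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)', re.DOTALL)

def find_root_name(text=tail):
    """Remove the comments, then search for the last end-tag."""
    m = END_TAG_WS.search(COMMENT.sub('', text))
    return m.group(1) if m else None

# 5 repeats of 1000 calls each; keep the best (smallest) total, as above.
timings = timeit.repeat(find_root_name, repeat=5, number=1000)
best = min(timings)
print(find_root_name(), best)
```

Taking the minimum rather than the mean follows the same reasoning as the code above: the smallest value is the one least disturbed by other activity on the machine.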
.
.
Considering your simple need, Aereal, I think you don't mind getting the root's end-tag with possible \r, \n, \t characters in it, instead of its name alone. So the best solution for you is, in my opinion:
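import re

def get_root_tag_from_xml_file(xml_file_path):
    with open(xml_file_path) as f:
        try:
            # jump to 200 characters before the end of the file
            f.seek(-200,2)
        except IOError:
            # the file is shorter than 200 characters: read it entirely
            f.seek(0,0)
        # remove the comments from the text read
        xml_text_noc = re.sub('<!--.*?-->','', f.read(), flags=re.DOTALL)
    try:
        # the last end-tag: no other end-tag may follow it
        return re.search('</[^>]+>(?!.*</[^>]+>)', xml_text_noc, re.DOTALL).group()
    except AttributeError:
        # re.search() returned None: no end-tag found
        return 'FAIL'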
Thanks to the expertise of John Machin, this solution does a more reliable job than my previous one; in addition, it answers exactly the request as it was expressed: no parsing, hence a faster method, which was the implicit aim.
.
John Machin, will you find a new tricky feature of the XML format that invalidates this solution?