With a file named 'foo.html' containing:
<body>
<tag>Blah blah blah</tag>
<tag>**Catalina 320**</tag>
<tag>Blah<tag>
<td>**Catalina 320**</td>
</tag>Blah Blah </tag>
<tag>**These boats** are fully booked for the day</tag>
<tag>Blah blah blah</tag>
<tag>Catalina 320</tag>
<tag>Catalina 320</tag>
</body>
Code:
from time import clock
n = 1000
########################################################################
import lxml.etree as ET
from lxml.etree import XMLParser
# recover=True lets the parser work through the malformed markup in foo.html
parser = XMLParser(ns_clean=True, recover=True)
etree = ET.parse('foo.html', parser)
te = clock()
for i in xrange(n):
    resultsArray = []
    for thing in etree.findall("//"):
        if "These boats" in thing.text:
            break
        elif "Catalina 320" in thing.text:
            resultsArray.append(ET.tostring(thing).strip())
tf = clock()
print 'Solution with lxml'
print tf-te,'\n',resultsArray
########################################################################
with open('foo.html') as f:
    text = f.read()
import re
print '\n\n----------------------------------'
# each 'Catalina 320' is captured; the optional tail swallows everything from 'These boats' to the end
rigx = re.compile('(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?',re.DOTALL)
te = clock()
for i in xrange(n):
    yi = rigx.findall(text)
tf = clock()
print 'Solution 1 with a regex'
print tf-te,'\n',yi
print '\n----------------------------------'
ragx = re.compile('(Catalina 320)|(These boats)')
te = clock()
for i in xrange(n):
    li = []
    for mat in ragx.finditer(text):
        if mat.group(2):  # reached 'These boats': stop collecting
            break
        else:
            li.append(mat.group(1))
tf = clock()
print 'Solution 2 with a regex, similar to solution with lxml'
print tf-te,'\n',li
print '\n----------------------------------'
regx = re.compile('(Catalina 320)')
te = clock()
for i in xrange(n):
    # search only the part of the text that precedes 'These boats'
    ye = regx.findall(text, 0, text.find('These boats') if 'These boats' in text else len(text))
tf = clock()
print 'Solution 3 with a regex'
print tf-te,'\n',ye
Result:
Solution with lxml
0.30324105438
['<tag>**Catalina 320**</tag>', '<td>**Catalina 320**</td>']
----------------------------------
Solution 1 with a regex
0.0245033935877
['Catalina 320', 'Catalina 320']
----------------------------------
Solution 2 with a regex, similar to solution with lxml
0.0233258696287
['Catalina 320', 'Catalina 320']
----------------------------------
Solution 3 with a regex
0.00784708671074
['Catalina 320', 'Catalina 320']
What is wrong with my regex solutions?
Times, relative to the lxml solution:
lxml - 100 %
solution 1 - 8.1 %
solution 2 - 7.7 %
solution 3 - 2.6 %
Using a regex doesn't require the text to be valid XML or HTML.
So, what arguments remain for claiming that regexes are inferior to lxml for this problem?
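For readers on Python 3, where time.clock was removed (in 3.8) and xrange and the print statement no longer exist, the two simplest regex solutions could be re-timed roughly as follows. This is only a sketch under assumptions: the sample text is inlined in a TEXT constant instead of being read from foo.html, and the function names solution_2 and solution_3 are mine, chosen to mirror the numbering above. Absolute timings will differ from the figures reported here.

```python
import re
import timeit

# Assumption: the content of foo.html from the post, inlined so the sketch
# runs without the file being present.
TEXT = """<body>
<tag>Blah blah blah</tag>
<tag>**Catalina 320**</tag>
<tag>Blah<tag>
<td>**Catalina 320**</td>
</tag>Blah Blah </tag>
<tag>**These boats** are fully booked for the day</tag>
<tag>Blah blah blah</tag>
<tag>Catalina 320</tag>
<tag>Catalina 320</tag>
</body>
"""

RAGX = re.compile(r'(Catalina 320)|(These boats)')
REGX = re.compile(r'(Catalina 320)')


def solution_2(text):
    """Scan for either phrase and stop at the first 'These boats'."""
    found = []
    for mat in RAGX.finditer(text):
        if mat.group(2):
            break
        found.append(mat.group(1))
    return found


def solution_3(text):
    """Search only the part of the text that precedes 'These boats'."""
    end = text.find('These boats')
    if end == -1:
        end = len(text)
    return REGX.findall(text, 0, end)


if __name__ == '__main__':
    for func in (solution_2, solution_3):
        # timeit returns the total time in seconds for `number` calls
        seconds = timeit.timeit(lambda: func(TEXT), number=1000)
        print(func.__name__, seconds, func(TEXT))
```

Both functions return ['Catalina 320', 'Catalina 320'] on this sample, the same list as in the results above.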
EDIT 1
The solution with rigx = re.compile('(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?',re.DOTALL)
isn't good:
this regex will catch the occurrences of 'Catalina 320' located AFTER 'These boats' IF there are no occurrences of 'Catalina 320' BEFORE 'These boats'.
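The failure can be reproduced with a small sketch (Python 3 syntax). The two input strings are hypothetical and made up for illustration; only the pattern is taken from solution 1.

```python
import re

# Pattern of solution 1, unchanged apart from being written as a raw string.
RIGX = re.compile(r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?',
                  re.DOTALL)

# Hypothetical input: nothing to catch before 'These boats', two hits after it.
only_after = ("<tag>Blah</tag>\n"
              "<tag>These boats are fully booked for the day</tag>\n"
              "<tag>Catalina 320</tag>\n"
              "<tag>Catalina 320</tag>\n")

# Hypothetical input: one hit before 'These boats' and one after it.
before_and_after = ("<tag>Catalina 320</tag>\n"
                    "<tag>These boats are fully booked for the day</tag>\n"
                    "<tag>Catalina 320</tag>\n")

# The optional tail can only start right after a matched 'Catalina 320',
# so with no hit before 'These boats' nothing swallows the rest of the text.
print(RIGX.findall(only_after))        # ['Catalina 320', 'Catalina 320']  (unwanted)
print(RIGX.findall(before_and_after))  # ['Catalina 320']  (as intended)
```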
The pattern must be:
rigx = re.compile('(<tag>Catalina 320</tag>)(?:(?:.(?!<tag>Catalina 320</tag>))*These boats.*\Z)?|These boats.*\Z',re.DOTALL)
But this is a rather complicated pattern compared to the other solutions.
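A sketch of how the corrected pattern behaves (Python 3 syntax, hypothetical input strings). One detail to keep in mind: when the second alternative matches, the capturing group does not take part, so findall returns an empty string for that match. The filtering step and the helper name catalinas_before_marker are my additions, not part of the pattern above.

```python
import re

# Corrected pattern from above, written as a raw string.
RIGX = re.compile(
    r'(<tag>Catalina 320</tag>)'
    r'(?:(?:.(?!<tag>Catalina 320</tag>))*These boats.*\Z)?'
    r'|These boats.*\Z',
    re.DOTALL)


def catalinas_before_marker(text):
    """Return the captured tags, dropping the '' left by the second alternative."""
    return [hit for hit in RIGX.findall(text) if hit]


# Hypothetical input: hits only after 'These boats'.
only_after = ("<tag>Blah</tag>\n"
              "<tag>These boats are fully booked for the day</tag>\n"
              "<tag>Catalina 320</tag>\n")

# Hypothetical input: two hits before 'These boats' and one after it.
two_before = ("<tag>Catalina 320</tag>\n"
              "<tag>Catalina 320</tag>\n"
              "<tag>These boats are fully booked for the day</tag>\n"
              "<tag>Catalina 320</tag>\n")

print(RIGX.findall(only_after))             # ['']
print(catalinas_before_marker(only_after))  # []
print(catalinas_before_marker(two_before))
# ['<tag>Catalina 320</tag>', '<tag>Catalina 320</tag>']
```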