How can I control results returned by Python's re.findall() on an html string?

Question

I'm trying to capture all instances of "Catalina 320" SO LONG as they occur before the "These boats" string (see generic sample below).

I have the code to capture ALL instances of "Catalina 320" but I can't figure out how to stop it at the "These boats" string.

resultsArray = re.findall(r'<tag>(Catalina 320)</tag>', string, re.DOTALL)

Can anyone help me solve this missing piece? I tried adding '.+These boats' but it didn't work.

Thanks- JD

  Blah blah blah
    <tag>**Catalina 320**</tag>
  Blah
    <td>**Catalina 320**</td>
  Blah Blah 
    <tag>**These boats** are fully booked for the day</tag>
  Blah blah blah
    <tag>Catalina 320</tag>
    <tag>Catalina 320</tag>

Are you suggesting that we will find a text literal ("Blah Blah") just after an element, or is that a mistake made while generalizing the question? – Mike Pennington, Jun 25 '11 at 19:00
It would be fair to answer Mike's question. I would also be interested to know whether your text really is SGML, because Mike's solution depends on that – eyquem, Jun 26 '11 at 11:03
Moreover, the elements before 'These boats' in your sample contain `**Catalina 320**`, while your regex pattern only contains `Catalina 320`. What exactly do you want to catch? Also, do you want to catch strings **preceding the string** 'These boats' wherever it occurs, **OR** strings **preceding the element** containing 'These boats'? If an element is `**Catalina 320** is one of These boats`, must the desired string lying before 'These boats' in this element be caught? – eyquem, Jun 26 '11 at 11:16

Mike Pennington · Answer 1 · 2011-06-26T11:30:48.733

3

You could solve this with a regular expression, but a regex isn't required given the way you stated the problem (see End Note 1).

You should use lxml to parse this...

import lxml.etree as ET
from lxml.etree import XMLParser

resultsArray = []
# recover=True lets the parser cope with malformed markup
parser = XMLParser(ns_clean=True, recover=True)
tree = ET.parse('foo.html', parser)   # See End Note 2
for elem in tree.findall("//"):
    text = elem.text or ""   # elem.text is None for elements without text
    if "These boats" in text:
        break   # stop at the first element mentioning "These boats"
    elif "Catalina 320" in text:
        resultsArray.append(ET.tostring(elem).strip())


print resultsArray

Executing this yields:

[mpenning@Bucksnort ~]$ python foo.py
['<tag>**Catalina 320**</tag>', '<td>**Catalina 320**</td>']
[mpenning@Bucksnort ~]$
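
For readers without lxml installed, the same "collect until the stop phrase" idea can be sketched with the standard library's html.parser. This is not the lxml code from this answer: the class name BoatCollector, the tidied sample string, and the (tag, text) result format are assumptions made for illustration, and text that sits directly after a closing tag is attributed to the most recently opened tag.

```python
from html.parser import HTMLParser


class BoatCollector(HTMLParser):
    """Collect text nodes containing `target` until `stop` is seen."""

    def __init__(self, target, stop):
        super().__init__()
        self.target = target
        self.stop = stop
        self.stopped = False
        self.current_tag = None
        self.results = []

    def handle_starttag(self, tag, attrs):
        # Remember the most recently opened tag for reporting.
        self.current_tag = tag

    def handle_data(self, data):
        if self.stopped:
            return
        if self.stop in data:
            # Everything after the stop phrase is ignored.
            self.stopped = True
        elif self.target in data:
            self.results.append((self.current_tag, data.strip()))


# Hypothetical, well-formed variant of the sample input.
sample = """
<body>
  <tag>Blah blah blah</tag>
  <tag>**Catalina 320**</tag>
  <td>**Catalina 320**</td>
  <tag>**These boats** are fully booked for the day</tag>
  <tag>Catalina 320</tag>
</body>
"""

collector = BoatCollector("Catalina 320", "These boats")
collector.feed(sample)
collector.close()
print(collector.results)
```

Like the lxml loop, this stops at the first text node that mentions the stop phrase, so later matches are never collected.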

End Notes:

1. The current version of your question doesn't have valid markup, but I assumed you have either XML or HTML (which is what you had in version 1 of the question). My answer can handle your text as written, but it makes more sense to assume some kind of structured markup, so I used the following input text, which I saved locally as foo.html:
```
     <body>
<tag>Blah blah blah</tag>
    <tag>**Catalina 320**</tag>
  <tag>Blah<tag>
    <td>**Catalina 320**</td>
  </tag>Blah Blah </tag>
    <tag>**These boats** are fully booked for the day</tag>
  <tag>Blah blah blah</tag>
    <tag>Catalina 320</tag>
    <tag>Catalina 320</tag>
    </body>
```
2. If you want to be a bit more careful about encoding issues, you can use lxml.html.soupparser as a fallback when parsing HTML with lxml:

from lxml.html import soupparser
# ...
try:
    parser = XMLParser(ns_clean=True, recover=True)
    tree = ET.parse('foo.html', parser)
except UnicodeDecodeError:
    tree = soupparser.parse('foo.html')

edited Jun 26 '11 at 11:30

answered Jun 25 '11 at 08:53

Mike Pennington


I would go one step further, and say that, at least with the information provided until now, using regular expressions is the **wrong** way of solving this problem. Using a parser, in this case, lxml or similar, would be the way to go... the web is full of literature like this one for good reason: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html – ashwoods Jun 25 '11 at 09:38
@ashwoods _"using regular expressions is the wrong way of solving this problem"_ Why? It would be interesting to have an explanation justifying this peremptory opinion – eyquem Jun 25 '11 at 16:08
@Mike Pennington Using **lxml** for this very simple problem is like using a helicopter to get to the opposite side of a street – eyquem Jun 25 '11 at 16:16
@eyequem, the OP is parsing the tag-delimited contents of an SGML-based markup language. This is *the primary reason* **anyone** should use `lxml` – Mike Pennington Jun 25 '11 at 16:58
@Mike Pennington Your solution catches the elements `'**Catalina 320**'` as well as the elements `'Catalina 320'` or `'Catalina 320 the best one'` or even `'Catalina 320 the best one of These boats'` – eyquem Jun 26 '11 at 10:56
@eyequem, think about the solution the OP accepted. It does the same thing as my answer, just without capturing tags... which the OP seems not to have cared about, even though his example was `resultsArray = re.findall(r'<tag>(Catalina 320)</tag>', string, re.DOTALL)` and his first post didn't have generalized `<tag>` elements in it, they were all `<td>` elements. Anyway he selected an answer, and it selects the same boats as my answer ;-) – Mike Pennington Jun 26 '11 at 11:07

Tugrul Ates · Accepted Answer · 2011-06-26T08:43:28.363

2

If there is no other context to your problem, you can just search before the first occurrence of 'These boats':

re.findall('Catalina 320', string.split('These boats')[0])
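
A minimal, self-contained sketch of the same idea follows. It uses str.partition in place of split (both leave the whole string in the first piece when the stop phrase is absent); the helper name find_before and the sample text, with the emphasis markers removed, are assumptions made for illustration.

```python
import re

# Hypothetical sample modelled on the question's text.
sample = """
  Blah blah blah
    <tag>Catalina 320</tag>
  Blah
    <td>Catalina 320</td>
  Blah Blah
    <tag>These boats are fully booked for the day</tag>
  Blah blah blah
    <tag>Catalina 320</tag>
    <tag>Catalina 320</tag>
"""


def find_before(text, target, stop):
    """Return every occurrence of `target` located before the first `stop`."""
    # partition() returns the whole text as `head` when `stop` is missing.
    head, _, _ = text.partition(stop)
    # re.escape() keeps the target literal even if it contains regex symbols.
    return re.findall(re.escape(target), head)


print(find_before(sample, "Catalina 320", "These boats"))
# → ['Catalina 320', 'Catalina 320']
```

If the stop phrase never occurs, every occurrence in the text is returned, which matches the behaviour of split('These boats')[0].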

edited Jun 26 '11 at 08:43

answered Jun 25 '11 at 07:16

Tugrul Ates


Brilliant! Thanks. very simple indeed. – jond Jun 26 '11 at 07:56

user815091 · Answer 3 · 2011-06-25T07:50:22.650

0

groups = re.findall(r'(Catalina 320)*.*These boats, r.read(), re.DOTALL)

The first group in groups will contain the list of Catalina 320 matches.

edited Jun 25 '11 at 07:50

answered Jun 25 '11 at 07:00

user815091


Many thanks. Sadly it's returning an empty list. (And I added the apostrophe after "boats"). – jond Jun 25 '11 at 07:18
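
The empty result reported in the comment above can be reproduced with a short check. The sample string is an assumption; the pattern is the one from this answer with the closing quote added. When the text does not start with 'Catalina 320', the repeated group matches zero times, `.*` swallows everything up to 'These boats', and re.findall reports the unmatched group as an empty string.

```python
import re

# Hypothetical one-line stand-in for the HTML being searched.
text = ("Blah <tag>Catalina 320</tag> Blah "
        "<tag>These boats</tag> <tag>Catalina 320</tag>")

groups = re.findall(r'(Catalina 320)*.*These boats', text, re.DOTALL)
print(groups)
# → ['']
```

There is a single overall match, and its only group did not take part in it, so the list holds one empty string instead of the boat names.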

score -1 · Answer 4 · edited Jun 20 '20 at 09:12

-1

With file of name 'foo.html' containing

     <body>
<tag>Blah blah blah</tag>
    <tag>**Catalina 320**</tag>
  <tag>Blah<tag>
    <td>**Catalina 320**</td>
  </tag>Blah Blah </tag>
    <tag>**These boats** are fully booked for the day</tag>
  <tag>Blah blah blah</tag>
    <tag>Catalina 320</tag>
    <tag>Catalina 320</tag>
    </body>

code:

from time import clock
n = 1000


########################################################################

import lxml.etree as ET
from lxml.etree import XMLParser

parser = XMLParser(ns_clean=True, recover=True)
etree = ET.parse('foo.html', parser)

te = clock()
for i in xrange(n):
    resultsArray = []
    for thing in etree.findall("//"):
        if "These boats" in thing.text:
            break
        elif "Catalina 320" in thing.text:
            resultsArray.append(ET.tostring(thing).strip())
tf = clock()

print 'Solution with lxml'
print tf-te,'\n',resultsArray


########################################################################

with open('foo.html') as f:
    text = f.read()
    
import re


print '\n\n----------------------------------'
rigx = re.compile(r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?', re.DOTALL)

te = clock()
for i in xrange(n):
    yi = rigx.findall(text)
tf = clock()

print 'Solution 1 with a regex'
print tf-te,'\n',yi


print '\n----------------------------------'

ragx = re.compile('(Catalina 320)|(These boats)')

te = clock()
for i in xrange(n):
    li = []
    for mat in ragx.finditer(text):
        if mat.group(2):
            break
        else:
            li.append(mat.group(1))
tf = clock()

print 'Solution 2 with a regex, similar to solution with lxml'
print tf-te,'\n',li


print '\n----------------------------------'

regx = re.compile('(Catalina 320)')

te = clock()
for i in xrange(n):
    ye = regx.findall(text, 0, text.find('These boats') if 'These boats' in text else len(text)) 
tf = clock()

print 'Solution 3 with a regex'
print tf-te,'\n',ye

result

Solution with lxml
0.30324105438 
['<tag>**Catalina 320**</tag>', '<td>**Catalina 320**</td>']


----------------------------------
Solution 1 with regex
0.0245033935877 
['Catalina 320', 'Catalina 320']

----------------------------------
Solution 2 with a regex, similar to solution with lxml
0.0233258696287
['Catalina 320', 'Catalina 320']

----------------------------------
Solution 3 with regex
0.00784708671074 
['Catalina 320', 'Catalina 320']

What is wrong with my regex solutions?

Times:

lxml - 100 %

solution 1 - 8.1 %

solution 2 - 7.7 %

solution 3 - 2.6 %
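
For anyone who wants to repeat the regex timings on a current Python, here is a hedged sketch using the standard timeit module. The in-memory text is an assumed stand-in for foo.html, the function names are illustrative, and no timing figures are claimed, since they depend on the machine.

```python
import re
import timeit

# Hypothetical stand-in for the contents of foo.html.
text = (
    "<tag>Catalina 320</tag> Blah <td>Catalina 320</td> "
    "<tag>These boats are fully booked</tag> <tag>Catalina 320</tag>"
)

both = re.compile(r'(Catalina 320)|(These boats)')
boat = re.compile(r'Catalina 320')


def solution_2():
    """Walk the matches in order and stop at the first 'These boats'."""
    found = []
    for match in both.finditer(text):
        if match.group(2):
            break
        found.append(match.group(1))
    return found


def solution_3():
    """Search only the slice of text before the first 'These boats'."""
    end = text.find('These boats')
    if end == -1:
        end = len(text)
    return boat.findall(text, 0, end)


for func in (solution_2, solution_3):
    seconds = timeit.timeit(func, number=1000)
    print(func.__name__, round(seconds, 6), func())
```

Both functions return the same two matches on this sample; only the measured times differ between runs.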

Using a regex doesn't require the text to be XML or HTML.

So, what arguments remain for claiming that regexes are inferior to lxml for this problem?

EDIT 1

The solution with rigx = re.compile('(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?',re.DOTALL) isn't good:

this regex will catch the occurrences of 'Catalina 320' situated AFTER 'These boats' IF there are no occurrences of 'Catalina 320' BEFORE 'These boats'.

The pattern must be:

rigx = re.compile('(<tag>Catalina 320</tag>)(?:(?:.(?!<tag>Catalina 320</tag>))*These boats.*\Z)?|These boats.*\Z',re.DOTALL)

But this is a rather complicated pattern compared to the other solutions.
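
The failure described in EDIT 1 can be checked with a small sketch. The two test strings are assumptions made for illustration. Note that with the revised pattern, the match made by the trailing `These boats.*\Z` alternative shows up in findall as an empty string, so the results are filtered here.

```python
import re

TAGGED = '<tag>Catalina 320</tag>'

# First pattern from this answer (flawed when nothing precedes 'These boats').
first_try = re.compile(
    r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?', re.DOTALL)

# Revised pattern from EDIT 1.
revised = re.compile(
    r'(<tag>Catalina 320</tag>)'
    r'(?:(?:.(?!<tag>Catalina 320</tag>))*These boats.*\Z)?'
    r'|These boats.*\Z', re.DOTALL)

# Hypothetical inputs: matches before the stop phrase, and none before it.
before = TAGGED + ' x ' + TAGGED + ' y These boats z ' + TAGGED
none_before = 'These boats z ' + TAGGED + ' ' + TAGGED

leaked = first_try.findall(none_before)                    # matches AFTER the stop
kept = [m for m in revised.findall(none_before) if m]      # drop empty strings
normal = revised.findall(before)

print(leaked)   # → ['Catalina 320', 'Catalina 320']
print(kept)     # → []
print(normal)   # → ['<tag>Catalina 320</tag>', '<tag>Catalina 320</tag>']
```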

edited Jun 20 '20 at 09:12

Community


answered Jun 25 '11 at 13:56

eyquem


If you want to post a question, an answer isn't the right place for it. The question above is a little ambiguous, as questions often are when they provide little context and only a generic example. But it seems that he is trying to parse either html or xml. Maybe it's a regular expression example for school. But _if_ he is trying to parse html, you might justify using regex for very isolated cases, but it's generally considered a bad idea: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – ashwoods Jun 25 '11 at 14:23
@ashwoods I don't know whether you were aware of it, but I find it unfair to make it appear that I have a question, hence a problem. NO, I have no question, I have no problem, and I am not using an answer to ask an independent question. I consider that a regex is quite good for THIS problem, and I ask a question relating to the problem as stated by the OP. – eyquem Jun 25 '11 at 15:53
@ashwoods The OP's question isn't ambiguous. The problem is very easy to understand, and with regex very easy to solve. Mike Pennington and you WANT to use **lxml** because you WANT to NOT use regexes, that's why you are obliged to modify the text, because it isn't an XML/HTML text! _"When you have a hammer, all the problems seem to be nails"_ This saying matches the implied Zawinski quote, taken as a universal saying, that justifies the very common bias leading a lot of people to automatically think that they must FIRST try to avoid the regex tool. – eyquem Jun 25 '11 at 15:56
@ashwoods There is no relation between bobince's very good post 1732348 and the OP's problem, since the OP's problem is independent of the nature of the analyzed text (XML/HTML or not): the tags don't matter for the problem. By the way, bobince's post is 95 % literary. The 5 % is in this: _"HTML is not a regular language and hence cannot be parsed by regular expressions."_ I accept this explanation. But I dispute that it applies to the OP's question: he doesn't want to parse a text, he wants to find some occurrences, that's all. – eyquem Jun 25 '11 at 16:01
@eyequem, your timed solutions are not directly comparable; my answer includes the tags, which is what the OP's regexp is doing. Besides, real pythonistas use `timeit` to measure performance; furthermore, you are cheating because you compiled the regex outside the timing loop ;-) – Mike Pennington Jun 25 '11 at 17:07
@Mike Pennington **my answer includes the tags,** YES **which is what the OP's regexp is doing** NO: _"If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. "_ docs.python.org/library/re.html#re.findall . It would be more correctly expressed like this: _If one group is defined with parentheses in the pattern compiled to create a RegexObject, return a list of the matched groups that are present in the analysed string. If more than one group is defined in the pattern, return a list of tuples of groups_ – eyquem Jun 26 '11 at 10:39
@Mike Pennington By the way, I realize that your solution catches the elements having markups **'<td>...</td>'** as well as the elements having markups **'<tag>...</tag>'**, while the OP's question says he wants only the elements of one type of markup. – eyquem Jun 26 '11 at 10:50
@Mike Pennington Moreover, I modified the patterns used to define regexes in my solutions, writing '(Catalina 320)' instead of '(Catalina 320)'. Evidently, there is no difference obtained in the new timings I performed with this. When I wrote the code to compare the timings, I took attention to not measure uninteresting and uncomparable portions of time: I put the definition of regex like ``ragx = re.compile('(Catalina 320)|(These boats)')`` and the definitions of **parser** and **etree** in your code outside of the block whose execution's time is measured. – eyquem Jun 26 '11 at 11:27
@Mike Pennington Surely, I am not a real pythonista, but I prefer to employ **clock** because I feel no need to use **timeit**. I used **timeit** in the past, but I always had difficulty remembering its precise syntax without looking in the docs, and I noticed that the results were not more accurate than those obtained with **clock**. Using **timeit** to perform the comparison I did won't give different results from those I obtained; that's what I believe. It needs to be verified. If you want to do the comparison with the help of **timeit**, I will be interested, but I leave that to you – eyquem Jun 26 '11 at 11:34
@eyquem: First, I love and use regular expressions. Second, I understood your post quite well; I was just criticizing the fact that if you copy it verbatim and paste it as a question here on stack, it would be the following question: why would I want to use a parser instead of regex here if regex is faster? Do so, and you might get some interesting answers. – ashwoods Jun 26 '11 at 20:02
@eyquem: Yes, the question might be very straightforward, but only if you ignore that problems/questions have a context, and this question provides little. What I said is: **if** he is trying to parse html, if his code -the code we see, and don't see- deals or might deal with html/xml semantics, now or in the future, then it is correct to point out that lxml probably is a better tool, and I pointed to that particular link because *that* question has already been discussed on the web and on stack over and over again... – ashwoods Jun 26 '11 at 20:28
