Find following tag with pyparsing

Question

I'm using pyparsing to parse HTML. I'm grabbing all <embed> tags, but in some cases an <a> tag follows directly afterward, and I also want to grab that when it's available.

example:

import pyparsing
target = pyparsing.makeHTMLTags("embed")[0]
target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
target.ignore(pyparsing.htmlComment)

result = target.searchString(""".....
   <object....><embed>.....</embed></object><br /><a href="blah">blah</a>
   """)

I haven't been able to find any character offset in the result objects, otherwise I could just grab a slice of the original input string and work from there.

EDIT:

Someone asked why I don't use BeautifulSoup. That's a good question, let me show you why I chose not to use it with a code sample:

import BeautifulSoup
import urllib
import re
import socket

socket.setdefaulttimeout(3)

# get some random blogs
xml = urllib.urlopen('http://rpc.weblogs.com/shortChanges.xml').read()

success, failure = 0.0, 0.0

for url in re.compile(r'\burl="([^"]+)"').findall(xml)[:30]:
    print url
    try:
        BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
    except IOError:
        pass
    except Exception, e:
        print e
        failure += 1
    else:
        success += 1


print failure / (failure + success)

When I try this, BeautifulSoup fails with parse errors 20-30% of the time. These aren't rare edge cases. pyparsing is slow and cumbersome but it hasn't blown up no matter what I throw at it. If I can be enlightened as to a better way to use BeautifulSoup then I would be really interested in knowing that.

That's very odd: I used your exact code, ran it at three different times, and it successfully parsed all 90 URLs that resulted. I'm on Python 2.5.4 on Windows with BeautifulSoup 3.0.7a. What errors are you seeing? — Ned Batchelder, Nov 20 '09 at 16:53
python 2.5.1 on OS X, BeautifulSoup 3.1.0.1 . Most common errors are `bad end tag: u""` and `malformed start tag`. — ʞɔıu, Nov 20 '09 at 16:57

score 5 · Accepted Answer · edited Nov 20 '09 at 15:09


If you are also interested in an <a> tag when one follows an <embed> tag, add it to your search pattern as an optional element:

embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]
target = embedTag + pyparsing.Optional(aTag)
result = target.searchString(""".....   
    <object....><embed>.....</embed></object><br /><a href="blah">blah</a>
    """)

print result.dump()
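To make the behaviour of the Optional pattern concrete, here is a minimal sketch. The two input strings and the helper name `embed_and_link` are made up for illustration; the pyparsing calls are the same camelCase ones used in this thread. It is meant to show that the <a> tag is only picked up when it comes immediately after the <embed> open tag.

```python
import pyparsing as pp

embed_tag = pp.makeHTMLTags("embed")[0]
a_tag = pp.makeHTMLTags("a")[0]
target = embed_tag + pp.Optional(a_tag)


def embed_and_link(html):
    """Return (src, href) pairs; href is None when no <a> directly follows."""
    return [(t.get("src"), t.get("href")) for t in target.searchString(html)]


# Hypothetical inputs: in the first, <a> comes right after <embed>;
# in the second, other tags sit in between, as in the question's sample.
adjacent = '<embed src="a.swf"><a href="first">x</a>'
separated = '<embed src="b.swf"></embed></object><br /><a href="second">y</a>'

print(embed_and_link(adjacent))
print(embed_and_link(separated))
```

In the second input the Optional simply matches nothing, so only the embed attributes come back.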

If you want to capture the character location of an expression within your parser, insert one of these, with a results name:

loc = pyparsing.Empty().setParseAction(lambda s,locn,toks: locn)
target = (loc("beforeEmbed") + embedTag + loc("afterEmbed") +
          pyparsing.Optional(aTag))
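As a rough sketch of how those locations can be used to slice the original input, which is what the question was after: the sample HTML and the helper names `spans_from_markers` and `spans_from_scan` below are assumptions for illustration. The second helper uses `scanString`, which yields the start and end offsets of each match directly and may be enough on its own.

```python
import pyparsing as pp

embed_tag = pp.makeHTMLTags("embed")[0]

# Empty() consumes no input; its parse action returns the current location,
# which is then stored under whatever results name the marker is given.
loc = pp.Empty().setParseAction(lambda s, locn, toks: locn)
target = loc("beforeEmbed") + embed_tag + loc("afterEmbed")

# Hypothetical sample input.
html = '<p>intro</p><embed src="movie.swf"></embed><a href="blah">blah</a>'


def spans_from_markers(text):
    """(start, end) offsets taken from the named location markers."""
    return [(t["beforeEmbed"], t["afterEmbed"]) for t in target.searchString(text)]


def spans_from_scan(text):
    """(start, end) offsets as reported by scanString for each match."""
    return [(start, end) for _tokens, start, end in embed_tag.scanString(text)]


for start, end in spans_from_markers(html):
    print(html[start:end])  # the matched <embed ...> open tag
    print(html[end:])       # the rest of the input, to work from
```

One thing to watch: a marker's location is taken after pyparsing skips leading whitespace, so when whitespace follows the tag, the "afterEmbed" offset may land past it.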

edited Nov 20 '09 at 15:09

ʞɔıu


answered Nov 20 '09 at 04:02

PaulMcG


The loc thing worked but I couldn't seem to get the Optional thing to work. Are you sure that code sample works? – ʞɔıu Nov 20 '09 at 16:08
Well *that* example doesn't work, because the `<a>` tag *doesn't* immediately follow the `<embed>` tag. I didn't follow what you meant by *follow*. What do you mean by *follow*? – PaulMcG Nov 20 '09 at 21:21
In the example, the embed tag is followed by some stuff, shown by ellipses, a close-embed tag, a close-object tag, an empty BR tag, and *then* the A tag. – PaulMcG Nov 20 '09 at 21:23
could I do something like embedTag + skipTo(endEmbedTag) + Optional(endObjectTag + brTag + aTag) ? – ʞɔıu Nov 21 '09 at 22:13
`embedTag + pyparsing.SkipTo(endEmbedTag, include=True) + pyparsing.Optional(endObjectTag + brTag + aTag)` should work for *this specific case*. But I would not be surprised if your HTML had other tags in there in unpredictable places. If you want to match an `<embed>` tag that is followed by an `<a>` tag, this might be a little more robust: `embedTag + pyparsing.SkipTo(aTag, failOn=embedTag) + aTag | embedTag`. In this case, SkipTo advances directly to the next aTag, but fails if there is another embedTag found first. But I'm in Pure Speculation Land here, so you have to fill in the rest. – PaulMcG Nov 21 '09 at 23:33
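The SkipTo/failOn suggestion from the last comment might look like the sketch below. The sample HTML and the helper name `embeds_with_links` are invented for illustration, and the code assumes a pyparsing version that accepts the `failOn` keyword on `SkipTo`. It pairs each embed with the next <a> tag unless another embed turns up first.

```python
import pyparsing as pp

embed_tag = pp.makeHTMLTags("embed")[0]
a_tag = pp.makeHTMLTags("a")[0]

# First alternative: an embed, then anything up to the next <a>, giving up
# if another <embed> is seen first. Second alternative: a lone embed.
pattern = (embed_tag + pp.SkipTo(a_tag, failOn=embed_tag) + a_tag) | embed_tag


def embeds_with_links(html):
    """Return (src, href) pairs; href is None for an embed with no link."""
    return [(t.get("src"), t.get("href")) for t in pattern.searchString(html)]


# Hypothetical input: the first embed has no link before the next embed,
# the second is followed (after some other tags) by an <a>.
html = (
    '<embed src="one.swf"></embed>'
    '<object><embed src="two.swf"></embed></object><br />'
    '<a href="link">blah</a>'
)

print(embeds_with_links(html))
```

The parentheses only spell out the grouping that `+` and `|` already give the one-line expression in the comment.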

score 1 · Answer 2 · answered Nov 20 '09 at 01:02


Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any job HTMLParser can't.

answered Nov 20 '09 at 01:02

Ned Batchelder



I know what pyparsing is, I just wonder why you would use it for the messy job of parsing HTML when existing specialized modules already exist. – Ned Batchelder Nov 20 '09 at 01:16
I updated the question with the reason why I don't use BeautifulSoup. Short answer: because BeautifulSoup gets lots of parse errors, but I don't have the same problem with pyparsing. If there's a better way to use BeautifulSoup that I don't know about or there's something else I'm missing I would be really interested in learning about that, however. – ʞɔıu Nov 20 '09 at 15:14

score 1 · Answer 3 · answered Nov 20 '09 at 15:21


Would you rather not use a plain regex, or is it because parsing HTML that way is a bad habit? :D

re.findall("<object.*?</object>(?:<br /><a.*?</a>)?",a)

answered Nov 20 '09 at 15:21

YOU



everyone on SO now knows that parsing HTML with regex is a crime against Man; cite: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – ʞɔıu Nov 20 '09 at 15:35
:D I see, that's been my first impression too in the 2 days since I joined. – YOU Nov 20 '09 at 15:45
I'm actually not opposed to using regex per se, but I've used that approach in the past and I'm trying to learn a better way. I could do that but I would still need/want a parser to parse out the HTML attributes, etc, and I may end up using a hybrid approach using a little of both. – ʞɔıu Nov 20 '09 at 16:06

score 1 · Answer 4 · answered Nov 20 '09 at 19:48


I was able to run your BeautifulSoup code and received no errors. I'm running BeautifulSoup 3.0.7a

Please use BeautifulSoup 3.0.7a; 3.1.0.1 has bugs that prevent it from working at all in some cases (such as yours).

answered Nov 20 '09 at 19:48

gibson


Would have added this as a comment on the first question but I don't have enough rep. – gibson Nov 20 '09 at 19:49
