19

I tried soup.find('!--') but it doesn't seem to work. Thanks in advance.

Edit: Thanks for the tip on how to find all comments. I have a follow-up question: how do I search for one specific comment?

For example, I have the following comment tag:

<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->

I really just want this part: <i>Wednesday 110518</i>. The "110518" is the date in YYMMDD format, which I'm planning to use as my search target. However, I don't know how to find something within a specific comment tag.

1stsage

2 Answers

24

You can find all the comments in a document via the findAll method. See the "Removing elements" example, which shows how to do exactly what you're trying to do.

In brief, you want this:

comments = soup.findAll(text=lambda text:isinstance(text, Comment))

Edit: If you're trying to search within the comments, you can try:

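import re
# Assumes Comment is already imported from BeautifulSoup (e.g. `from bs4 import Comment`)
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
for comment in comments:
  # re.search rather than re.match: the <i> tag is not at the start of the comment
  m = re.search(r'<i>([^<]*)</i>', comment)
  if m:
    print(m.group(1))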
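
Putting the pieces together for the comment in the question, a minimal self-contained sketch might look like the following. It assumes BeautifulSoup 4 (the bs4 package, where find_all and string= are the current spellings of findAll and text=). The helper name find_dated_comment and the sample HTML are made up for illustration.

```python
import re
from bs4 import BeautifulSoup, Comment

# Sample document (hypothetical) containing the comment from the question
html = """
<div>
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
<!-- an unrelated comment -->
<p>visible text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def find_dated_comment(soup, yymmdd):
    """Return the <i>...</i> text of the first comment containing yymmdd, or None."""
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # A Comment is a str subclass, so plain substring tests work on it
        if yymmdd in comment:
            m = re.search(r"<i>([^<]*)</i>", comment)
            if m:
                return m.group(1)
    return None


print(find_dated_comment(soup, "110518"))
```

Filtering on the date string first keeps the regular expression from matching an <i> tag in some other comment.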
yan
  • How about searching for a specific comment? I'm trying to search for this in the html file: <!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> --> Notice the 110518, that's just the date in yymmdd. How can I search just for the information within that comment tag, and specifically just within the <i> tag? – 1stsage May 19 '11 at 17:19
  • @1stsage Perhaps you want to add that requirement to your question. – Steven Rumbalski May 19 '11 at 17:23
  • 1stsage, updated my post for your specific case. Next time, make sure your question encompasses what you're trying to do. – yan May 19 '11 at 17:28
  • @1stsage With regard to searching the contents of the comment, if it's valid html you could parse that as well. Or you could use string methods or even regular expressions. With such a small blob of text and simple requirement I'd settle for a regular expression (something like `r'<i>(.*?)</i>'`). – Steven Rumbalski May 19 '11 at 17:31
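
The "parse that as well" suggestion in the last comment can be sketched as follows: the text of each comment is fed back into BeautifulSoup as its own small document. This assumes BeautifulSoup 4 and that the comment body is valid HTML. The function name italic_text_in_comments and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup, Comment

# Sample markup (hypothetical) with the comment from the question
html = (
    '<p>x</p>'
    '<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->'
)

soup = BeautifulSoup(html, "html.parser")


def italic_text_in_comments(soup, css_class):
    """Collect <i> text from spans of the given class found inside comments."""
    results = []
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # str(comment) is the comment body without the <!-- --> delimiters
        inner = BeautifulSoup(str(comment), "html.parser")
        for span in inner.find_all("span", class_=css_class):
            i_tag = span.find("i")
            if i_tag is not None:
                results.append(i_tag.get_text())
    return results


print(italic_text_in_comments(soup, "titlefont"))
```

Compared with a regular expression, this also lets you filter on the span's class, at the cost of a second parse per comment.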
0

Pyparsing allows you to search for HTML comments using a builtin htmlComment expression, and attach parse-time callbacks to validate and extract the various data fields within the comment:

from pyparsing import makeHTMLTags, oneOf, withAttribute, Word, nums, Group, htmlComment
import calendar

# have pyparsing define tag start/end expressions for the 
# tags we want to look for inside the comments
span,spanEnd = makeHTMLTags("span")
i,iEnd = makeHTMLTags("i")

# only want spans with class=titlefont
span.addParseAction(withAttribute(**{'class':'titlefont'}))

# define what specifically we are looking for in this comment
weekdayname = oneOf(list(calendar.day_name))
integer = Word(nums)
dateExpr = Group(weekdayname("day") + integer("daynum"))
commentBody = '<!--' + span + i + dateExpr("date") + iEnd

# define a parse action to attach to the standard htmlComment expression,
# to extract only what we want (or raise a ParseException in case 
# this is not one of the comments we're looking for)
def grabCommentContents(tokens):
    return commentBody.parseString(tokens[0])
htmlComment.addParseAction(grabCommentContents)


# let's try it
htmlsource = """
want to match this one
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->

don't want the next one, wrong span class
<!-- <span class="bodyfont"> <i>Wednesday 110519</i>(05:00PM)<br /></span> -->

not even a span tag!
<!-- some other text with a date in italics <i>Wednesday 110520</i>(05:00PM)<br /></span> -->

another matching comment, on a different day
<!-- <span class="titlefont"> <i>Thursday 110521</i>(05:00PM)<br /></span> -->
"""

for comment in htmlComment.searchString(htmlsource):
    parsedDate = comment.date
    # date info can be accessed like elements in a list
    print(parsedDate[0], parsedDate[1])
    # because we named the expressions within the dateExpr Group
    # we can also get at them by name (this is much more robust, and 
    # easier to maintain/update later)
    print(parsedDate.day)
    print(parsedDate.daynum)
    print()

Prints:

Wednesday 110518
Wednesday
110518

Thursday 110521
Thursday
110521
PaulMcG
  • The latest version of pyparsing now includes `withClass` to simplify that `withAttribute` ugliness. – PaulMcG Apr 20 '16 at 01:16
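
As a rough sketch of what that comment describes, the class filter could be written with withClass instead of withAttribute. This assumes a pyparsing release that includes withClass (pyparsing 3 also keeps the camelCase names used here as compatibility aliases). The sample string is made up.

```python
from pyparsing import makeHTMLTags, withClass

# Build start/end tag expressions for <span>, as in the answer above
span, spanEnd = makeHTMLTags("span")

# Reject any <span> whose class attribute is not "titlefont"
span.addParseAction(withClass("titlefont"))

# Hypothetical sample input with one matching and one non-matching span
sample = (
    '<span class="titlefont">keep</span> '
    '<span class="bodyfont">skip</span>'
)

matches = span.searchString(sample)
print(len(matches))
```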