I'm trying to extract every HTML tag that contains a match for a regular expression. For example, suppose I want to get every tag containing the string "name" and I have an HTML document like this:

<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>

I could probably use a regular expression to catch every match between an opening and closing "<>". However, I'd like to be able to traverse the parsed tree based on those matches, so I can get the siblings, parents, or 'nextElements'. In the example above, that amounts to getting <head>*</head> or maybe <h2>*</h2> once I know they're parents or siblings of a tag containing the match.

I tried BeautifulSoup, but it seems to be useful mainly when you already know what kind of tag you're looking for, or when you search based on its contents. In this case, I want to get a match first, take that match as a starting point, and then navigate the tree as BeautifulSoup and other HTML parsers can.

Suggestions?

r_31415
  • Using regex on HTML is HARD. I don't suggest you go down this path. What are you trying to do with the HTML? See this article: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – AdamV Feb 09 '12 at 19:55
  • I don't think you've really thought this through. What about: `` or `My name is beerbajay`? What do you expect to be returned? – beerbajay Feb 09 '12 at 20:05
  • @beerbajay No, it's OK if I retrieve that input tag (since it contains 'name'). Obviously, my real example is not using 'name' as a match. – r_31415 Feb 09 '12 at 20:25
  • @AdamD Thanks for the link :-). I'm trying to get a match and obtain some content near that match to serve as context for further analysis. Using HTML tags makes it much more elegant even if it turns out to be more difficult. – r_31415 Feb 09 '12 at 20:27

2 Answers

Use lxml.html. It's a great parser, and it supports XPath, which can express this kind of query compactly.

The example below uses this XPath expression:

//*[contains(text(),'name')]/parent::*/following-sibling::*[1]/*[@class='name']/text()

That means, in English:

Find me any tag that contains the word 'name' in its text, then get the parent, and then the next sibling, and find inside that any tag with the class 'name' and finally return the text content of that.

The result of running the code is:

['This is also a tag to be retrieved']

Here's the full code:

text = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""

import lxml.html

# Parse the document; $stuff is an XPath variable bound through the keyword argument.
doc = lxml.html.fromstring(text)
print(doc.xpath('//*[contains(text(), $stuff)]/parent::*/'
    'following-sibling::*[1]/*[@class=$stuff]/text()', stuff='name'))

Obligatory reading: the "please don't parse HTML with regex" answer is here: https://stackoverflow.com/a/1732454/17160
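
If the goal is to match "name" in either a text node or an attribute value and then walk the tree from each hit, something along these lines might work. This is only a sketch: the variable name `needle` and the `rows` list are made up for illustration, and the XPath predicate is one possible way to phrase the condition, not the only one. `getparent()` and `getnext()` are the lxml element methods for moving to the parent and the next sibling.

```python
import lxml.html

text = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""

doc = lxml.html.fromstring(text)

# Select elements that have a direct text node containing the needle,
# or any attribute whose value contains the needle.
matches = doc.xpath(
    "//*[text()[contains(., $needle)] or @*[contains(., $needle)]]",
    needle="name",  # hypothetical variable name, bound as an XPath variable
)

rows = []
for el in matches:
    parent = el.getparent()   # None for the root element
    sibling = el.getnext()    # None when there is no following sibling
    rows.append((
        el.tag,
        parent.tag if parent is not None else None,
        sibling.tag if sibling is not None else None,
    ))

for row in rows:
    print(row)
```

Each matched element is an ordinary lxml element, so the parent or sibling can be serialized or searched further in the same way.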

nosklo
  • Oh, that looks great (despite the scary syntax). Just to be sure, what I want is: "Find me any tag that contains the word 'name' (I don't care whether it's in a TextNode or in an attribute, whatever), then get the parent, and then the next sibling. Do the same with any other tag containing the word 'name'." So I think I don't need the last part matching text from a tag with "class='name'". And text() works for every part of the tag, not only its TextNode, right? – r_31415 Feb 09 '12 at 20:22
  • Uhm, looks like "contains(text(), $stuff)..." only gets the first tag. It should retrieve also the same because it has "name" in it. I tried "contains(*, $stuff)..." but I only get the first two tags (html, head). Do you know how to also get the second tag? – r_31415 Feb 09 '12 at 23:45
  • I think this solves it: doc.xpath("//*[contains(text(),'name')]|//*[@*='name']") – r_31415 Feb 10 '12 at 00:08

Assuming a tag should match when either of these conditions holds:

  • The match occurs in the value of an attribute on the tag
  • The match occurs in a text node which is a direct child of the tag

you can use Beautiful Soup:

from bs4 import BeautifulSoup
from bs4 import NavigableString
import re

html = '''<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>'''

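soup = BeautifulSoup(html, "html.parser")
p = re.compile("name")

def match(patt):
    """Build a filter accepting tags whose own text or attribute values match patt."""
    def closure(tag):
        # Text nodes that are direct children of the tag.
        for c in tag.contents:
            if isinstance(c, NavigableString) and patt.search(str(c)):
                return True
        # Attribute values; multi-valued attributes such as class are lists.
        for v in tag.attrs.values():
            values = v if isinstance(v, list) else [v]
            if any(patt.search(item) for item in values):
                return True
        return False
    return closure

for t in soup.find_all(match(p)):
    print(t)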

Output:

<title>This tag includes 'name', so it should be retrieved</title>
<h1 class="name">This is also a tag to be retrieved</h1>
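
Since the question also asks about navigating from each match, here is a rough sketch of how that could look with bs4. The helper name `own_text_or_attr_matches` and the `context` list are hypothetical; `.parent` and `find_next_sibling()` are the Beautiful Soup navigation calls used to reach the enclosing tag and the next sibling tag.

```python
import re
from bs4 import BeautifulSoup, NavigableString

html = '''<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>'''

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("name")

def own_text_or_attr_matches(tag):
    """Hypothetical filter: True if a direct text child or an attribute value matches."""
    for child in tag.contents:
        if isinstance(child, NavigableString) and pattern.search(str(child)):
            return True
    for value in tag.attrs.values():
        # Multi-valued attributes (for example class) are stored as lists.
        items = value if isinstance(value, list) else [value]
        if any(pattern.search(item) for item in items):
            return True
    return False

context = []
for tag in soup.find_all(own_text_or_attr_matches):
    sibling = tag.find_next_sibling()  # next sibling tag, skipping whitespace text
    context.append((
        tag.name,
        tag.parent.name,
        sibling.name if sibling is not None else None,
    ))

for entry in context:
    print(entry)
```

From there, `tag.parent` or the sibling can be printed or searched again like any other tag.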
beerbajay
  • Thanks for your answer. Are you sure about this code? Shouldn't it be 'findAll' instead of 'find_all'? Still, I get the following error: "AttributeError: 'list' object has no attribute 'values'". I think you're not passing any value to closure(tag). – r_31415 Feb 09 '12 at 21:02
  • Sorry, I didn't mention, the code uses `bs4`, which is BeautifulSoup4, which is newly released. – beerbajay Feb 09 '12 at 21:55
  • Is that it? I changed "from bs4 import NavigableString" to "from BeautifulSoup import NavigableString", and it didn't complain, but the AttributeError remains. – r_31415 Feb 09 '12 at 22:18
  • You could just install BeautifulSoup4 and use the code as-is: `easy_install beautifulsoup4` – beerbajay Feb 09 '12 at 23:25
  • Unfortunately, I can't. I'm stuck with Python 2.7 and bs4 only works for Python 3+, right? – r_31415 Feb 10 '12 at 00:09
  • This is what I read: "If you're using Python 3.x, you must use Beautiful Soup 4. If you like trying new things, I recommend you give the beta a spin. Otherwise, I recommend you use Beautiful Soup 3.2 until the beta period ends." (http://www.crummy.com/software/BeautifulSoup/) – r_31415 Feb 10 '12 at 00:47
  • I installed it on python 2.7 without any problem. Also, the documentation states: "The examples in this documentation should work the same way in Python 2.7 and Python 3.2." so I don't think it matters. You could, of course, invest a little time in figuring out how to backport this to bs3. It shouldn't be difficult. – beerbajay Feb 10 '12 at 07:42
  • Great to know that. Now I have more options. Thanks a lot. – r_31415 Feb 10 '12 at 17:13