I'm trying to extract every HTML tag that contains a match for a regular expression. For example, suppose I want to get every tag containing the string "name" and I have an HTML document like this:
<html>
<head>
<title>This tag includes 'name', so it should be retrieved</title>
</head>
<body>
<h1 class="name">This is also a tag to be retrieved</h1>
<h2>Generic h2 tag</h2>
</body>
</html>
I could probably use a regular expression to catch every match between an opening and closing "<>". However, I'd also like to be able to traverse the parsed tree starting from those matches, so I can get the siblings, parents, or 'nextElements'. In the example above, that amounts to getting <head>*</head> or maybe <h2>*</h2> once I know they are parents or siblings of a tag containing the match.
I tried BeautifulSoup, but it seems most useful when you already know which kind of tag you're looking for, or when you search by a tag's contents. In this case, I want to get a match first, take that match as a starting point, and then navigate the tree the way BeautifulSoup and other HTML parsers allow.
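To make the goal concrete, here is a rough sketch of the "match first, navigate later" behavior, using only Python's standard-library `html.parser` and `re`. The names `Node`, `TreeBuilder`, and `find_matching` are made up for illustration, and the sketch assumes well-nested markup with no unclosed void elements such as `<br>`, so it is not a robust parser. A tag counts as a match when the regex hits its own text or one of its attribute values.

```python
import re
from html.parser import HTMLParser


class Node:
    """One element of a minimal tree (hypothetical helper, not a library class)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element, not its descendants

    def siblings(self):
        """Return the other children of this node's parent."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]


class TreeBuilder(HTMLParser):
    """Builds Node objects with parent/child links while parsing."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]", [])
        self.current = self.root
        self.nodes = []  # every element, in document order

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.nodes.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Assumption: markup is well nested, so closing a tag means going up one level.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def find_matching(html, pattern):
    """Return every element whose own text or attribute values match `pattern`."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    regex = re.compile(pattern)

    def hit(node):
        values = [node.text] + [v or "" for v in node.attrs.values()]
        return any(regex.search(v) for v in values)

    return [n for n in builder.nodes if hit(n)]


doc = """<html>
<head>
<title>This tag includes 'name', so it should be retrieved</title>
</head>
<body>
<h1 class="name">This is also a tag to be retrieved</h1>
<h2>Generic h2 tag</h2>
</body>
</html>"""

matches = find_matching(doc, r"name")
for node in matches:
    print(node.tag, "parent:", node.parent.tag,
          "siblings:", [s.tag for s in node.siblings()])
# title parent: head siblings: []
# h1 parent: body siblings: ['h2']
```

Each match keeps its links into the tree, so the parent `<head>` or the sibling `<h2>` can be reached from it afterwards.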
Suggestions?
`My name is beerbajay`? What do you expect should be returned? – beerbajay Feb 09 '12 at 20:05