This one is for legitimate lxml gurus. I have a web scraping application where I want to iterate over a number of div.content (content is the class) tags on a website. Once in a div.content tag, I want to see if there are any <a> tags that are the children of <h3> elements. This seems relatively simple by just trying to create a list using XPath from the div.cont tag, i.e.,
linkList = tree.xpath('//div[contains(@class,"cont")]//h3//a')
The problem is, I then want to create a tuple that contains the link from the div.content box as well as the text from the paragraph element of the same div.content box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but I wouldn't have any real way of matching the appropriate paragraphs to the <a> tags.
lxml's Element.iter() function could ALMOST achieve this by iterating over all of the div.cont elements, ignoring those without <a> tags, and pairing up the paragraph/a combos, but unfortunately there doesn't seem to be any option for iterating over class names, only tag names, with that method.
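To make the limitation concrete, here is a rough sketch of how a class check might be layered on top of Element.iter(), which itself filters by tag name only. The helper name iter_divs_with_class and the sample markup are made up for illustration, and the class name "cont" is an assumption.

```python
from lxml import html

# Hypothetical sample markup, only for illustration.
SAMPLE = """
<body>
<div class="cont"><p>First</p></div>
<div class="sidebar"><p>Ignored</p></div>
<div class="cont wide"><p>Second</p></div>
</body>
"""


def iter_divs_with_class(root, class_name):
    """Yield every <div> under root whose class attribute contains class_name."""
    # Element.iter() can only filter on the tag name ...
    for div in root.iter("div"):
        # ... so the class attribute is checked by hand, token by token.
        classes = (div.get("class") or "").split()
        if class_name in classes:
            yield div


# document_fromstring always returns the <html> root element.
tree = html.document_fromstring(SAMPLE)
matching = list(iter_divs_with_class(tree, "cont"))
print([div.findtext("p") for div in matching])
```

Splitting the attribute into tokens keeps a class such as "contact" from matching "cont", which a plain substring test would allow.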
Edit: here's an extremely stripped down version of the HTML I want to parse:
<body>
<div class="cont">
<h1>Random Text</h1>
<p>The text I want to obtain</p>
<h3><a href="somelink">The link I want to obtain</a></h3>
</div>
</body>
There are a number of div.conts like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.
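For reference, one way the pairing could be sketched is to select each div first and then run relative XPath queries (starting with a dot) against that div, so the paragraph text and the link always come from the same box. This is only a sketch under assumptions: the function name pair_text_with_links is hypothetical, the class is taken to be "cont" as in the stripped-down HTML, and the second div is added just to show a box without a link being skipped.

```python
from lxml import html

# The stripped-down HTML from above, plus one made-up div without a link.
SAMPLE = """
<body>
<div class="cont">
<h1>Random Text</h1>
<p>The text I want to obtain</p>
<h3><a href="somelink">The link I want to obtain</a></h3>
</div>
<div class="cont">
<p>A box with no link</p>
</div>
</body>
"""

# Matches divs whose class attribute contains the exact token "cont".
DIV_XPATH = '//div[contains(concat(" ", normalize-space(@class), " "), " cont ")]'


def pair_text_with_links(root):
    """Return (href, paragraph_text) tuples, one per <h3>/<a> link in each div."""
    pairs = []
    for div in root.xpath(DIV_XPATH):
        # The leading dot makes the query relative to this div only.
        hrefs = div.xpath(".//h3/a/@href")
        if not hrefs:
            continue  # skip boxes that have no <a> directly under an <h3>
        text = " ".join(p.text_content().strip() for p in div.xpath(".//p"))
        for href in hrefs:
            pairs.append((str(href), text))
    return pairs


tree = html.document_fromstring(SAMPLE)
pairs = pair_text_with_links(tree)
print(pairs)
```

If a div holds several paragraphs, this sketch joins them with spaces. Whether that is the right choice depends on the real pages.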