Iterating Over Elements and Sub Elements With lxml

Question

This one is for legitimate lxml gurus. I have a web scraping application where I want to iterate over a number of div.content tags (content is the class) on a website. Once inside a div.content tag, I want to see whether there are any <a> tags that are children of <h3> elements. This seems relatively simple: just build a list using XPath from the div.cont tag, i.e.,

linkList = tree.xpath('div[contains(@class,"cont")]//h3//a')

The problem is, I then want to create a tuple that contains the link from the div.content box as well as the text from the paragraph element of the same div.content box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but I wouldn't have any real way of matching the appropriate paragraphs to the <a> tags.

lxml's Element.iter() function could ALMOST achieve this by iterating over all of the div.cont elements, ignoring those without <a> tags, and pairing up the paragraph/a combos, but unfortunately there doesn't seem to be any option for iterating over class names, only tag names, with that method.

Edit: here's an extremely stripped down version of the HTML I want to parse:

<body>
<div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3><a href="somelink">The link I want to obtain</a></h3>
</div>
</body>

There are a number of div.conts like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.

can you post some sample HTML? – isedev Jan 28 '13 at 22:02
okay, just posted a rough example – user1427661 Jan 28 '13 at 22:07

score 3 · Accepted Answer · edited May 23 '17 at 11:51

You could just use a less specific XPath expression:

for matchingdiv in tree.xpath('.//div[contains(@class,"cont")]'):
    # skip those without a h3 > a setup.
    link = matchingdiv.xpath('.//h3//a')
    if not link:
        continue

    # grab the `p` text and of course the link.
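
To make the loop above concrete, here is a minimal sketch that fills in the "grab" step against a slightly extended version of the sample HTML from the question. The helper name pair_links_with_text, the second link-less div, and the choice to take only the first <p> and first link in each div are assumptions made for illustration, not part of the original answer.

```python
from lxml import html

# Sample markup modeled on the question, plus one div without a link.
SAMPLE = """
<html><body>
<div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3><a href="somelink">The link I want to obtain</a></h3>
</div>
<div class="cont">
    <p>No link in this one</p>
</div>
</body></html>
"""


def pair_links_with_text(tree):
    """Return (href, paragraph text) tuples, one per div that has an h3 link."""
    pairs = []
    for div in tree.xpath('//div[contains(@class, "cont")]'):
        # Skip divs that have no <a> anywhere below an <h3>.
        links = div.xpath('.//h3//a')
        if not links:
            continue
        # Assumption: the first <p> in the div is the text to pair with.
        paragraphs = div.xpath('.//p')
        text = paragraphs[0].text_content().strip() if paragraphs else ''
        pairs.append((links[0].get('href'), text))
    return pairs


tree = html.document_fromstring(SAMPLE)
pairs = pair_links_with_text(tree)
print(pairs)
```

Note that contains(@class, "cont") is a substring test, so it would also match classes such as "content" or "container"; whether that is acceptable depends on the page being scraped.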

You could be more ambitious and select the h3 > a tags first, then go up to the div.cont ancestor (based on "XPath query with descendant and descendant text() predicates"):

for matchingdiv in tree.xpath('.//h3//a/ancestor::*[self::div[contains(@class,"cont")]]'):
    # no need to skip anymore, this is a div.cont with h3 and a contained
    link = matchingdiv.xpath('.//h3//a')

    # grab the `p` text and of course the link

but since you need to then scan for the link anyway that doesn't actually buy you anything.
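
If starting from the links is still attractive, one possible variation (not from the answer above) is to iterate over the links themselves and walk up to the nearest div.cont from each one, which avoids the second scan for the link. The sketch below assumes made-up sample markup and emits one tuple per link rather than one per div.

```python
from lxml import html

# Hypothetical markup: one div with a link, one without, one with two links.
SAMPLE = """
<html><body>
<div class="cont">
    <p>First paragraph</p>
    <h3><a href="link-one">One</a></h3>
</div>
<div class="cont">
    <p>Second paragraph, no link here</p>
</div>
<div class="cont">
    <p>Third paragraph</p>
    <h3><a href="link-three">Three</a></h3>
    <h3><a href="link-four">Four</a></h3>
</div>
</body></html>
"""

doc = html.document_fromstring(SAMPLE)

pairs = []
for link in doc.xpath('//h3//a'):
    # ancestor is a reverse axis, so [1] selects the nearest matching div.
    containers = link.xpath('ancestor::div[contains(@class, "cont")][1]')
    if not containers:
        continue
    # Assumption: the first <p> in that div is the text to pair with.
    paragraphs = containers[0].xpath('.//p')
    text = paragraphs[0].text_content().strip() if paragraphs else ''
    pairs.append((link.get('href'), text))

print(pairs)
```

Divs without a link never show up here because the loop only visits links, so no skip test on the div is needed.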

answered Jan 28 '13 at 22:26

Martijn Pieters

This looks solid. Would it still work if the element weren't an immediate child of div.cont? For instance, if it were contained within a wrapper element or something? – user1427661 Jan 28 '13 at 22:40

@user1427661: That's what the `.//` prefix does; search for a descendant of the current element (not just a child). – Martijn Pieters Jan 28 '13 at 22:41
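
The difference between a child step and the `.//` descendant shorthand can be checked with a small sketch. The wrapper div and its class name are made up for illustration.

```python
from lxml import html

# Hypothetical markup: the <p> sits inside an extra wrapper div.
doc = html.document_fromstring("""
<html><body>
<div class="cont">
    <div class="wrapper"><p>Nested paragraph</p></div>
</div>
</body></html>
""")

div = doc.xpath('//div[contains(@class, "cont")]')[0]

# './p' only looks at direct children of the div.
direct_children = div.xpath('./p')

# './/p' looks at descendants at any depth below the div.
descendants = div.xpath('.//p')

print(len(direct_children), [p.text for p in descendants])
```

With this markup the child query comes back empty while the descendant query still finds the paragraph inside the wrapper.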
