Select parent of specific node using xpath/python

Question

How do I get the href value of the <a> in this snippet of HTML?

I need to get it based on that class in i tag

<!--
<a href="https://link.com" target="_blank"><i class="foobar"></i>  </a>           
-->

I tried this, but I am getting no results:

foo_links = tree.xpath('//a[i/@class="foobar"]')

score 1 · Answer 1 · edited May 23 '17 at 12:02

Your code does work for me: it returns a list of <a> elements. If you want a list of hrefs rather than the elements themselves, add /@href:

hrefs = tree.xpath('//a[i/@class="foobar"]/@href')

You could also first find the <i>s, then use /parent::* (or simply /..) to get back to the <a>s.

hrefs = tree.xpath('//a/i[@class="foobar"]/../@href')
#                     ^                    ^  ^
#                     |                    |  obtain the 'href'
#                     |                    |
#                     |                    get the parent of the <i>
#                     |
#                     find all <i class="foobar"> contained in an <a>.
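
To check both expressions yourself, here is a minimal, self-contained sketch. The SAMPLE string is a made-up page (an assumption, not the OP's real document) in which the <a> is ordinary markup rather than commented out:

```python
import lxml.html

# Hypothetical sample: the <a> from the question, but NOT inside a comment.
SAMPLE = (
    '<div><a href="https://link.com" target="_blank">'
    '<i class="foobar"></i>  </a></div>'
)

tree = lxml.html.fromstring(SAMPLE)

# Variant 1: select the <a> that has a matching <i> child, then read @href.
by_predicate = tree.xpath('//a[i/@class="foobar"]/@href')

# Variant 2: select the <i> first, then step up to its parent with "..".
by_parent = tree.xpath('//a/i[@class="foobar"]/../@href')

print(by_predicate, by_parent)
```

Both variants should give the same list here. Keep in mind that @class="foobar" is an exact string comparison, so it would not match an element written as class="foobar other".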

If none of these work, you may want to verify that the structure of the document is what you expect.

Note that XPath won't peek inside comments. If the <a> is indeed inside a comment, you need to extract the comment's text and parse it as a separate document first.

import lxml.html

hrefs = [href for comment in tree.xpath('//comment()')
              # find all comments
              for href in lxml.html.fromstring(comment.text)
              # parse the content of each comment as a new HTML document
                              .xpath('//a[i/@class="foobar"]/@href')
                              # read those hrefs
]
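
A slightly more defensive version of the same idea, as a sketch. The helper name hrefs_in_comments and the PAGE string are made up for illustration; the '<' check and the try/except are there because some comments contain no markup at all, and lxml.html.fromstring raises a ParserError on an empty document:

```python
import lxml.html
from lxml import etree

# Hypothetical page: one plain-text comment, one comment holding the <a>.
PAGE = '''<html><body>
<!-- just a note, no markup -->
<!--
<a href="https://link.com" target="_blank"><i class="foobar"></i>  </a>
-->
</body></html>'''


def hrefs_in_comments(tree, xpath='//a[i/@class="foobar"]/@href'):
    """Parse each comment as its own HTML document and run `xpath` on it."""
    found = []
    for comment in tree.xpath('//comment()'):
        text = comment.text or ''
        if '<' not in text:
            continue  # no tag-like content, nothing to parse
        try:
            fragment = lxml.html.fromstring(text)
        except etree.ParserError:
            continue  # the comment did not parse into a document
        # '//' is evaluated against the newly parsed comment, not the page.
        found.extend(fragment.xpath(xpath))
    return found


tree = lxml.html.fromstring(PAGE)
print(hrefs_in_comments(tree))
```

This keeps the XPath from the question unchanged; only the document it runs against is different.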

@svasa OP said "*I need to get it based on that class in i tag*" — kennytm, Apr 13 '17 at 15:20

Andersson · Answer 2 · 2017-04-13T15:48:06.107

0

Note that the target element is inside an HTML comment. You cannot simply select the <a> with an XPath such as "//a", because inside a comment it is not a node but plain text.

Try the code below:

import re

foo_links = tree.xpath('//comment()')  # get list of all comments on page
for link in foo_links:
    if '<i class="foobar">' in link.text:
        # raw string with an escaped dot, so "." matches only a literal dot
        href = re.search(r'\w+://\w+\.\w+', link.text).group(0)  # get href value from required comment
        break

P.S. You might need a more complex regular expression to match the link URL.
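
One possible shape for such an expression, as a rough sketch rather than a robust HTML parser: instead of guessing what a URL looks like, capture whatever sits between the quotes of the href attribute. The names HREF_RE and hrefs_from_comment_text are made up for this example, and the sample URL with a path and query string is an assumption used to show that nothing gets cut off:

```python
import re

# Assumed pattern: an href attribute on an <a> tag, quoted with " or '.
# Group 1 is the quote character, group 2 is the attribute value.
HREF_RE = re.compile(
    r'''<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1''',
    re.IGNORECASE | re.DOTALL,
)


def hrefs_from_comment_text(text, marker='<i class="foobar">'):
    """Return every <a href> value in `text`, if `text` contains `marker`."""
    if marker not in text:
        return []
    return [match.group(2) for match in HREF_RE.finditer(text)]


# Hypothetical comment text, similar to the snippet in the question.
comment_text = (
    '\n<a href="https://link.com/path?q=1" target="_blank">'
    '<i class="foobar"></i>  </a>\n'
)
print(hrefs_from_comment_text(comment_text))
```

Like the loop above, this only checks that the marker appears somewhere in the comment, so a comment with several links would return all of them, not just the one wrapping the <i>.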

edited Apr 13 '17 at 15:48

answered Apr 13 '17 at 15:35

Andersson


This seems to be working the best. The comments/ – Brett Webb Apr 14 '17 at 14:59
Removed the `break` and I'm getting what I was after – Brett Webb Apr 14 '17 at 15:18
