Beautifulsoup/Python - Extract Link URL From Div, Dependent on Precluding Content

Question

I'm trying to extract a link in Python 3.4 with BeautifulSoup4, and the element has no identifying markers such as an id or class. However, before each link there is a static string of text, as follows:

<h2>
 "Precluding-Text:"
  <a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>

My end goal is to get the following output:

http://the-link-im-after.com/

score 2 · Accepted Answer · answered Jun 15 '16 at 16:16

You can use that static text to locate the link:

soup.find(text="Precluding-Text:").find_next_sibling("a")["href"]

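If the text node carries extra whitespace or other characters, the exact match returns None, so you may need a partial text match instead: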

soup.find(text=lambda text: text and "Precluding-Text:" in text).find_next_sibling("a")["href"]
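
As a minimal, self-contained sketch of the partial-match approach applied to the sample from the question: the helper name `link_after_text` is hypothetical, the `html.parser` backend is an assumption, and `string=` is the newer name (bs4 4.4.0+) for the `text=` argument used above. The helper returns None instead of raising when either lookup fails.

```python
from bs4 import BeautifulSoup

HTML = """
<h2>
 "Precluding-Text:"
  <a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>
"""


def link_after_text(soup, marker):
    """Return the href of the first <a> sibling after a text node containing marker, or None."""
    # Partial match: the text node also holds the surrounding whitespace and quotes.
    node = soup.find(string=lambda s: s and marker in s)
    if node is None:
        return None
    # Look among the following siblings of the text node for an <a> tag.
    anchor = node.find_next_sibling("a")
    if anchor is None:
        return None
    return anchor.get("href")


soup = BeautifulSoup(HTML, "html.parser")
print(link_after_text(soup, "Precluding-Text:"))  # http://the-link-im-after.com
print(link_after_text(soup, "No-Such-Text:"))     # None
```

Checking each step for None is what avoids the "NoneType" attribute error that a chained call produces when the text is not found.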

alecxe

Very clean solution. It's been a while since I last used BS; looks like it's time to RTFM on the latest version. – Peter Rowell Jun 15 '16 at 17:21
@PeterRowell bs4 is definitely worth looking at. It's a great example of a convenient, easily understandable and clean API, like "HTML parsing for humans" :) – alecxe Jun 15 '16 at 17:29
This is the approach that I viewed as being the most direct and efficient after reading the BS4 documentation, but I keep getting a NoneType Object Attribute error when attempting to further locate objects in this way. For instance: `soup.find(text="Precluding-Text:")` works fine, but with the next instruction, `.find_next_sibling("a")['href']`, I receive an error – alphazwest Jun 15 '16 at 17:34
@Frank well, it works on your sample: https://gist.github.com/alecxe/3b0a79214b25d084ec8e7f5702642729. – alecxe Jun 15 '16 at 17:36
Sorry, I think it was an issue with my approach, perhaps missing an added space in the content. The partial match solution (second one) works perfectly. Thanks! – alphazwest Jun 15 '16 at 17:40

score 0 · Answer 2 · edited May 23 '17 at 12:31

Another solution, using Python generators:

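import re

from bs4 import BeautifulSoup, Tag

html = """
<h2>
 "Precluding-Text:"
  <a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>
"""

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, "html.parser")
# A compiled pattern is applied with search(), so no surrounding ".*" is needed.
elements = soup.find_all(text=re.compile("Precluding-Text:"))
if not elements:
    print("not found")
else:
    for elem in elements:
        # next_siblings is a generator; take the first sibling after the text node.
        a_tag = next(elem.next_siblings, None)
        # Only Tag objects have attributes; skip a missing sibling or a text node.
        if isinstance(a_tag, Tag) and a_tag.get("href") is not None:
            print(a_tag.get("href"))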
