I'm trying to extract a link in Python 3.4 with BeautifulSoup4. The link has no identifying attributes such as id or class. However, each link is preceded by a static string of text, as follows:

<h2>
 "Precluding-Text:"
  <a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>

My end goal is to get the following output:

http://the-link-im-after.com/
alphazwest

2 Answers

You can use that static text to locate the link:

soup.find(text="Precluding-Text:").find_next_sibling("a")["href"]

Or, you may need a partial text match:

soup.find(text=lambda text: text and "Precluding-Text:" in text).find_next_sibling("a")["href"]
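
For reference, here is a minimal, self-contained sketch of the partial-match approach run against the sample from the question. The helper name `link_after_label` is made up for illustration, and it assumes bs4 4.4 or later, where `string=` is the current spelling of the `text=` argument. It also shows why the exact match can come back as `None`: the text node has to equal the search string character for character, including surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<h2>
 "Precluding-Text:"
  <a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>
"""


def link_after_label(markup, label):
    """Return the href of the first <a> sibling after a text node containing label.

    Hypothetical helper: returns None when the label or the link is missing.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # Partial match: the text node only has to contain the label.
    node = soup.find(string=lambda s: s and label in s)
    if node is None:
        return None
    link = node.find_next_sibling("a")
    return link.get("href") if link is not None else None


soup = BeautifulSoup(HTML, "html.parser")
# Exact match: None here, because the text node also holds whitespace and quotes.
exact = soup.find(string="Precluding-Text:")
print(exact)
print(link_after_label(HTML, "Precluding-Text:"))
```

Checking for `None` before chaining `.find_next_sibling(...)` avoids the NoneType attribute error when the text is not found.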
alecxe
  • Very clean solution. It's been a while since I last used BS; looks like it's time to RTFM on the latest version. – Peter Rowell Jun 15 '16 at 17:21
  • @PeterRowell bs4 is definitely worth looking at..it's a great example of a convenient, easily understandable and clean API..like "HTML parsing for humans" :) – alecxe Jun 15 '16 at 17:29
  • This is the approach that I viewed as being the most direct and efficient after reading the BS4 documentation, but I keep getting a NoneType object attribute error when attempting to further locate objects in this way. For instance, `soup.find(text="Precluding-Text:")` works fine, but with the next instruction, `.find_next_sibling("a")['href']`, I receive an error. – alphazwest Jun 15 '16 at 17:34
  • @Frank well, it works on your sample: https://gist.github.com/alecxe/3b0a79214b25d084ec8e7f5702642729. – alecxe Jun 15 '16 at 17:36
  • Sorry, I think it was an issue with my approach, perhaps missing an added space in the content. The partial match solution (second one) works perfectly. Thanks! – alphazwest Jun 15 '16 at 17:40
Another solution, using Python generators:

from bs4 import BeautifulSoup as soup
import re

html = """
<h2>
 "Precluding-Text:"
  <a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>
"""

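# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
s = soup(html, "html.parser")
# A compiled pattern is matched with search(), so no surrounding '.*' is needed.
elements = s.find_all(text=re.compile('Precluding-Text:'))
if not elements:
    print("not found")
else:
    for elem in elements:
        # next_siblings is a generator; this assumes the <a> is the first sibling.
        a_tag = next(elem.next_siblings, None)
        if a_tag is not None and a_tag.get('href') is not None:
            print(a_tag.get('href'))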
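
If the link is not guaranteed to be the very first sibling, the generator can be walked until an `<a>` tag turns up. The following is only a sketch under that assumption: the `<em>` element in the sample markup and the function name `hrefs_after` are invented for illustration and are not part of the question.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical variant of the question's markup with an extra element
# sitting between the static text and the link.
HTML = """
<h2>
 Precluding-Text: <em>new</em>
  <a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>
"""


def hrefs_after(markup, pattern):
    """Yield the href of the first <a> sibling after each matching text node."""
    soup = BeautifulSoup(markup, "html.parser")
    for text_node in soup.find_all(string=re.compile(pattern)):
        # next_siblings yields both text nodes and tags, so filter on Tag.
        for sibling in text_node.next_siblings:
            if isinstance(sibling, Tag) and sibling.name == "a" and sibling.has_attr("href"):
                yield sibling["href"]
                break  # stop at the first link for this text node


for href in hrefs_after(HTML, "Precluding-Text:"):
    print(href)
```

Because `hrefs_after` is itself a generator, nothing is parsed until it is iterated, and wrapping it in `list(...)` collects every link at once.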
Christos Papoulas