23

I have the following:

  html =
  '''<div class=“file-one”>
    <a href=“/file-one/additional” class=“file-link">
      <h3 class=“file-name”>File One</h3>
    </a>
    <div class=“location”>
      Down
    </div>
  </div>'''

I would like to get just the value of the href attribute, which is /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = “”

for a in soup.find_all(‘a’, href=True, text=True):
    link_text = a[‘href’]

print “Link: “ + link_text

But it prints nothing after the label, just Link:. I then tested the same code on another site with different HTML, and it worked.

What could I be doing wrong? Or is it possible that the site is intentionally programmed not to return the href?

Thank you in advance and will be sure to upvote/accept answer!

5 Answers

43

The 'a' tag in your html does not have any text directly; it contains an 'h3' tag that has the text. This means the tag's .string is None, so the text=True filter does not match and .find_all() fails to select the tag. As a general rule, do not use the text parameter when a tag contains other html elements besides plain text content.
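A quick way to check this is the minimal sketch below. It assumes a trimmed copy of the question's HTML with straight quotes, and the variable names are only illustrative. It uses `string=True`, which is the current name for the older `text=True` argument.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's HTML, with straight quotes (assumption).
html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

# The <a> tag has three children (whitespace, <h3>, whitespace),
# so its .string is None.
string_value = a.string

# The string/text filter is checked against .string, so nothing is selected.
with_text_filter = soup.find_all('a', href=True, string=True)

# Without that filter the tag is selected.
without_text_filter = soup.find_all('a', href=True)

print(string_value, with_text_filter, without_text_filter)
```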

You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you could pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)

If you want to collect all links whether they have text or not, just select all 'a' tags that have a 'href' attribute. Anchor tags usually have links but that's not a requirement, so I think it's best to use the href argument.

Using .find_all().

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with CSS selectors.

links = [a['href'] for a in soup.select('a[href]')]
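One caveat worth checking, shown as a hedged sketch on made-up HTML (the links and variable names below are hypothetical): `a.text` is also truthy when the tag contains only whitespace, so `get_text(strip=True)` may be a stricter test if whitespace-only links should be skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
html = '''
<a href="/has-text">Read more</a>
<a href="/whitespace-only">   </a>
<a href="/empty"></a>
<a name="no-href">Anchor without a link</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Truthiness of .text: a whitespace-only string still counts as text.
truthy_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

# get_text(strip=True) strips whitespace first, so blank links are dropped.
stripped_text = [a['href'] for a in soup.find_all('a', href=True)
                 if a.get_text(strip=True)]

# All links, regardless of text; the anchor without href is excluded.
all_links = [a['href'] for a in soup.select('a[href]')]

print(truthy_text, stripped_text, all_links)
```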
t.m.adam
  • Thought to inform you about a question that I find trouble figuring out myself. I'll be very glad if you give [this post](https://stackoverflow.com/questions/59594692/unable-to-use-https-proxy-within-urllib-request) a go. Thanks. – MITHU Jan 05 '20 at 09:54
  • And if I have to print each of these links, how do I do that? It's a list, right, not a str? – Linces Marques Nov 30 '22 at 17:07
  • 1
    @LincesMarques Why don't you use a for loop? `for link in links: print(link)` – t.m.adam Dec 01 '22 at 09:58
6

You can also use attrs to get the href attribute, selecting the tag with a regex search:

import re

soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
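As a hedged illustration on made-up links (the HTML and variable names are assumptions): Beautiful Soup applies a compiled pattern with its search() method, so an unanchored pattern like the one above can also match absolute URLs. Anchoring with `^` is one way to keep only site-relative paths, and checking for None avoids an error when nothing matches.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
html = '''
<a href="/file-one/additional">File One</a>
<a href="https://example.com/">External</a>
<a href="#top">Top</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Unanchored pattern from the answer: also matches "/example" inside the URL.
unanchored = re.compile(r'[/]([a-z]|[A-Z])\w+')
unanchored_hits = [a['href'] for a in soup.find_all('a', href=unanchored)]

# Anchored variant (an assumption about what is wanted): paths starting with "/".
anchored = re.compile(r'^/[A-Za-z]\w*')
first = soup.find('a', href=anchored)
first_href = first['href'] if first is not None else None
site_relative = [a['href'] for a in soup.find_all('a', href=anchored)]

print(unanchored_hits, first_href, site_relative)
```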
Rakshit Vats
  • 1
    Do you know why calling directly `.href` does not work, but `.attrs['href']` works fine? I just spent 15 min to debug this :( – Jean Monet Dec 18 '20 at 22:23
5
  1. First of all, use a text editor that doesn't insert curly quotes; Python only accepts straight quotes (' or ") as string delimiters.

  2. Second, remove the text=True argument from the soup.find_all() call.
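Applied to the question's code, the two steps might look like the sketch below. It assumes Python 3, so it uses the print() function rather than the print statement from the question.

```python
from bs4 import BeautifulSoup

# The question's HTML, retyped with straight quotes.
html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">
    Down
  </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

link_text = ""

# text=True removed: select every <a> tag that has an href attribute.
for a in soup.find_all('a', href=True):
    link_text = a['href']

print("Link: " + link_text)
```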

whackamadoodle3000
3

You could solve this with just a couple lines of gazpacho:


from gazpacho import Soup

html = """\
<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>
"""

soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']

Which would output:

'/file-one/additional'
emehex
-1

A bit late to the party but I had the same issue recently scraping some recipes and got mine printing clean by doing this:

from bs4 import BeautifulSoup
import requests

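source = requests.get('url for website')
# Pass the response body (a string) to BeautifulSoup, not the Response object
soup = BeautifulSoup(source.text, 'lxml')

for article in soup.find_all('article'):
    # First <a> with an href attribute inside each <article>
    link = article.find('a', href=True)['href']
    print(link)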
  • 1
    Please question yourself - How does this answer fundamentally differ from the previous and very extensive answers, what is the added value. In addition, please also check your approach for proper functioning. It is not about collecting badges, but about helping others with good answers. – HedgeHog Jun 08 '22 at 06:51
  • 1
    not here for the badges or reputation, was just having the same problem and posted the solution that finally worked for me despite the various other posts. – Joseph Williams Jun 08 '22 at 21:23