
I am trying to crawl HTML source with Python using BeautifulSoup.
I need to get the href of specific link <a> tags.

This is my test HTML. I want to get the href of every link of the form <a href="/example/test/link/activityN" target="_blank">, for activity1 through activity10:

<div class="listArea">
   <div class="activity_sticky" id="activity">
   .
   .
   </div>
   <div class="activity_content activity_loaded">
      <div class="activity-list-item activity_item__1fhpg">
         <div class="activity-list-item_activity__3FmEX">
            <div>...</div>
            <a href="/example/test/link/activity1" target="_blank">
               <div class="activity-list-item_addr">
                  <span> 0x1292311</span>
               </div>
            </a>
         </div>
      </div>
      <div class="activity-list-item activity_item__1fhpg">
         <div class="activity-list-item_activity__3FmEX">
            <div>...</div>
            <a href="/example/test/link/activity2" target="_blank">
               <div class="activity-list-item_addr">
                  <span> 0x1292312</span>
               </div>
            </a>
         </div>
      </div>
      .
      .
      .
   </div>
</div>

Jan Wilamowski
taranndus
  • What have you tried? The BeautifulSoup documentation has many examples, including this situation. – Tim Roberts Feb 21 '22 at 05:57
  • Does this answer your question? [retrieve links from web page using python and BeautifulSoup](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) – nathan liang Feb 21 '22 at 06:02

4 Answers


Check the main page of the bs4 documentation:

for link in soup.find_all('a'):
    print(link.get('href'))
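
As a self-contained illustration, the loop above can be run against a trimmed copy of the markup from the question. This is only a sketch: the third anchor (the one without an href) is not in the original HTML and is added here as an assumption, to show what `href=True` filters out.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup. The last <a> has no href and is
# a made-up addition for illustration only.
html = """
<div class="activity_content activity_loaded">
  <a href="/example/test/link/activity1" target="_blank"><span>0x1292311</span></a>
  <a href="/example/test/link/activity2" target="_blank"><span>0x1292312</span></a>
  <a name="no-link">anchor without href</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute,
# so link["href"] cannot fail on the anchor that lacks one.
hrefs = [link["href"] for link in soup.find_all("a", href=True)]
print(hrefs)
```

Without `href=True`, `link.get('href')` would simply return None for the third anchor, which may or may not be what you want in the result list.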
nathan liang

Here is code for the problem. First find all the <a> tags, then read the value of href from each one.

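from bs4 import BeautifulSoup

# html is assumed to hold the page source shown in the question
soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('a'):
    # .get() returns None instead of raising KeyError when <a> has no target
    if i.get('target') == "_blank":
        print(i.get('href'))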
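
One detail worth knowing when filtering on an attribute: subscripting a tag (`tag['target']`) raises KeyError if the attribute is missing, whereas `tag.get('target')` returns None. The sketch below shows the difference on two made-up anchors; the `/other/page` link is hypothetical and does not come from the question.

```python
from bs4 import BeautifulSoup

# Two illustrative anchors: only the first one has a target attribute.
html = """
<a href="/example/test/link/activity1" target="_blank">first</a>
<a href="/other/page">no target attribute</a>
"""
soup = BeautifulSoup(html, "html.parser")

blank_links = []
for a in soup.find_all("a"):
    # a["target"] would raise KeyError on the second tag;
    # a.get("target") returns None there, so the comparison is just False.
    if a.get("target") == "_blank":
        blank_links.append(a.get("href"))

print(blank_links)
```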

Hope my answer could help you.

Mason Ma

Select the <a> tags more specifically. As an alternative to @Mason Ma's answer, you can also use CSS selectors:

soup.select('.activity_content a')

or filter by the target attribute:

soup.select('.activity_content a[target="_blank"]')

Example

This gives you a list of links matching your condition:

from bs4 import BeautifulSoup

html = '''
<div class="activity_content activity_loaded">
      <div class="activity-list-item activity_item__1fhpg">
         <div class="activity-list-item_activity__3FmEX">
            <div>...</div>
            <a href="/example/test/link/activity1" target="_blank">
               <div class="activity-list-item_addr">
                  <span> 0x1292311</span>
               </div>
            </a>
         </div>
      </div>
      <div class="activity-list-item activity_item__1fhpg">
         <div class="activity-list-item_activity__3FmEX">
            <div>...</div>
            <a href="/example/test/link/activity2" target="_blank">
               <div class="activity-list-item_addr">
                  <span> 0x1292312</span>
               </div>
            </a>
         </div>
      </div>
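</div>
'''
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

print([x['href'] for x in soup.select('.activity_content a[target="_blank"]')])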

Output

['/example/test/link/activity1', '/example/test/link/activity2']
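
The hrefs in this output are site-relative paths. If the crawler needs to follow them, they can be turned into absolute URLs with the standard library's `urllib.parse.urljoin`. This is a minimal sketch: `base_url` is a hypothetical placeholder, since the question does not say which site the page came from.

```python
from urllib.parse import urljoin

# Hypothetical base URL; replace it with the address of the page you crawled.
base_url = "https://example.com/some/page"

# The relative links extracted above.
hrefs = ['/example/test/link/activity1', '/example/test/link/activity2']

# A leading "/" makes each href relative to the site root, so urljoin keeps
# the scheme and host of base_url and replaces its path.
absolute = [urljoin(base_url, href) for href in hrefs]
print(absolute)
```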
HedgeHog

Based on my understanding of your question, you're trying to extract the links (href) from anchor tags whose target value is _blank. You can do this in a single call by asking find_all for anchor tags and passing the target attribute as a filter:

links = soup.find_all('a', attrs={'target': '_blank'})
for link in links:
    print(link.get('href'))