I'm looking for a way to exclude image links, i.e. links that do not contain any anchor text. The code below collects the data I want, but it also picks up unwanted URLs from thumbnail/image links on the pages:

for url in list_urls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source,"html.parser")
    for line in soup.find_all('a'):
        href = line.get('href')
        links_with_text.append([url, href])

The image links on the scraped pages all have the same format, and they all sit inside a div with the class "related-content":

<a href="https://XXXX/"    ><picture class="crp_thumb crp_featured" title="XXXX">
<source type="image/webp" srcset="https://XXXX.jpg.webp"/>
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture></a>

2 Answers

Here are a few examples you can use:

  1. select <a> tags that contain some text
  2. select <a> tags that don't contain <img> tags
  3. select <a> tags that contain some text and no <img> tags

txt = '''
<a href="https://XXXX/">
<picture class="crp_thumb crp_featured" title="XXXX">
<source type="image/webp" srcset="https://XXXX.jpg.webp"/>
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture>
</a>

<a href="https://XXX">OK LINK</a>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

# select <a> tags that contain some text (links without anchor text are skipped)
for a in soup.find_all(lambda t: t.name == 'a' and t.get_text(strip=True) != ''):
    print(a)

# select <a> tags that don't contain <img> tags
for a in soup.select('a:not(:has(img))'):
    print(a)

# select <a> tags that contain some text and no <img> tags
for a in soup.find_all(lambda t: t.name == 'a' and t.get_text(strip=True) != '' and not t.find('img')):
    print(a)
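
To tie this back to the scraping loop in the question, here is a minimal sketch of the text filter wrapped in a helper. The function name `collect_text_links`, the `skip_div_class` parameter and the `example.com` URLs are made up for illustration. The optional ancestor check assumes the thumbnails really do sit inside a div with the class "related-content", as the question describes.

```python
from bs4 import BeautifulSoup


def collect_text_links(html, page_url, skip_div_class=None):
    """Return [page_url, href] pairs for <a> tags that have anchor text.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for a in soup.find_all("a", href=True):
        # Image-only links have no text once whitespace is stripped.
        if not a.get_text(strip=True):
            continue
        # Optionally drop every link nested inside a div with the given class.
        if skip_div_class and a.find_parent("div", class_=skip_div_class):
            continue
        pairs.append([page_url, a["href"]])
    return pairs


# Made-up sample markup modelled on the snippet in the question.
sample = '''
<div class="related-content">
<a href="https://example.com/thumb/"><picture><img src="https://example.com/t.jpg" alt="t"/></picture></a>
<a href="https://example.com/related/">Related post</a>
</div>
<a href="https://example.com/ok/">OK LINK</a>
'''

# Text filter only: the thumbnail link is dropped, the text links stay.
print(collect_text_links(sample, "https://example.com/page"))
# With the ancestor check: everything under "related-content" is dropped too.
print(collect_text_links(sample, "https://example.com/page", skip_div_class="related-content"))
```

Inside the original loop this would probably replace the inner `for` with something like `links_with_text.extend(collect_text_links(browser.page_source, url))`.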
Andrej Kesely

Here is a solution using SimplifiedDoc (from the simplified_scrapy package):

from simplified_scrapy.simplified_doc import SimplifiedDoc

html = '''<a href="https://XXXX/"    ><picture class="crp_thumb crp_featured" title="XXXX">
<source type="image/webp" srcset="https://XXXX.jpg.webp"/>
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture></a>'''
doc = SimplifiedDoc(html)

# Collect every <a>, <img> and <source> element and print their URLs
lstA = doc.getElementsByTag('a')
lstImg = doc.getElementsByTag('img')
lstSource = doc.getElementsByTag('source')
print([a.href for a in lstA])
print([img.src for img in lstImg])
print([source.srcset for source in lstSource])

# Keep only the <a> elements whose markup does not contain a <picture> tag
lstA = doc.getElementsByTag('a').notContains('<picture')
print([a.href for a in lstA])

Result:

['https://XXXX/']
['https://XXXX.jpg']
['https://XXXX.jpg.webp']
[]
dabingsou