
I am learning how to use BeautifulSoup, and I have run into an issue with double printing in a loop I have written.

Any insight would be greatly appreciated!

from bs4 import BeautifulSoup
import requests
import re


page = 'https://news.google.com/news/headlines?gl=US&ned=us&hl=en'                 #main page

#url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(page)                              #requests html document
data = r.text                                       #set data = to html text
soup = BeautifulSoup(data, "html.parser")           #parse data with BS


for link in soup.find_all('a'):
    #if contains /news/
    if ('/news/' in link.get('href')):
        print(link.get('href'))

Example:

count = 0                                           #counter for printed urls
for link in soup.find_all('a'):
    #if contains cointelegraph/news/
    #if ('https://cointelegraph.com/news/' in link.get('href')):
    url = link.get('href')                          #local var store url
    if url and '/news/' in url:                     #guard against anchors with no href
        print(url)
        print(count)
        count += 1
    if count == 5:                                  #stop after five matches
        break

output:

https://cointelegraph.com/news/woman-in-denmark-imprisoned-for-hiring-hitman-using-bitcoin
0
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost
1
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost
2
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration
3
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration
4

For some reason my code keeps printing out the same url twice...

1 Answer


Based on your code and the provided link, there appear to be duplicates in the results of the BeautifulSoup find_all search. The HTML structure needs to be checked to see why duplicates are returned (check the find_all search options in the documentation for ways to filter them; a sketch of that approach follows after the quick fix). But if you want to simply remove the duplicates from the printed results, you can use the modified loop below, which keeps a set of entries already seen.

In [78]: l = [link.get('href') for link in soup.find_all('a') if link.get('href') and '/news/' in link.get('href')]

In [79]: any(l.count(x) > 1 for x in l)                                                                                                              
Out[79]: True

The above output shows that duplicates exist in the list. To remove them, use something like:

seen = set()                                        #hrefs printed so far

for link in soup.find_all('a'):
    lhref = link.get('href')
    if lhref and '/news/' in lhref and lhref not in seen:
        print(lhref)                                #print each url only once
        seen.add(lhref)
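
As for narrowing the find_all search itself, below is a minimal sketch of the filtering approach mentioned above, assuming the same Google News page from the question. Passing href=re.compile('/news/') makes find_all return only anchors whose href matches the pattern (anchors without an href are skipped automatically), but it does not remove duplicates on its own, so the seen set is still needed:

import re

import requests
from bs4 import BeautifulSoup

page = 'https://news.google.com/news/headlines?gl=US&ned=us&hl=en'
soup = BeautifulSoup(requests.get(page).text, "html.parser")

seen = set()                                                    #hrefs printed so far
for link in soup.find_all('a', href=re.compile('/news/')):      #only matching anchors
    lhref = link.get('href')
    if lhref not in seen:
        print(lhref)
        seen.add(lhref)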
SigmaPiEpsilon