
I am writing a script to extract only the hyperlinks from a webpage. This is what I have so far:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('http://www.soc.napier.ac.uk/~40009856/CW/').read()

soup = bs.BeautifulSoup(source, 'lxml')

# for paragraph in soup.find_all('p'):
#     print(paragraph.string)

for url in soup.find_all('a'):
    print(url.get('href'))

I want only hyperlinks to other webpages, not links to PDFs or email addresses, which currently show up in the output as well.

How do I specify that only page hyperlinks are returned?
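
One possible approach, sketched below, is to test each href before printing it. This assumes the unwanted entries are mailto: addresses and links ending in .pdf (the exact extensions to exclude are an assumption) and that every other href should be kept:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('http://www.soc.napier.ac.uk/~40009856/CW/').read()
soup = bs.BeautifulSoup(source, 'lxml')

for url in soup.find_all('a'):
    href = url.get('href')
    if href is None:
        continue  # <a> tag without an href attribute
    if href.startswith('mailto:'):
        continue  # skip email addresses
    if href.lower().endswith('.pdf'):
        continue  # skip direct links to PDF files
    print(href)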

  • What hinders you from analyzing the scraped href? If something ends with .pdf you don't want it; if it starts with file:// you don't want it; if it ends with / or .html you probably want it. – Patrick Artner Nov 30 '17 at 17:59
  • Possible duplicate of [retrieve links from web page using python and BeautifulSoup](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) – gionni Nov 30 '17 at 17:59
  • [mimetypes](https://stackoverflow.com/questions/21515098/how-to-check-the-url-is-either-web-page-link-or-file-link-in-python) might help you. – FatihAkici Dec 30 '17 at 04:41
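
Following the mimetypes suggestion above, here is a rough sketch of a helper that keeps only page-like links. Treating extension-less URLs (for example ones ending in /) as pages, and handling mailto: separately, are assumptions, since mimetypes cannot classify email addresses:

import mimetypes

def looks_like_page(href):
    # mimetypes cannot recognise email addresses, so filter mailto: first.
    if href is None or href.startswith('mailto:'):
        return False
    guessed_type, _ = mimetypes.guess_type(href)
    if guessed_type is None:
        return True  # no recognisable extension (e.g. ends in /): assume it is a page
    return guessed_type in ('text/html', 'application/xhtml+xml')

This helper could then replace the inline startswith/endswith checks in the loop above, printing href only when looks_like_page(href) is true.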

0 Answers