
I am writing a script to extract only the hyperlinks from a webpage. This is what I have so far:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('http://www.soc.napier.ac.uk/~40009856/CW/').read()

soup = bs.BeautifulSoup(source, 'lxml')

# for paragraph in soup.find_all('p'):
#     print(paragraph.string)

for url in soup.find_all('a'):
    print(url.get('href'))

I want only hyperlinks to other webpages, not links to PDFs or email addresses, which currently show up in the output as well.

How do I specify that only page hyperlinks are returned?
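
One possible approach, sketched below, is to test each href before printing it. This assumes the unwanted entries are mailto: addresses and links ending in .pdf (the exact extensions to exclude are an assumption) and that every other href should be kept:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('http://www.soc.napier.ac.uk/~40009856/CW/').read()
soup = bs.BeautifulSoup(source, 'lxml')

for url in soup.find_all('a'):
    href = url.get('href')
    if href is None:
        continue  # <a> tag without an href attribute
    if href.startswith('mailto:'):
        continue  # skip email addresses
    if href.lower().endswith('.pdf'):
        continue  # skip direct links to PDF files
    print(href)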

  • What hinders you from analyzing the scraped href? If something ends with .pdf you don't want it; if it starts with file:// you don't want it; if it ends with / or .html you probably want it. – Patrick Artner Nov 30 '17 at 17:59
  • Possible duplicate of [retrieve links from web page using python and BeautifulSoup](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) – gionni Nov 30 '17 at 17:59
  • [mimetypes](https://stackoverflow.com/questions/21515098/how-to-check-the-url-is-either-web-page-link-or-file-link-in-python) might help you. – FatihAkici Dec 30 '17 at 04:41
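
Following the mimetypes suggestion above, here is a rough sketch of a helper that keeps only page-like links. Treating extension-less URLs (for example ones ending in /) as pages, and handling mailto: separately, are assumptions, since mimetypes cannot classify email addresses:

import mimetypes

def looks_like_page(href):
    # mimetypes cannot recognise email addresses, so filter mailto: first.
    if href is None or href.startswith('mailto:'):
        return False
    guessed_type, _ = mimetypes.guess_type(href)
    if guessed_type is None:
        return True  # no recognisable extension (e.g. ends in /): assume it is a page
    return guessed_type in ('text/html', 'application/xhtml+xml')

This helper could then replace the inline startswith/endswith checks in the loop above, printing href only when looks_like_page(href) is true.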

0 Answers