
So I've downloaded the HTML of a web page, and I'm supposed to extract all of the links from the HTML and output them. Here is my code:

f = open('html.py','r')
heb = f.readlines()
arry = []
if 'href' in heb:
    arry = arry.append(href)

    print(arry)

I'm trying to make a list of the links and output it, but honestly I'm pretty lost. Can someone point me in the right direction? I was thinking regex is probably the way to go. Thanks!

Will Da Silva
Jake Baldwin
    Don't use regex on HTML! Use an HTML parser like BeautifulSoup. – kevinsa5 Jun 20 '17 at 01:39
    Possible duplicate of [retrieve links from web page using python and BeautifulSoup](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) – Teemu Risikko Jun 20 '17 at 05:57

1 Answer


You can use Beautiful Soup (which you'll need to install, e.g. with pip install beautifulsoup4):

import bs4

# Parse the saved HTML file (passing an explicit parser avoids a
# "no parser was explicitly specified" warning)
with open("my-file.html") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

# Collect the href of every <a> tag that actually has one
links = [link['href'] for link in soup('a') if 'href' in link.attrs]
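For example, here is a self-contained sketch of the same idea that parses a small inline HTML string (standing in for your downloaded page) instead of a file; the filter skips anchor tags that have no href attribute:

```python
import bs4

# A small inline HTML sample standing in for the downloaded page
html = '<a href="https://example.com">one</a> <a name="no-link">two</a>'
soup = bs4.BeautifulSoup(html, "html.parser")

# soup('a') is shorthand for soup.find_all('a'); anchors without an
# href attribute are excluded by the 'href' in link.attrs check
links = [link['href'] for link in soup('a') if 'href' in link.attrs]
print(links)  # → ['https://example.com']
```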
icktoofay