
Write a function that opens a web page and returns a dictionary of all the links and their text on that page. A link is defined by an HTML tag that looks like

<a href="http://my.computer.com/some/file.html">link text</a>

The link is everything in quotes after the href=, and the text is everything between the > and the </a>. For the example above, the entry in the dictionary would look like:

"{"http:// my.computer.com/some/file.html" : " link text ", ...}"

Here's my code so far; I've been stuck on it for a few hours. How do I finish it?

import urllib.request


def Urls(webpage):
    url = webpage
    page = urllib.request.urlopen(url)
    url_list = {}
    for line in page:
        if '<a href=' in line:
            # stuck here: how do I pull out the URL and the link text?
Mahesh Khond

4 Answers


While the answers suggesting regular expressions might work for this problem, they'll fail (unless you take countermeasures) when, for example, a link is split over several lines. This is perfectly valid HTML:

<a
href="../path">link</a>

There are other edge cases to consider as well. In general, HTML cannot be parsed with regular expressions, and there's some excellent prose written about that. By the way, the construction '<a href=' in line is itself a less powerful form of regular expression: it just searches for a fixed substring within a single line, with the same drawbacks.
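
For instance, here is a minimal sketch (a hypothetical snippet, not from the question) showing the line-based check missing exactly that split anchor:

html = '<a\nhref="../path">link</a>'
for line in html.splitlines():
    print('<a href=' in line)  # prints False twice: the tag name and the href are on different lines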

Instead, you should look into libraries that actually parse the HTML document structure. In Python, the go-to library is Beautiful Soup (bs4). With it, you can quickly get all the links on a webpage, e.g. like this:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.imdb.com/"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
soup.find_all('a')  # returns all anchor tags as a list
# some anchors have no href attribute; href=True keeps only those that do
links = [a['href'] for a in soup.find_all('a', href=True)]
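
To build the dictionary the question actually asks for, you can pair each href with its anchor text. A minimal sketch, assuming the soup object from above:

url_list = {a['href']: a.get_text(strip=True) for a in soup.find_all('a', href=True)}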

The Beautiful Soup documentation is excellent, with tons of examples. Well worth a read.

Oliver W.
import urllib.request

def Urls(webpage):
    page = urllib.request.urlopen(webpage)
    url_list = {}
    for line in page:
        # urlopen yields bytes in Python 3, so decode before substring tests
        line = line.decode('utf-8', errors='ignore')
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('</a>')[0]
                url_list[url] = txt
            except IndexError:
                pass
    return url_list
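
A quick usage sketch (the IMDb URL is borrowed from the answer above, purely illustrative):

print(Urls('http://www.imdb.com/'))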
TheLazyScripter
import re

r = re.compile(r'<\s*a\s*href="(.*?)">(.*?)<\s*/a\s*>')
matches = r.findall(line)  # `line` comes from the loop in the question
for url, text in matches:
    url_list[url] = text
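
Wrapped into a complete function, a minimal sketch (the function name and the whole-page re.DOTALL scan are my additions, so anchors split across lines still match):

import re
import urllib.request

def urls_via_regex(webpage):
    html = urllib.request.urlopen(webpage).read().decode('utf-8', errors='ignore')
    pattern = re.compile(r'<\s*a\s*href="(.*?)"\s*>(.*?)<\s*/a\s*>', re.DOTALL)
    return dict(pattern.findall(html))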
Vikas Madhusudana
  • Please don't recommend regular expressions to parse HTML. There are stable libraries for this task. –  Jun 01 '16 at 08:29

Using requests and SoupStrainer for simplicity/efficiency:

import requests
from bs4 import BeautifulSoup, SoupStrainer

def get_urls(webpage):
    res = requests.get(webpage)
    # parse only the <a> tags; note that bs4 spells the keyword parse_only
    soup = BeautifulSoup(res.text, 'html.parser', parse_only=SoupStrainer('a'))
    return [a for a in soup.find_all('a') if a.has_attr('href')]
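
A quick usage sketch (the IMDb URL is borrowed from the earlier answer; to get the question's dictionary, map each tag's href to its text):

anchors = get_urls('http://www.imdb.com/')
url_list = {a['href']: a.get_text(strip=True) for a in anchors}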
Jamie Bull