How to extract URLs from an HTML page in Python

Question

I have to write a web crawler in Python. I don't know how to parse a page and extract the URLs from HTML. Where should I go and study to write such a program?

In other words, is there a simple python program which can be used as a template for a generic web crawler? Ideally it should use modules which are relatively simple to use and it should include plenty of comments to describe what each line of code is doing.

score 22 · Answer 1 · edited Apr 01 '19 at 02:46

Look at example code below. The script extracts html code of a web page (here Python home page) and extracts all the links in that page. Hope this helps.

#!/usr/bin/env python

import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
response = requests.get(url)
# parse html
page = str(BeautifulSoup(response.content))


def getURL(page):
    """

    :param page: html of web page (here: Python home page) 
    :return: urls in that page 
    """
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print(url)
    else:
        break

Output:

/
#left-hand-navigation
#content-body
/search
/about/
/news/
/doc/
/download/
/getit/
/community/
/psf/
http://docs.python.org/devguide/
/about/help/
http://pypi.python.org/pypi
/download/releases/2.7.3/
http://docs.python.org/2/
/ftp/python/2.7.3/python-2.7.3.msi
/ftp/python/2.7.3/Python-2.7.3.tar.bz2
/download/releases/3.3.0/
http://docs.python.org/3/
/ftp/python/3.3.0/python-3.3.0.msi
/ftp/python/3.3.0/Python-3.3.0.tar.bz2
/community/jobs/
/community/merchandise/
/psf/donations/
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/Languages
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.psfmember.org

...

This answer will only locate URLs tagged with . We frequently encounter plain text URLs in our email scraping applications and needed a solution that found both tagged and plaintext URLs. — Doug Bower, Aug 23 '21 at 19:07

pradyunsg · Answer 2 · 2015-05-10T06:25:31.857

19

You can use BeautifulSoup as many have also stated. It can parse HTML,XML etc. To see some of it's features, see here.

Example:

import urllib2
from bs4 import BeautifulSoup
url = 'http://www.google.co.in/'

conn = urllib2.urlopen(url)
html = conn.read()

soup = BeautifulSoup(html)
links = soup.find_all('a')

for tag in links:
    link = tag.get('href',None)
    if link is not None:
        print link

edited May 10 '15 at 06:25

answered Mar 20 '13 at 08:08

pradyunsg

18,287
11
43
96

score 6 · Answer 3 · edited Oct 01 '14 at 14:59

import sys
import re
import urllib2
import urlparse
tocrawl = set(["http://www.facebook.com/"])
crawled = set([])
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

while 1:
    try:
        crawling = tocrawl.pop()
        print crawling
    except KeyError:
        raise StopIteration
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except:
        continue
    msg = response.read()
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos+7)
        if endPos != -1:
            title = msg[startPos+7:endPos]
            print title
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0]
        keywordlist = keywordlist.split(", ")
        print keywordlist
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in (links.pop(0) for _ in xrange(len(links))):
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)

Referenced to: Python Web Crawler in Less Than 50 Lines (Slow or no longer works, does not load for me)

score 5 · Answer 4 · answered Mar 20 '13 at 07:28

You can use beautifulsoup. Follow the documentation and see what matches your requirements. The documentation contains code snippets for how to extract URL's as well.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

soup.find_all('a') # Finds all hrefs from the html doc.

score 3 · Answer 5 · edited May 23 '17 at 11:46

3

With parsing pages, check out the BeautifulSoup module. It's simple to use and allows you to parse pages with HTML. You can extract URLs from the HTML simply by doing str.find('a')

Don't use regular expressions for parsing HTML

edited May 23 '17 at 11:46

Community

1
1

answered Mar 20 '13 at 07:27

TerryA

58,805
11
114
143

How to extract URLs from an HTML page in Python

5 Answers5

Linked