18

I have to write a web crawler in Python. I don't know how to parse a page and extract the URLs from HTML. Where should I go and study to write such a program?

In other words, is there a simple Python program that can be used as a template for a generic web crawler? Ideally it should use modules that are relatively simple to use, and it should include plenty of comments describing what each line of code does.

SiHa
user2189704

5 Answers

22

Look at the example code below. The script fetches the HTML of a web page (here, the Python home page) and extracts all the links in that page. Hope this helps.

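#!/usr/bin/env python

import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
response = requests.get(url)
# Parse the HTML, then serialize it back to a string so it can be scanned.
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
page = str(BeautifulSoup(response.content, "html.parser"))


def getURL(page):
    """
    Find the first link in the page.

    :param page: html of web page (here: Python home page)
    :return: the first url in that page and the index where it ends,
             or (None, 0) if no link is left
    """
    # Note: this only matches tags where href is the first attribute of <a>
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

while True:
    url, n = getURL(page)
    # Stop only when no link is left; an empty href must not end the loop early
    if url is None:
        break
    # Continue scanning after the link that was just found
    page = page[n:]
    if url:
        print(url)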

Output:

/
#left-hand-navigation
#content-body
/search
/about/
/news/
/doc/
/download/
/getit/
/community/
/psf/
http://docs.python.org/devguide/
/about/help/
http://pypi.python.org/pypi
/download/releases/2.7.3/
http://docs.python.org/2/
/ftp/python/2.7.3/python-2.7.3.msi
/ftp/python/2.7.3/Python-2.7.3.tar.bz2
/download/releases/3.3.0/
http://docs.python.org/3/
/ftp/python/3.3.0/python-3.3.0.msi
/ftp/python/3.3.0/Python-3.3.0.tar.bz2
/community/jobs/
/community/merchandise/
/psf/donations/
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/Languages
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.psfmember.org

...
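
Many of the links in the output above are relative (/about/, #content-body), so a crawler has to turn them into absolute URLs before it can fetch them. Here is a minimal sketch of that step, assuming the standard library's urllib.parse.urljoin; the helper name absolutize is hypothetical and not part of the answer above.

```python
from urllib.parse import urljoin


def absolutize(base_url, links):
    """Resolve each link against the URL of the page it was found on."""
    return [urljoin(base_url, link) for link in links]


# A few links taken from the output above
found = ["/about/", "#content-body", "http://docs.python.org/3/"]
for absolute in absolutize("http://www.python.org/", found):
    print(absolute)
```

Links that are already absolute pass through unchanged, so the helper can be applied to every extracted link without checking its form first.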

Halee
Shankar
19

You can use BeautifulSoup, as many others have stated. It can parse HTML, XML, and similar formats. To see some of its features, see here.

Example:

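from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup
url = 'http://www.google.co.in/'

# Download the page
conn = urlopen(url)
html = conn.read()

# Parse it and collect every <a> tag
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')

for tag in links:
    # tag.get returns None when the tag has no href attribute
    link = tag.get('href', None)
    if link is not None:
        print(link)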
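
If installing BeautifulSoup is not an option, the same extraction can be sketched with the standard library's html.parser module instead. This is a swapped-in technique, not the approach of the answer above, and the names LinkParser and extract_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in the given HTML string."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


sample = (
    '<p><a href="/about/">About</a> '
    '<a class="x" href="http://example.com/">External</a> '
    '<a name="top">no href here</a></p>'
)
print(extract_links(sample))
```

The sample string stands in for a downloaded page, so the sketch runs without network access.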
pradyunsg
6
import re
import urllib.request  # Python 3 replacement for urllib2
from urllib.parse import urljoin

# URLs still to visit, and URLs already visited
tocrawl = set(["http://www.facebook.com/"])
crawled = set([])
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

while True:
    try:
        crawling = tocrawl.pop()
        print(crawling)
    except KeyError:
        # The queue is empty, so the crawl is finished
        break
    # Mark the page as visited before fetching so a failing URL is not retried
    crawled.add(crawling)
    try:
        response = urllib.request.urlopen(crawling)
        msg = response.read().decode('utf-8', 'replace')
    except Exception:
        # Skip pages that cannot be fetched
        continue
    # Print the page title, if there is one
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos+7)
        if endPos != -1:
            title = msg[startPos+7:endPos]
            print(title)
    # Print the meta keywords, if there are any
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0]
        keywordlist = keywordlist.split(", ")
        print(keywordlist)
    # Queue every link that has not been visited yet
    for link in linkregex.findall(msg):
        # Resolve relative links ('/path', '#anchor', 'page.html') against the current URL
        link = urljoin(crawling, link)
        # Ignore non-HTTP links such as mailto: or javascript:
        if link.startswith('http') and link not in crawled:
            tocrawl.add(link)

Reference: Python Web Crawler in Less Than 50 Lines (the page is slow or no longer works; it does not load for me)
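
The crawl loop above can also be sketched as a small reusable function. This is only an illustration under stated assumptions, not the original author's method: it uses the standard library's HTMLParser instead of regular expressions, visits pages breadth-first, and stops after a fixed number of pages. The names crawl, fetch, LinkParser, and max_pages are hypothetical, and the sketch leaves out robots.txt handling and rate limiting, which a real crawler should add.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its text (assumes UTF-8, replaces bad bytes)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", "replace")


def crawl(start_url, max_pages=10, fetch=fetch):
    """Visit up to max_pages pages breadth-first; return the URLs visited, in order."""
    to_crawl = deque([start_url])
    seen = {start_url}  # every URL that was ever queued
    visited = []
    while to_crawl and len(visited) < max_pages:
        url = to_crawl.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            # Resolve relative links and drop the #fragment so a page is queued once
            absolute, _fragment = urldefrag(urljoin(url, link))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                to_crawl.append(absolute)
    return visited
```

Calling crawl("http://www.python.org/", max_pages=5) would fetch at most five pages. Passing a different fetch function makes it possible to try the loop on canned HTML without touching the network.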

Jared Burrows
Scy
5

You can use BeautifulSoup. Follow the documentation and see what matches your requirements. The documentation also contains code snippets showing how to extract URLs.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc is the HTML string to parse

soup.find_all('a')  # Finds all <a> tags in the html doc; read each URL with tag.get('href')
Sushant Gupta
3

For parsing pages, check out the BeautifulSoup module. It's simple to use and lets you parse HTML pages. You can extract URLs from the HTML by calling soup.find_all('a') and reading the href attribute of each tag it returns.

Don't use regular expressions for parsing HTML
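
To illustrate why, here is a small comparison sketch. It runs a pattern in the style of the linkregex used in the crawler answer above against three ordinary links, next to the standard library's HTMLParser. The sample HTML and the LinkParser class are made up for this illustration.

```python
import re
from html.parser import HTMLParser

# A link pattern in the style of linkregex from the crawler answer
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


html = (
    '<a href="/plain">plain link</a>\n'
    '<a class="nav" href="/with-class">href is not the first attribute</a>\n'
    '<A HREF="/upper">uppercase markup</A>\n'
)

regex_links = linkregex.findall(html)

parser = LinkParser()
parser.feed(html)
parser_links = parser.links

print(regex_links)   # only the first link matches the pattern
print(parser_links)  # all three links are found
```

The pattern could be extended to cover these two cases, but each new variation in the markup needs another special case, which is the work a parser already does.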

Community
TerryA