Web crawler that downloads all the links in a webpage

Question

I am a Python beginner and have written some code to download all the links in the specified URL. Is there a better way to do this, and is the code below correct?

#!/usr/bin/python3

import re
import requests

def get_page(url):
    r = requests.get(url)
    print(r.status_code)
    content = r.text
    return content

if __name__ =="__main__":
    url = 'http://developer.android.com'
    content = get_page(url)
    content_pattern = re.compile('<a href=(.*?)>.*?</a>')
    result = re.findall(content_pattern, content)
    for link in result:
        with open('download.txt', 'wb') as fd:
            for chunk in r.iter_content(chunk_size):
                fd.write(chunk)

What does the code __name__ == "__main__" compare? What does it mean? – nik, Jun 26 '14 at 11:37
It checks whether this file is being run as the entry point of the program or imported as a module by another file. – myildirim, Jun 26 '14 at 11:39
This code snippet uses the requests module; you can find what you are looking for in its documentation: http://www.python-requests.org/en/v0.14.2/api/ – myildirim, Jun 26 '14 at 11:44
@myildirim Thanks. Right now the webpage itself is getting downloaded, but I can't get the links in the webpage, and my goal is to download the content behind those links. How do I do that? – nik, Jun 26 '14 at 11:54
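
As a small illustration of the __name__ check discussed in the comments above, here is a minimal sketch. The file name crawler_demo.py and the function names are made up for the example; only the __name__ behaviour itself is standard Python.

```python
# crawler_demo.py (hypothetical file name, used only for this illustration)

def get_greeting():
    """Return a message; stands in for the real work of the module."""
    return "crawler module loaded"


def main():
    print(get_greeting())


if __name__ == "__main__":
    # True only when the file is started directly, e.g. `python3 crawler_demo.py`.
    # When another file runs `import crawler_demo`, __name__ is "crawler_demo",
    # so main() is skipped and only the definitions above become available.
    main()
```

Running the file directly prints the message; importing it from another file defines get_greeting() and main() without printing anything.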

score 2 · Answer 1 · answered Jun 27 '14 at 08:12

Try this:

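from bs4 import BeautifulSoup
import sys
import requests

def get_links(url):
    """Return the href value of every <a> tag on the page at url."""
    r = requests.get(url)
    contents = r.content

    # Name the parser explicitly; otherwise bs4 guesses one and emits a warning
    soup = BeautifulSoup(contents, 'html.parser')
    # href=True skips anchors without an href, so no KeyError handling is needed
    return [link['href'] for link in soup.find_all('a', href=True)]

if __name__ == "__main__":
    url = sys.argv[1]
    # print is a function in Python 3 (the question uses python3)
    print(get_links(url))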
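
The function above only returns the href values, many of which may be relative. To actually download what the links point to, something along these lines might work. This is a rough sketch, not a full crawler: absolute_links, filename_for, download_all, and the downloads directory are names invented for this example, and it uses the standard library's urllib instead of requests so that it runs without extra packages (requests.get(url, stream=True) with iter_content would do the same job).

```python
import os
import shutil
import urllib.request
from urllib.parse import urldefrag, urljoin, urlparse


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, keeping unique http(s) URLs in order."""
    seen = []
    for href in hrefs:
        # Make the link absolute and drop any #fragment part
        full, _fragment = urldefrag(urljoin(base_url, href))
        # Skip mailto:, javascript: and other schemes that cannot be fetched
        if urlparse(full).scheme in ("http", "https") and full not in seen:
            seen.append(full)
    return seen


def filename_for(url, index):
    """Pick a local file name from the URL path (hypothetical naming scheme)."""
    name = os.path.basename(urlparse(url).path)
    return name or "page_{}.html".format(index)


def download_all(base_url, hrefs, dest_dir="downloads"):
    """Fetch every resolved link and save it under dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for index, url in enumerate(absolute_links(base_url, hrefs)):
        target = os.path.join(dest_dir, filename_for(url, index))
        try:
            with urllib.request.urlopen(url, timeout=10) as response, \
                    open(target, "wb") as fd:
                shutil.copyfileobj(response, fd)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            print("skipped {}: {}".format(url, exc))
```

With get_links from above, the call would be download_all(url, get_links(url)). Two links with the same file name overwrite each other in this sketch, so a real crawler would need a more careful naming scheme.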

score 1 · Accepted Answer · answered Jun 26 '14 at 11:43

You may want to look at the Linux wget command, which can already do what you want. If you really want a Python solution, then mechanize and Beautiful Soup can perform the HTTP requests and parse the HTML, respectively.

Matthew Franglen

I want a Python solution, and Beautiful Soup is used for parsing. Is it necessary to parse? Why is r.text not enough? – nik Jun 26 '14 at 11:45
You don't want to parse HTML by yourself, especially not with a regex. You will get more reliable results if you use a proper library. This has been covered very well in this answer: http://stackoverflow.com/a/1732454/170865 – Matthew Franglen Jun 26 '14 at 11:47
Thanks. Right now the webpage whose link I provide is getting downloaded, but I can't get the links in the webpage, and my goal is to download the content behind those links. How do I do that? – nik Jun 26 '14 at 11:58
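
To make the regex concern from the comments above concrete: r.text is just one long string, so the links still have to be extracted from it somehow. The sketch below runs the pattern from the question and a real HTML parser over the same markup. The SAMPLE string and the LinkCollector class are made up for this illustration, and the standard library's html.parser stands in for Beautiful Soup so the example has no dependencies.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up sample markup for illustration
SAMPLE = (
    '<a href="/one">One</a>\n'
    '<a class="nav" href="/two">Two</a>\n'
    "<A HREF='/three'>Three</A>"
)

# The pattern from the question: it needs href to come right after "<a ",
# is case-sensitive, and keeps the surrounding quotes in the result.
regex_links = re.findall('<a href=(.*?)>.*?</a>', SAMPLE)

collector = LinkCollector()
collector.feed(SAMPLE)
parser_links = collector.links

print(regex_links)   # ['"/one"']
print(parser_links)  # ['/one', '/two', '/three']
```

The regex finds one of the three links and leaves the quotes attached, while the parser finds all three, which is the kind of difference the comment about proper libraries is pointing at.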
