0

I am a Python beginner and have written some code to download everything linked from the specified URL. Is there a better way to do this, and is the code below correct?

#!/usr/bin/python3

import re
import requests

def get_page(url):
    r = requests.get(url)
    print(r.status_code)
    content = r.text
    return content

if __name__ =="__main__":
    url = 'http://developer.android.com'
    content = get_page(url)
    content_pattern = re.compile('<a href=(.*?)>.*?</a>')
    result = re.findall(content_pattern, content)
    for link in result:
        with open('download.txt', 'wb') as fd:
            for chunk in r.iter_content(chunk_size):
                fd.write(chunk)
nik
  • What does the code __name__ == "__main__" compare? What does it mean? – nik Jun 26 '14 at 11:37
  • It checks whether this file is being run as the entry point of the program or imported as a module by another file. – myildirim Jun 26 '14 at 11:39
  • Also, this code seems correct. What are you asking for? – myildirim Jun 26 '14 at 11:39
  • @myildirim How do I specify the chunk_size? – nik Jun 26 '14 at 11:42
  • This code snippet uses the requests module; you can find what you are looking for in its documentation: http://www.python-requests.org/en/v0.14.2/api/ – myildirim Jun 26 '14 at 11:44
  • @myildirim Thanks. Right now the web page itself is being downloaded, but I can't get the links in the page, and my goal is to download whatever those links point to. How do I do that? – nik Jun 26 '14 at 11:54
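
To illustrate the __name__ == "__main__" check discussed in the comments above, here is a minimal sketch. The function name main and its return value are made up for illustration; only the __name__ comparison itself comes from the question.

```python
def main():
    """Hypothetical entry point; stands in for the real work of a script."""
    return "running as a script"


# __name__ is set by Python: it is "__main__" when this file is executed
# directly, and the module's own name when the file is imported elsewhere.
print("module name is:", __name__)

if __name__ == "__main__":
    # This branch is skipped when another file imports this module.
    print(main())
```

Running the file directly takes the guarded branch, while importing it from another file only defines main and prints the module name.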

2 Answers

2

Try this:

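from bs4 import BeautifulSoup
import sys
import requests

def get_links(url):
    """Return the href of every <a> tag found on the page at url."""
    r = requests.get(url)
    contents = r.content

    # Name the parser explicitly so the result does not depend on
    # which parsers happen to be installed.
    soup = BeautifulSoup(contents, 'html.parser')
    links = []
    # href=True skips anchors that have no href attribute.
    for link in soup.find_all('a', href=True):
        links.append(link['href'])
    return links

if __name__ == "__main__":
    url = sys.argv[1]
    print(get_links(url))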
Ayush
1

You may want to investigate the Linux wget command, which can already do what you want. If you really want a Python solution, then mechanize and Beautiful Soup can perform the HTTP requests and parse the HTML, respectively.
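
As a rough illustration of parsing the HTML with a real parser rather than a regex, here is a sketch that uses only the standard library (html.parser and urllib.parse) in place of Beautiful Soup. That swap is an assumption made so the example has no third-party dependencies. The names LinkCollector and extract_links are hypothetical, and resolving relative links with urljoin is an addition not mentioned in this answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/guide/" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all absolute link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The html argument would be the page text the question already fetches (r.text), and base_url the address it was fetched from.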

Matthew Franglen
  • I want a Python solution, and Beautiful Soup is used for parsing. Is it necessary to parse? Why is r.text not enough? – nik Jun 26 '14 at 11:45
  • You don't want to parse HTML yourself, especially not with a regex. You will get more reliable results if you use a proper library. This has been covered very well in this answer: http://stackoverflow.com/a/1732454/170865 – Matthew Franglen Jun 26 '14 at 11:47
  • Thanks. Right now the web page whose link I provide is being downloaded, but I can't get the links in the page, and my goal is to download whatever those links point to. How do I do that? – nik Jun 26 '14 at 11:58
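
One possible way to approach the last step, saving what each link points to, is sketched below. It assumes the response object behaves like a requests.Response fetched with stream=True, so that it offers iter_content(chunk_size). The helper names filename_for and save_response, the default file name, and the 8192-byte chunk size are illustrative choices, not part of the thread.

```python
import os
from urllib.parse import urlparse


def filename_for(url, default="index.html"):
    """Derive a local file name from the last path segment of a URL."""
    name = os.path.basename(urlparse(url).path)
    # URLs ending in "/" (or with no path at all) have no file name.
    return name or default


def save_response(response, path, chunk_size=8192):
    """Write a streamed response to disk chunk by chunk.

    response is assumed to behave like a requests.Response obtained with
    stream=True, i.e. to provide iter_content(chunk_size).
    Returns the number of bytes written.
    """
    written = 0
    with open(path, "wb") as fd:
        for chunk in response.iter_content(chunk_size):
            fd.write(chunk)
            written += len(chunk)
    return written
```

With requests installed, each link could then be saved with something like save_response(requests.get(link, stream=True), filename_for(link)). Unlike the loop in the question, this gives every link its own file, so later downloads do not overwrite download.txt. Note that two links ending in the same file name would still collide under this naming scheme.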