1

This is probably a very simple task, but I cannot find any help. I have a website whose pages take the form www.xyz.com/somestuff/ID, and I have a list of the IDs I need information from. I was hoping to have a simple script that goes to the site and downloads the (complete) web page for each ID, saving it in a specific folder under a name like ID_whatever_the_default_save_name_is.

Can I run a simple Python script to do this for me? I can do it by hand, since it is only 75 different pages, but I was hoping to use this to learn how to do things like this in the future.

CJ12

3 Answers

0

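Mechanize is a great package for crawling the web with Python. A simple example for your issue would be:

import mechanize

br = mechanize.Browser()
# The URL needs an explicit scheme such as http://
response = br.open("http://www.xyz.com/somestuff/ID")
# print(...) with parentheses runs on both Python 2 and Python 3
print(response.read())

This simply opens your URL and prints the body of the response from the server.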

iCanHasFay
  • Thanks for the start, but I get an invalid syntax, and how can I load a list of IDs into this to have them saved in the format 'ID_webpage' in the same folder? – CJ12 Aug 29 '13 at 04:56
  • I'm assuming you're getting that syntax error on the import statement, as mechanize does not come with Python by default. As for the rest, it is generally frowned upon on SO to ask for help without providing some code to show what you've done so far, so I will just provide a general outline. Store your IDs as a list, loop through that list, and place the above code in the loop, changing 'ID' in the open line to the ID currently being iterated over. Then write that response to a file. – iCanHasFay Aug 29 '13 at 05:11
  • Thanks, I will look into mechanize. The big part of my question is for it to save the web page as a complete web page. This does not seem to do that. Would that be an edit to the code? – CJ12 Aug 29 '13 at 05:48
  • ICanHasFay: I installed mechanize using "pip install ..." on Python 3.3.1 and I also get the syntax error: Invalid Syntax . I don't think it is an import statement error. – Joe Dec 09 '13 at 19:20
  • Number 1 under the faq page on the Mechanize site (http://wwwsearch.sourceforge.net/mechanize/faq.html) states "Python 2.4, 2.5, 2.6, or 2.7. Python 3 is not yet supported." – iCanHasFay Dec 11 '13 at 06:32
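The outline in the comments above (keep the IDs in a list, loop over them, write each response to a file) can be sketched with the standard library, which also sidesteps the Python 3 limitation quoted from the mechanize FAQ. This is only a rough sketch: the names `fetch` and `save_pages`, the `fetcher` parameter, and the `ID_webpage.html` file name are assumptions for illustration, and the base URL is the placeholder from the question. It saves the HTML source only, not the images or stylesheets that a browser's "complete web page" save would include.

```python
import os
import urllib.request

BASE_URL = "http://www.xyz.com/somestuff/"  # placeholder site from the question


def fetch(url):
    """Download one URL and return the raw bytes of the response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def save_pages(ids, out_dir, fetcher=fetch):
    """Fetch BASE_URL + ID for each ID and save it as <ID>_webpage.html."""
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    saved = []
    for page_id in ids:
        html = fetcher(BASE_URL + str(page_id))
        path = os.path.join(out_dir, "{}_webpage.html".format(page_id))
        with open(path, "wb") as f:  # binary mode, because fetcher returns bytes
            f.write(html)
        saved.append(path)
    return saved
```

Calling `save_pages(list_of_ids, "pages")` would then write one file per ID into a `pages` folder and return the list of paths. Passing the download function in as `fetcher` is a design choice that makes it easy to swap in another client or to try the loop without touching the network.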
0

This can be done simply in Python using the urllib module. Here is a simple example in Python 3:

import urllib.request

# The URL needs an explicit scheme such as http://
url = 'http://www.xyz.com/somestuff/ID'
req = urllib.request.Request(url)
page = urllib.request.urlopen(req)
# read() returns the whole response body as bytes
src = page.read()
print(src)

For more info on the urllib module -> http://docs.python.org/3.3/library/urllib.html

  • Thanks, I am looking for the final product to be saved as a complete webpage html file that I can view offline. Also, I want to feed a list of IDs and have them saved in the same place with the file name ID_default_save_name – CJ12 Aug 29 '13 at 04:58
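Saving the HTML alone will not give an offline copy that looks like a browser's "complete web page" save, because images, stylesheets, and scripts live at separate URLs. As a rough sketch of the first step only, the standard library's `html.parser` can list those resource URLs so they can be downloaded alongside the page. The class name `AssetCollector` and the helper `find_assets` are made up for illustration, and rewriting the links to point at local copies is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts, and stylesheets in a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag in ("img", "script") and attrs.get("src"):
            link = attrs["src"]
        elif tag == "link" and "stylesheet" in rel and attrs.get("href"):
            link = attrs["href"]
        else:
            return  # not a resource this sketch cares about
        # Resolve relative paths against the URL of the page itself.
        self.assets.append(urljoin(self.page_url, link))


def find_assets(html_text, page_url):
    """Return the resource URLs referenced by html_text, in document order."""
    parser = AssetCollector(page_url)
    parser.feed(html_text)
    return parser.assets
```

Each URL returned by `find_assets` could then be fetched and stored next to the saved HTML file. Resources referenced from inside CSS or loaded by JavaScript would be missed by this approach.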
-2

Do you want just the HTML code for the website? If so, just create a url variable with the host site and add the page number as you go. I'll do this for an example with http://www.notalwaysright.com

import os
import urllib.request

url = "http://www.notalwaysright.com/page/"
os.makedirs("Page", exist_ok=True)  # the output folder must exist

for x in range(1, 71):
    newurl = url + str(x)  # x is an int, so convert it before concatenating
    response = urllib.request.urlopen(newurl)
    # read() returns bytes, so open the file in binary write mode
    with open("Page/" + str(x), "wb") as p:
        p.write(response.read())
rassa45