
Possible Duplicate:
How to download a file in python

I'm playing with Python for some crawling work. I know that urllib.urlopen("http://XXXX") will get me the HTML of a target website. However, the images in that page still point to their original URLs, so they become unavailable in the backed-up copy. I'm wondering whether there is a way to also save the images locally, so the full content of the page can be read without an internet connection. It's essentially backing up the whole webpage, but I'm not sure how to do that in Python. Also, if it could strip out the advertisements as well, that would be even better. Thanks.
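For reference, a rough sketch of this idea using only the Python 2 standard library (the URL, the `backup` output directory, and the regex-based `<img>` parsing are placeholders for illustration; a real backup would need more robust HTML parsing and error handling):

```python
import os
import re
import urllib
import urlparse

page_url = "http://example.com/"            # placeholder target page
html = urllib.urlopen(page_url).read()      # fetch the page HTML

if not os.path.isdir("backup"):
    os.mkdir("backup")

# Find <img src="..."> values, save each image next to the HTML,
# and point the HTML at the local copy so the page renders offline.
for src in re.findall(r'<img[^>]+src="([^"]+)"', html):
    img_url = urlparse.urljoin(page_url, src)
    name = os.path.basename(urlparse.urlsplit(img_url).path) or "image"
    urllib.urlretrieve(img_url, os.path.join("backup", name))
    html = html.replace(src, name)

with open(os.path.join("backup", "page.html"), "w") as f:
    f.write(html)
```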

JLTChiu
  • Do you really need to do this in Python? It is much easier to do what you want using `wget -p`. This will also retrieve images and other links that are required to display the page. You can play with `wget -L` or `wget -np` to remove the advertising stuff. – Hans Then Sep 30 '12 at 20:42
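If calling wget is acceptable, it can also be driven from Python instead of re-implementing the download logic. A minimal sketch, assuming wget is installed and the URL is a placeholder (`-k`, which rewrites links for offline viewing, is an extra flag not mentioned in the comment):

```python
import subprocess

subprocess.check_call([
    "wget",
    "-p",   # fetch the page plus the images/CSS/JS it needs
    "-k",   # rewrite links so the local copy works offline
    "http://example.com/some/page.html",
])
```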

1 Answer


If you're looking to backup a single webpage, you're well on your way.

Since you mention crawling: if you want to back up an entire website, you'll need to do some real crawling, and you'll need scrapy for that.
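If that's the route, a minimal Scrapy spider might look roughly like the sketch below (the spider name, start URL, domain restriction, and file-naming scheme are assumptions for illustration; run it with `scrapy runspider backup_spider.py`):

```python
import scrapy

class BackupSpider(scrapy.Spider):
    """Crawl a site and save each page's raw HTML locally (sketch only)."""
    name = "backup"
    allowed_domains = ["example.com"]        # stay on the target site
    start_urls = ["http://example.com/"]     # placeholder start page

    def parse(self, response):
        # Save this page's HTML to disk, named after the last URL segment.
        filename = response.url.rstrip("/").split("/")[-1] or "index"
        with open(filename + ".html", "wb") as f:
            f.write(response.body)
        # Queue every link on the page for crawling as well.
        for href in response.css("a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
```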

There are several ways of downloading files off the interwebs; see these questions (a one-line example follows the list):

  1. Python File Download
  2. How to download a file in python
  3. Automate file download from http using python
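The simplest of those approaches, for grabbing a single file such as an image, is roughly this (Python 2; the URL and local filename are placeholders):

```python
import urllib

# Fetch one remote file and write it to the local disk.
urllib.urlretrieve("http://example.com/logo.png", "logo.png")
```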

Hope this helps

inspectorG4dget
  • I see, thanks. Is it possible to save the whole webpage into a database instead of a file on my desktop? Although that might seem a little strange, because I don't know how to preserve the relationship between the HTML and its images in the database... – JLTChiu Sep 30 '12 at 21:28
  • Why would you want to use a DB? A well written html file should know where to find images in the filesystem. – inspectorG4dget Sep 30 '12 at 21:37
  • If you use wget (and probably scrapy too, I just haven't used it myself) and then view the .html file on your drive, it will properly link to the images, which are also local. – ninMonkey Oct 01 '12 at 02:06