
I want to download all the files from a web page, specifically all of the image files. The urllib module seems to be what I need. There is a method to download a file if you know its filename, but I don't know the filenames in advance:

urllib.urlretrieve('http://www.example.com/page', 'myfile.jpg')

Is there a method to download all the files from the page and maybe return a list?

Brock123
  • possible duplicate of [Web scraping with Python](http://stackoverflow.com/questions/2081586/web-scraping-with-python) – Mat Oct 01 '11 at 08:01
  • Can't find much info. Perhaps a small example script? – Brock123 Oct 01 '11 at 08:19
  • Brock123, did you read the link @Mat posted above? It points you toward [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) for scraping the page, which you can use to find all the URLs of the files you then wish to download. – John Keyes Oct 01 '11 at 10:24

1 Answer


Here's a little example to get you started with BeautifulSoup for this kind of exercise: you give this script a URL, and it prints out the URL of every image referenced from that page, i.e. each img tag whose src attribute ends in .jpg or .png:

import sys, urllib, re, urlparse
from BeautifulSoup import BeautifulSoup

# Expect exactly one command-line argument: the URL of the page to scan.
if not len(sys.argv) == 2:
    print >> sys.stderr, "Usage: %s <URL>" % (sys.argv[0],)
    sys.exit(1)

url = sys.argv[1]

# Fetch the page and parse it with BeautifulSoup.
f = urllib.urlopen(url)
soup = BeautifulSoup(f)

# Find every img tag whose src attribute ends in ".jpg" or ".png"
# (case-insensitively), and resolve each src against the page URL so
# that relative links become absolute.
for i in soup.findAll('img', attrs={'src': re.compile(r'(?i)\.(jpg|png)$')}):
    full_url = urlparse.urljoin(url, i['src'])
    print "image URL:", full_url

Then you can use urllib.urlretrieve to download each of the images pointed to by full_url, but at that stage you have to decide how to name them and what to do with the downloaded images, which isn't specified in your question.
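For example, here's a minimal sketch of that last step. It assumes, purely as one illustrative choice, that you want to save each image into the current directory under the last path segment of its URL:

import os, urllib, urlparse

def download_image(full_url):
    # Derive a local filename from the last path segment of the URL.
    # This naming scheme is just an assumption for illustration; it
    # doesn't deal with duplicate names or query strings specially.
    path = urlparse.urlsplit(full_url).path
    filename = os.path.basename(path) or 'unnamed-image'
    urllib.urlretrieve(full_url, filename)
    return filename

You could then call download_image(full_url) in place of (or after) the print statement in the loop above.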

Mark Longair