I'm trying to retrieve the source of a webpage, including any images. At the moment I have this:

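import urllib

# urlretrieve saves the page to 'urlgot.php' and returns (filename, headers)
filename, headers = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print(open(filename).read())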

which retrieves the source fine, but I also need to download any linked images.

I was thinking I could write a regular expression that searches the downloaded source for img src or similar; however, is there a urllib function that would retrieve the images as well? Something like the wget command:

wget -r --no-parent http://127.0.0.1/myurl.php

I don't want to use the os module to run wget, as I want the script to run on all systems. For the same reason, I can't use any third-party modules either.

Any help is much appreciated! Thanks

Jingo
  • Good luck. You should also ask how to package Python packages, and use your system's package manager. – Keith Sep 06 '11 at 00:10

2 Answers

Don't use regex when there is a perfectly good parser built in to Python:

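import os
from urllib.parse import urljoin, urlparse  # Py2: from urlparse
from urllib.request import urlretrieve      # Py2: from urllib
from html.parser import HTMLParser          # Py2: from HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""
    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.downloads.append(value)

parser = ImgParser()
with open('test.html') as f:
    # instead you could feed it the original url obj directly
    parser.feed(f.read())

for src in parser.downloads:
    # urljoin copes with relative, root-relative and absolute src values
    url = urljoin(base_url, src)
    # save under the bare file name so no local sub-directories are needed
    filename = os.path.basename(urlparse(url).path)
    print(url)
    urlretrieve(url, filename)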
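
The comment about feeding the parser the URL object directly might look roughly like the sketch below. The names `ImgSrcCollector` and `collect_img_sources` are made up for illustration, and the UTF-8 default is an assumption (a real page may declare a different charset). An in-memory `io.BytesIO` stands in for the network response so the example runs offline.

```python
import io
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: records the src of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def collect_img_sources(response, encoding='utf-8'):
    """Parse any object whose read() returns bytes (e.g. a urlopen() response).

    The encoding default is an assumption; undecodable bytes are replaced
    rather than raising.
    """
    collector = ImgSrcCollector()
    collector.feed(response.read().decode(encoding, errors='replace'))
    collector.close()
    return collector.sources


# Stand-in for a network response, so no server is needed to try this out.
fake_response = io.BytesIO(
    b'<html><body><img src="a.png"><p>text</p>'
    b'<img alt="x" src="img/b.jpg"></body></html>'
)
print(collect_img_sources(fake_response))  # ['a.png', 'img/b.jpg']
```

With a real page, passing the result of `urllib.request.urlopen('http://127.0.0.1/myurl.php')` should work the same way, since that response object also has a `read()` method returning bytes.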
Gringo Suave

Use BeautifulSoup to parse the returned HTML and search for image links. You might also need to recursively fetch frames and iframes.
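
Since third-party modules are ruled out in the question, here is a rough sketch of the recursive frame/iframe idea using the standard library's `html.parser` in place of BeautifulSoup. `ResourceParser`, `find_images` and the `fetch` callable are hypothetical names; `fetch` is assumed to take a URL and return the page's HTML as a string, which lets the example run against an in-memory dictionary instead of a live server.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceParser(HTMLParser):
    """Hypothetical helper: separates <img> sources from frame/iframe sources."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.frames = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get('src')
        if not src:
            return
        if tag == 'img':
            self.images.append(src)
        elif tag in ('frame', 'iframe'):
            self.frames.append(src)


def find_images(url, fetch, seen=None):
    """Return absolute image URLs found on `url` and inside any nested frames.

    `fetch` is an assumed callable: URL in, HTML text out.
    `seen` guards against frames that point back at each other.
    """
    seen = set() if seen is None else seen
    if url in seen:
        return []
    seen.add(url)

    parser = ResourceParser()
    parser.feed(fetch(url))
    parser.close()

    # Resolve every src against the page it was found on.
    images = [urljoin(url, src) for src in parser.images]
    for frame_src in parser.frames:
        images.extend(find_images(urljoin(url, frame_src), fetch, seen))
    return images


# In-memory stand-in for a site; the second page frames the first one again.
pages = {
    'http://127.0.0.1/myurl.php':
        '<img src="logo.png"><iframe src="inner/frame.html"></iframe>',
    'http://127.0.0.1/inner/frame.html':
        '<img src="pic.jpg"><iframe src="/myurl.php"></iframe>',
}
print(find_images('http://127.0.0.1/myurl.php', pages.__getitem__))
# ['http://127.0.0.1/logo.png', 'http://127.0.0.1/inner/pic.jpg']
```

For real use, `fetch` could be a small wrapper around `urllib.request.urlopen` that reads and decodes the response; the returned URLs can then be handed to `urlretrieve`.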

Marcelo Cantos
  • forgive my ignorance but would that not mean it wouldn't be able to run on someone's computer who didn't have the beautiful soup module installed? – Jingo Sep 05 '11 at 21:28
  • You need to distribute the BeautifulSoup library with your application. It should not be very difficult, unless you are dealing with native extensions, which on Windows tend to have .exe installers. – Mikko Ohtamaa Sep 05 '11 at 21:49
  • thanks but that's not really what i'm looking for =( - i'll just use a regex to parse for img tags. cheers – Jingo Sep 05 '11 at 22:37
  • @Jingo: That's fine, but be sure to deal properly with varying order of img attributes and multi-line img elements. You may also want to avoid img elements inside comments and strings. – Marcelo Cantos Sep 05 '11 at 22:42
  • @Jingo: be warned. [HTML is **not** a regular language.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – André Caron Sep 06 '11 at 00:02
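
To illustrate the pitfalls raised in these comments (attribute order, multi-line img elements, images inside comments and strings), here is a small sketch showing how the standard library's `html.parser` treats each case. It needs no third-party modules. `ImgSrcParser` and the sample markup are invented for the demonstration.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Hypothetical helper: collects src values from real <img> tags only."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; values keep their case.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


tricky_html = '''
<img src="first.png" alt="src comes first">
<img alt="src comes last" width="10" src="second.png">
<img
    alt="spread over several lines"
    src="third.png"
>
<!-- <img src="commented-out.png"> -->
<script>var s = '<img src="in-a-string.png">';</script>
<IMG SRC='upper-case.png'>
'''

parser = ImgSrcParser()
parser.feed(tricky_html)
parser.close()
print(parser.sources)
# ['first.png', 'second.png', 'third.png', 'upper-case.png']
```

The commented-out image and the one inside the script string are skipped because the parser hands comments and script contents to separate handlers rather than treating them as tags. A single regex would have to special-case each of these.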