There are a couple of images and one Word document in the given page source, and I am trying to download all of them by matching them with the regex I wrote, "\w+\.\w{1,4}". Is the regex suitable or not?

Also, is this line of code right? retrieve = urllib.urlretrieve(i,'C:\Python27')

Here is my code:

import sys, urllib, re

def retriev_files(page):
    open_page = urllib.urlopen(page)
    contents = open_page.read()
    find_files = re.findall("\w+\.\w{1,4}",contents)
    for i in find_files:
        try:
            print " retrieving %s ... " %i
            retrieve = urllib.urlretrieve(i,'C:\Python27')
            print " done !! "
            return retrieve

        except urllib.urlretrieve as err:
            pass

def main():
    print retriev_files("http://www.soc.napier.ac.uk/~40001507/CSN08115/cw_webpage/index.html")
if __name__ == "__main__":
    main()
Maximilian Peters
ibr2

1 Answer

There are several issues with your code:

  • Your regex matches any run of word characters followed by a dot and one to four more word characters. That matches icon_clown.gif, but also r.macf, which is part of an email address. Have a look at this famous answer here to get an idea why regex is not a good approach for parsing HTML. Try something like BeautifulSoup or, preferably, Selenium for getting data from web pages.

  • return retrieve sits inside the loop, so it retrieves only the first file and then exits your function. You could define a list retrieved_images, call retrieved_images.append(retrieve[0]) inside the loop, and return the list at the end.

  • urlretrieve returns a tuple whose first element is the local filename (the reason for [0] in the line above). The second argument needs to be a filename, not a directory such as 'C:\Python27'.
  • Your regex would find some filenames, e.g. icon_clown.gif, but it doesn't give you the full URL, so you would need to combine the URL from page with your regex match.

The following line might work for most cases, e.g. when only the relative image URL is given (here path is the page URL and file is the regex match):

urllib.urlretrieve(path[0:path.rfind('/')] + '/' + file)
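
The points above could be combined roughly as follows. This is only a sketch under a few assumptions: it uses the standard library's HTML parser instead of BeautifulSoup or Selenium so that it needs no extra packages, the names LinkCollector, find_file_urls and retrieve_files are made up for illustration, and the list of file extensions is a guess at what the page contains. It resolves relative links with urljoin rather than string slicing, passes a real filename as the second argument to urlretrieve, and returns a list instead of returning inside the loop.

```python
import os
import posixpath

try:  # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen, urlretrieve
except ImportError:  # Python 2, as used in the question
    from HTMLParser import HTMLParser
    from urlparse import urljoin, urlparse
    from urllib import urlopen, urlretrieve


class LinkCollector(HTMLParser):
    """Collects the src of <img> tags and the href of <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Pick the attribute that carries the link for this tag, if any.
        wanted = {"img": "src", "a": "href"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def find_file_urls(page_url, html,
                   extensions=(".gif", ".jpg", ".png", ".doc", ".docx")):
    """Return absolute URLs of linked files ending in one of `extensions`.

    The extension list is an assumption; adjust it to the page at hand.
    """
    collector = LinkCollector()
    collector.feed(html)
    urls = []
    for link in collector.links:
        absolute = urljoin(page_url, link)  # resolve relative links
        if urlparse(absolute).path.lower().endswith(extensions):
            urls.append(absolute)
    return urls


def retrieve_files(page_url, target_dir="."):
    """Download every matching file and return the local filenames."""
    html = urlopen(page_url).read().decode("utf-8", "replace")
    retrieved = []
    for url in find_file_urls(page_url, html):
        # The second argument is a file name, not just a directory.
        name = posixpath.basename(urlparse(url).path)
        local_name = os.path.join(target_dir, name)
        filename, _headers = urlretrieve(url, local_name)
        retrieved.append(filename)
    return retrieved
```

Calling retrieve_files(page) with the URL from the question would then return the list of saved filenames. Because only src and href attributes are inspected, text such as an email address elsewhere in the page is never mistaken for a file.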
Maximilian Peters