
I am trying to create a script that scrapes a webpage and downloads any image files found.

My first function uses wget to read the webpage and assign its HTML to a variable. My second function uses a regex to search for `src=` attributes in the page's HTML; here is the function:

import re

def find_image(text):
    '''Find .gif, .jpg and .bmp files'''
    documents = re.findall(r'\ssrc="([^"]+)"', text)
    count = len(documents)
    print "[+] Total number of files found: %s" % count
    return '\n'.join(documents)

The output looks something like this:

example.jpg
image.gif
http://www.webpage.com/example/file01.bmp

I am trying to write a third function that downloads these files using urllib.urlretrieve(url, filename), but I am not sure how to go about it, mainly because some of the output consists of absolute paths whereas the rest is relative. I am also unsure how to download them all at the same time, and how to download them without having to specify a name and location every time.
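For the relative-versus-absolute problem, the standard library's urljoin resolves a possibly relative src against the page's URL and leaves absolute URLs untouched, and the last path segment of the result can serve as a local filename so nothing has to be named by hand. A minimal sketch, using the Python 3 module names (in Python 2 the same functions live in the urlparse module, and urlretrieve is in urllib):

```python
import os
from urllib.parse import urljoin, urlparse

def resolve(page_url, src):
    """Resolve a (possibly relative) src against the page URL."""
    return urljoin(page_url, src)

def filename_for(url):
    """Derive a local filename from the last path segment of the URL."""
    return os.path.basename(urlparse(url).path)

page = "http://www.webpage.com/example/page.html"
print(resolve(page, "example.jpg"))
# -> http://www.webpage.com/example/example.jpg
print(resolve(page, "http://www.webpage.com/example/file01.bmp"))
# -> http://www.webpage.com/example/file01.bmp (already absolute, unchanged)
print(filename_for(resolve(page, "example.jpg")))
# -> example.jpg
```

Each resolved URL can then be passed to urlretrieve(url, filename_for(url)) without specifying anything manually.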

lucasnadalutti
Billy King
    don't parse html with regexes http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – n1c9 Nov 24 '16 at 18:45

2 Answers


Path-agnostic fetching of resources (handles both absolute and relative paths):

from bs4 import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os

def fetch_url(url, out_folder="test/"):
    """Downloads all the images at 'url' into out_folder"""
    soup = bs(urlopen(url), "html.parser")

    for image in soup.findAll("img"):
        if not image.get("src"):
            continue  # skip <img> tags without a src attribute
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        outpath = os.path.join(out_folder, filename)
        # urljoin resolves relative srcs against the page URL
        # and leaves absolute srcs untouched
        urlretrieve(urlparse.urljoin(url, image["src"]), outpath)

fetch_url('http://www.w3schools.com/html/')
Vivek Kalyanarangan

I can't write the complete code for you, and I'm sure that's not what you want anyway, but here are some hints:

1) Do not parse arbitrary HTML pages with regex; there are parsers made for exactly that. I suggest BeautifulSoup. With it you can select all img elements and get their src values.

2) With the src values at hand, you download your files the way you are already doing. About the relative/absolute problem, use the urlparse module, as per this SO answer. The idea is to join the src of the image with the URL from which you downloaded the HTML. If the src is already absolute, it will remain that way.

3) As for downloading them all, simply iterate over a list of the webpages you want to download images from and do steps 1 and 2 for each image in each page. When you say "at the same time", you probably mean downloading them asynchronously. In that case, I suggest downloading each webpage in its own thread.
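The per-page threading in point 3 could look like the sketch below, using Python 3's concurrent.futures. The fetch callable is deliberately injectable, since the actual download step depends on your setup; passing urllib.request.urlretrieve (Python 3) or urllib.urlretrieve (Python 2) would do real downloads. The names and the pages-dict shape here are my own assumptions, not from the question:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin
import os

def download_images(page_url, srcs, fetch):
    """Resolve each src against page_url and hand it to fetch(url, filename)."""
    for src in srcs:
        url = urljoin(page_url, src)
        # the last path segment serves as the local filename
        fetch(url, os.path.basename(url))

def download_all(pages, fetch, workers=4):
    """Run download_images for each (page_url, srcs) pair in its own thread.

    pages: dict mapping a page URL to the list of img srcs found on it.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for page_url, srcs in pages.items():
            pool.submit(download_images, page_url, srcs, fetch)
```

The with-block waits for all submitted downloads to finish before returning, so the function is safe to call from straight-line code.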

lucasnadalutti