I am trying to download all images from a URL using a regular expression.

For now I only need the URL of each image. I am using:

import re

def urlimage(text):
    '''finds image url'''
    imageurl = re.findall(r'https?:\/\/.*\.(?:png|jpg)', text)
    return imageurl

Currently this is not finding any image URLs. Is there an issue with my regex, or am I going about this wrong?

2brflow
  • So, do all URIs start with `http://` or `https://`, or are some of them relative URIs? Is this an HTML document, and should you be using BeautifulSoup or something like that? Finally, the `.*` should probably be `.*?`. – Dietrich Epp Nov 26 '16 at 00:37
  • ObLink: [Don't use regex to parse HTML](http://stackoverflow.com/a/1732454/4014959) – PM 2Ring Nov 26 '16 at 02:04

2 Answers

As mentioned in the comments, relative URLs would make this tricky with a regex, in which case something like BeautifulSoup is the better tool. That said, if a site serves over both http and https, it may also omit the protocol in its markup (like //example.com/image.png).

So then you'd want to tweak your regex to something like this:

def get_url_images_in_text(text):
    '''finds image urls'''
    return re.findall(r'(?:http\:|https\:)?\/\/.*\.(?:png|jpg)', text)

A full example of what I think you're trying to do:

import re
import requests

def get_url_images_in_text(text):
    '''finds image urls'''
    urls = []
    results = re.findall(r'(?:http\:|https\:)?\/\/.*\.(?:png|jpg)', text)
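    for x in results:
        # only protocol-relative URLs (//host/path) need a scheme prepended;
        # checking for 'http:' alone would also mangle https:// URLs
        if x.startswith('//'):
            x = 'http:' + x
        urls.append(x)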

    return urls

def get_images_from_url(url):
    resp = requests.get(url)
    urls = get_url_images_in_text(resp.text)
    print('urls', urls)

if __name__ == '__main__':
    get_images_from_url('http://stackoverflow.com')

would print:

('urls', [u'http://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png', u'http://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png', u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/6HFc3.png', u'https://i.stack.imgur.com/aABck.png', u'https://i.stack.imgur.com/aABck.png', u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/tKsDb.png'])
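
If relative URLs do turn up, a parser is likely the safer route, as the comments suggest. Here is a rough sketch using only the standard library; `html.parser` stands in for BeautifulSoup, and the `ImgSrcCollector` class, the `get_img_urls` helper and the sample markup are all made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collects the src of every <img> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        for name, value in attrs:
            if name == 'src' and value:
                # urljoin resolves relative and protocol-relative URLs
                self.urls.append(urljoin(self.base_url, value))


def get_img_urls(html, base_url):
    collector = ImgSrcCollector(base_url)
    collector.feed(html)
    return collector.urls


# Hypothetical markup: root-relative, protocol-relative and absolute src values
sample = ('<img src="/a.png">'
          '<img src="//cdn.example.com/b.jpg">'
          '<img src="https://example.com/c.png">')
print(get_img_urls(sample, 'https://example.com/page'))
```

All three URLs come back absolute, and the scheme for the protocol-relative one is taken from the page URL rather than hard-coded to http.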

Jack

> results = re.findall(r'(?:http\:|https\:)?\/\/.*\.(?:png|jpg)', text)

It is better to use a non-greedy quantifier so you get the shortest match (`*?` instead of `*`):

results = re.findall(r'(?:http\:|https\:)?\/\/.*?\.(?:png|jpg)', text)
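
To see the difference, here is a small comparison on a made-up line of markup that contains two image URLs (the sample text is an assumption, not taken from any real page):

```python
import re

# Hypothetical sample: two image URLs on the same line
text = '<img src="http://example.com/a.png"> <img src="//example.com/b.jpg">'

# Greedy: .* runs to the last .png/.jpg on the line
greedy = re.findall(r'(?:https?:)?//.*\.(?:png|jpg)', text)

# Non-greedy: .*? stops at the first .png/.jpg it can reach
lazy = re.findall(r'(?:https?:)?//.*?\.(?:png|jpg)', text)

print(greedy)  # a single long match spanning both tags
print(lazy)    # ['http://example.com/a.png', '//example.com/b.jpg']
```

Because `.` does not match newlines by default, the greedy version only goes wrong when several URLs share a line, which is common in minified HTML.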