As mentioned in the comments, relative URLs make this tricky, in which case it may be worth using something like BeautifulSoup instead. That said, if a site serves over both http and https, it may leave the protocol out of its markup entirely (like //example.com/image.png).
So then you'd want to tweak your regex to something like this:
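
    def get_url_images_in_text(text):
        '''Find image URLs, with or without a scheme.'''
        # The character class stops a match at whitespace, quotes, or angle
        # brackets, so it cannot run across several URLs on one line.
        return re.findall(r'''(?:https?:)?//[^\s"'<>]+\.(?:png|jpg)''', text)
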
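To see what the optional scheme group buys you, here is a quick check against some made-up markup. The sample string and the name IMAGE_URL_RE are my own assumptions for illustration, and the pattern uses a character class so a match stops at quotes and whitespace; treat it as a sketch, not a general URL matcher.

```python
import re

# Optional scheme, then "//", then a run of characters that cannot appear
# inside a quoted attribute value, ending in .png or .jpg.
IMAGE_URL_RE = re.compile(r'''(?:https?:)?//[^\s"'<>]+\.(?:png|jpg)''')

# Hypothetical markup, made up for this illustration.
sample = (
    '<img src="http://example.com/a.png">'
    '<img src="https://example.com/b.jpg">'
    '<img src="//example.com/c.png">'
    '<img src="/static/d.png">'
)

matches = IMAGE_URL_RE.findall(sample)
print(matches)
# ['http://example.com/a.png', 'https://example.com/b.jpg', '//example.com/c.png']
```

Note that the root-relative /static/d.png is not picked up at all, which is the relative-URL limitation mentioned at the top.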
A full example of what I think you're trying to do:

    import re
    import requests


    def get_url_images_in_text(text):
        '''Find image URLs and give protocol-relative ones a scheme.'''
        urls = []
        results = re.findall(r'''(?:https?:)?//[^\s"'<>]+\.(?:png|jpg)''', text)
        for x in results:
            # Only protocol-relative URLs ('//host/path') need a scheme;
            # checking for 'http:' alone would also rewrite https:// URLs.
            if x.startswith('//'):
                x = 'http:' + x
            urls.append(x)
        return urls


    def get_images_from_url(url):
        resp = requests.get(url)
        urls = get_url_images_in_text(resp.text)
        print('urls', urls)


    if __name__ == '__main__':
        get_images_from_url('http://stackoverflow.com')

Run under Python 2 (hence the tuple and the u'' prefixes), that would print:

    ('urls',
     [u'http://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png',
      u'http://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png',
      u'https://i.stack.imgur.com/tKsDb.png',
      u'https://i.stack.imgur.com/6HFc3.png',
      u'https://i.stack.imgur.com/aABck.png',
      u'https://i.stack.imgur.com/aABck.png',
      u'https://i.stack.imgur.com/tKsDb.png',
      u'https://i.stack.imgur.com/tKsDb.png'])
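
If you do need relative URLs, a parser-based approach is probably easier than stretching the regex further. Below is a minimal sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup (so it runs without extra installs) and urljoin to resolve each src against the page URL. The names ImageSrcCollector and get_image_urls and the sample markup are hypothetical, and unlike the regex this collects every img src regardless of file extension. It also drops the duplicates you can see in the output above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


def get_image_urls(html, base_url):
    """Return absolute, de-duplicated image URLs in document order."""
    collector = ImageSrcCollector()
    collector.feed(html)
    # urljoin handles absolute, protocol-relative and root-relative forms.
    absolute = [urljoin(base_url, src) for src in collector.sources]
    # dict.fromkeys keeps the first occurrence of each URL, in order.
    return list(dict.fromkeys(absolute))


# Hypothetical page, made up for this illustration.
html = (
    '<img src="/static/logo.png">'
    '<img src="//cdn.example.com/a.jpg">'
    '<img src="https://example.com/b.png">'
    '<img src="/static/logo.png">'
)
image_urls = get_image_urls(html, 'https://example.com/questions')
print(image_urls)
```

Protocol-relative URLs pick up the scheme of the page they came from here, rather than a hard-coded http:.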