My goal is to fetch the filename, extension and content of an image from its URL. My function should work for both of these URLs:

easy case: https://image.shutterstock.com/image-photo/bright-spring-view-cameo-island-260nw-1048185397.jpg

hard case (does not end with filename.extension): https://images.unsplash.com/photo-1472214103451-9374bd1c798e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80

Currently, what I have looks like this:

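from os.path import splitext, basename
from urllib.parse import urlparse

def get_filename_from_url(url):
    # Take only the path component, so any query string is ignored
    path = urlparse(url).path
    # Split the last path segment into name and extension
    filename, file_ext = splitext(basename(path))
    print(filename, file_ext)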

This works fine for the easy case, but it gives no extension for the hard-case URL. I have a feeling that I can use Python's requests module, read the response headers to find the MIME type, and then map that type to an extension (for example with the standard library's mimetypes module). So I went on to try this:

import requests

response = requests.get(url, stream=True)

Here, someone seems to describe the clue in a screenshot,

but the problem is that with the hard-case URL I get something strange in the response's dict items. Maybe my key issue is that I don't know the correct way to parse the response headers to extract what I need.

I've tried a third approach using urlparse:

import os.path
from urllib.parse import urlparse

result = urlparse(url)
print(os.path.basename(result.path))  # 'photo-1472214103451-9374bd1c798e'

which yields the filename, but again the extension is missing.

The ideal solution would get the filename, file extension and file content in one go, preferably while validating that the URL actually points to an image and not something else.

UPD:

The result[1] element in result = urllib.request.urlretrieve(self.url) seems to contain the Content-Type, but I can't figure out how to extract it correctly.

Edgar Navasardyan
  • Without getting the file, it seems impossible to know what is inside of it, unless there is a mimetypes equivalent that works on links. What about this: https://stackoverflow.com/questions/10543940/check-if-a-url-to-an-image-is-up-and-exists-in-python#10543969 – asylumax Jun 04 '20 at 13:14

1 Answer

One way is to query the content type:

>>> from urllib.request import urlopen
>>> response = urlopen(url)
>>> response.info().get_content_type()
'image/jpeg'

or using urlretrieve as in your edit:

>>> import urllib.request
>>> response = urllib.request.urlretrieve(url)  # returns (local_filename, headers)
>>> response[1].get_content_type()
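
Once the content type is known, it can be turned into an extension for the hard-case URL. The following is only a sketch: `name_and_extension` is a hypothetical helper (not part of any library), and it assumes the caller passes in the string returned by `get_content_type()`. It keeps the extension from the URL path when there is one and otherwise asks the standard `mimetypes` module for a guess.

```python
import mimetypes
from os.path import basename, splitext
from urllib.parse import urlparse


def name_and_extension(url, content_type=None):
    """Return (filename, extension) for a URL.

    Hypothetical helper: content_type is assumed to be a bare MIME type
    such as 'image/png', e.g. the result of get_content_type().
    """
    # Use only the path component so the query string is ignored.
    filename, ext = splitext(basename(urlparse(url).path))
    if not ext and content_type:
        # No extension in the path: derive one from the MIME type.
        # guess_extension returns None for types it does not know.
        ext = mimetypes.guess_extension(content_type) or ""
    return filename, ext
```

The exact extension returned for a given type (for instance for 'image/jpeg') can depend on the Python version and the system's MIME tables, so it may be worth normalising the result if a specific spelling is required.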
kabanus
  • and do you think `response[1].get_content_type().split('/')[0] == 'image'` would be an appropriate validation that the url contains an image? – Edgar Navasardyan Jun 04 '20 at 15:20
  • @Edgar For most modern websites. You could fall back on a path check if there is no content type. If both are missing though, how would anyone (including your browser) know what the content is? It's up to the host if they want their stuff found. – kabanus Jun 04 '20 at 15:27
  • @kabanus, do you mean that I should do what I said in the comment, and in case there is no content type, fall back on the path to check it? – Edgar Navasardyan Jun 04 '20 at 15:36
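
The check discussed in these comments (trust the content type when present, otherwise fall back on the path) could be sketched roughly as below. `looks_like_image` is a hypothetical name, and the function assumes the caller passes the raw Content-Type header value, or None when the header is absent. One caveat: `get_content_type()` on the headers object returns a default of 'text/plain' when the header is missing, so something like `headers.get("Content-Type")` is a better way to detect that it is absent.

```python
import mimetypes
from urllib.parse import urlparse


def looks_like_image(url, content_type=None):
    """Best-effort check that a URL points to an image (hypothetical helper).

    content_type is assumed to be the raw Content-Type header value,
    or None if the server did not send one.
    """
    if content_type:
        # Header present: trust its main type ('image/jpeg' -> 'image').
        return content_type.split("/")[0].strip().lower() == "image"
    # No header: fall back on the extension in the URL path, if any.
    guessed_type, _ = mimetypes.guess_type(urlparse(url).path)
    return guessed_type is not None and guessed_type.startswith("image/")
```

This only says what the server or the URL claims. Verifying that the downloaded bytes really are an image would need a separate step, such as opening them with an imaging library.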