Basically, my goal is to fetch the filename, extension and content of an image by its URL. My function should work for both of these URLs:
easy case: https://image.shutterstock.com/image-photo/bright-spring-view-cameo-island-260nw-1048185397.jpg
hard case (does not end with filename.extension ): https://images.unsplash.com/photo-1472214103451-9374bd1c798e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
Currently, what I have looks like this:
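import urllib.request
from os.path import splitext, basename
from urllib.parse import urlparse

def get_filename_from_url(url):
    # urlretrieve returns a (local_path, headers) tuple, which has no .path attribute
    local_path, headers = urllib.request.urlretrieve(url)
    filename, file_ext = splitext(basename(urlparse(url).path))
    print(filename, file_ext)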
This works fine for the easy case, but it gives no extension for the hard-case URL. I have a feeling that I can use Python's requests module, parse the response headers to find the MIME type, and then map that MIME type to an extension with some guess-type functionality. So I went on to try this:
import requests
response = requests.get(url, stream=True)
Here, someone seems to describe the clue, saying that
but the problem is that with the hard-case URL I get something strange in the response dict items, and maybe my key issue is that I don't know the correct way to parse the response headers to extract what I need.
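For the header-parsing step, here is a minimal sketch, assuming the server sends a usable Content-Type header. The helper name extension_from_headers and the small PREFERRED table are made up for illustration; mimetypes.guess_extension comes from the standard library, and the headers mapping of a requests response can be passed in as-is.

```python
import mimetypes

# Hypothetical override table: pins the extension for common image types so the
# result does not depend on the platform's MIME database.
PREFERRED = {"image/jpeg": ".jpg", "image/png": ".png", "image/gif": ".gif"}


def extension_from_headers(headers):
    """Return (mime_type, extension) from a mapping of response headers.

    Both values are None when there is no Content-Type header; the
    extension alone is None when the MIME type is unknown.
    """
    # requests' response.headers is case-insensitive, plain dicts are not,
    # so the key is matched case-insensitively here.
    raw = ""
    for key, value in headers.items():
        if key.lower() == "content-type":
            raw = value
            break
    # Drop parameters such as "; charset=utf-8".
    mime_type = raw.split(";", 1)[0].strip().lower()
    if not mime_type:
        return None, None
    ext = PREFERRED.get(mime_type) or mimetypes.guess_extension(mime_type)
    return mime_type, ext
```

With requests this would be called as extension_from_headers(response.headers) after response = requests.get(url, stream=True). Note that the header only reflects what the server claims to send.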
I've tried a third approach using urlparse:
import os
from urllib.parse import urlparse
result = urlparse(self.url)
print(os.path.basename(result.path)) # 'photo-1472214103451-9374bd1c798e'
which yields the filename, but again, the extension is missing here...
The ideal solution would be to get the filename, file extension and file content in one go, preferably while validating that the URL actually points to an image and not something else...
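One possible shape for such a one-go function, as a sketch only: it uses urllib.request.urlopen from the standard library, takes the filename from the URL path, and falls back to the Content-Type header when the path has no extension. The names name_and_extension and fetch_image and the timeout default are assumptions for illustration, and the image check trusts the server's header rather than inspecting the bytes.

```python
import mimetypes
import os
import urllib.request
from urllib.parse import urlparse


def name_and_extension(url, content_type):
    """Derive (filename, extension) from the URL path and a MIME type."""
    stem, url_ext = os.path.splitext(os.path.basename(urlparse(url).path))
    # Prefer the extension in the URL; otherwise guess one from the MIME type.
    ext = url_ext or mimetypes.guess_extension(content_type) or ""
    return stem, ext


def fetch_image(url, timeout=10):
    """Download url and return (filename, extension, content).

    Raises ValueError when the server does not report an image/* type.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # response.headers is an email-style message object;
        # get_content_type() strips parameters and lowercases the value.
        content_type = response.headers.get_content_type()
        if not content_type.startswith("image/"):
            raise ValueError(f"not an image: {content_type}")
        content = response.read()
    stem, ext = name_and_extension(url, content_type)
    return stem, ext, content
```

Calling fetch_image(url) on either URL would then return the three values together; the query string of the hard-case URL is ignored because only the path component is used for the name.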
UPD:
The result[1] element in result = urllib.request.urlretrieve(self.url) seems to contain the Content-Type, but I can't figure out how to extract it correctly.
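For reference, the second element returned by urlretrieve is an email-style message object, so its get_content_type() method should give the bare MIME type. A small sketch, where the helper names content_type_from_headers and retrieve_with_type are hypothetical:

```python
import urllib.request
from email.message import Message


def content_type_from_headers(headers: Message) -> str:
    """Return the bare MIME type (e.g. 'image/jpeg') from a headers message."""
    # get_content_type() lowercases the value and strips parameters;
    # it falls back to 'text/plain' when the header is absent.
    return headers.get_content_type()


def retrieve_with_type(url):
    """Download url to a temporary file and return (local_path, mime_type)."""
    local_path, headers = urllib.request.urlretrieve(url)
    return local_path, content_type_from_headers(headers)
```

The plain string is also available as headers["Content-Type"], but that form keeps any parameters such as a charset.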