Extracting file name from url when its name is not in url

Question

I want to create a download manager that can download multiple files automatically. However, I have a problem extracting the name of the downloaded file from the URL. I tried an answer to "How to extract a filename from a URL and append a word to it?", more specifically:

a = urlparse(URL)
file = os.path.basename(a.path)

but all of the answers there, including the one shown, break when you have a URL such as

URL = "https://calibre-ebook.com/dist/win64"

Downloading it in Microsoft Edge gives you a file named calibre-64bit-6.5.0.msi, but downloading it with Python and using the method from the other question to extract the file name gives you win64 instead, which is not the intended file name.

I expect that the URL results in a redirect (302) to the download in question. So to get the final path, you'd need to make the request in Python and get the new, redirected URL. You may be able to do this with a HEAD request, so that you don't download the item when there is no redirect. — saquintes, Sep 21 '22 at 16:50

score 0 · Answer 1 · answered Sep 21 '22 at 16:57

The URL results in a 302 redirect, so the URL alone doesn't give you enough information to get that basename. You have to read the target URL from the Location header of the 302 response.

import requests

resp = requests.head("https://calibre-ebook.com/dist/win64")

print(resp.status_code, resp.headers['location'])

# 302 https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi

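You'd obviously want more robust handling in case the response is not a 302 (for example, another redirect status such as 301, 307, or 308, or a response with no Location header at all). You'd also want to loop in case the new URL results in another redirect.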
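The two caveats above (non-302 responses and chained redirects) could be handled with a small loop. This is only a sketch: the function name resolve_final_url, the max_hops limit, the 10-second timeout, and the injectable head parameter are illustrative assumptions, not part of requests.

```python
from urllib.parse import urljoin

# Redirect status codes whose response carries a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_final_url(url, head=None, max_hops=10):
    """Follow redirects by hand with HEAD requests and return the last URL.

    `head` is a hypothetical injection point: any callable that behaves like
    requests.head. It defaults to requests.head itself.
    """
    if head is None:
        import requests  # third-party; imported lazily so a stub can be passed in
        head = requests.head

    for _ in range(max_hops):
        resp = head(url, allow_redirects=False, timeout=10)
        location = resp.headers.get("location")
        if resp.status_code not in REDIRECT_CODES or not location:
            # Not a redirect, so this is the final URL. It may still be an
            # error status; the caller should check if that matters.
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError(f"gave up after {max_hops} redirects")
```

Calling resolve_final_url("https://calibre-ebook.com/dist/win64") and then taking the basename of the result's path should give the real file name, as long as the server answers HEAD requests the same way it answers GET.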

score 0 · Accepted Answer · answered Sep 21 '22 at 17:09

The URL https://calibre-ebook.com/dist/win64 is an HTTP 302 redirect to another URL, https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi. You can see this by running a HEAD request, for example in a macOS/Linux terminal (note the 302 status and the location header):

$ curl --head https://calibre-ebook.com/dist/win64
HTTP/2 302
server: nginx
date: Wed, 21 Sep 2022 16:54:49 GMT
content-type: text/html
content-length: 138
location: https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi

The browser follows the HTTP redirect and downloads the file, naming it based on the last URL. If you'd like to do the same in Python, you also need to get to the last URL and derive the file name from it. Note that requests.head() does not follow redirects by default (unlike requests.get()), so pass allow_redirects=True explicitly.

With requests==2.28.1 this code returns the last URL:

import requests
requests.head('https://calibre-ebook.com/dist/win64', allow_redirects=True).url
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
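One caveat about naming: a browser does not rely on the URL alone. If the server sends a Content-Disposition header with a filename parameter, browsers generally prefer that name. The curl output above only covers the 302 response, so whether the final response carries such a header is not shown here. The following is a hedged sketch that uses the standard library's email.message.Message to parse the header value; the helper name filename_from_content_disposition is hypothetical.

```python
import os
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        return None
    # Message knows how to parse "attachment; filename=..." style parameters.
    msg = Message()
    msg["Content-Disposition"] = header_value
    name = msg.get_filename()
    # Strip any directory part so a hostile header cannot point outside
    # the download folder.
    return os.path.basename(name) if name else None
```

With a requests response resp, this would be called as filename_from_content_disposition(resp.headers.get("content-disposition")), falling back to the name taken from the final URL when it returns None.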

If you'd rather use only built-in modules, so you don't need to install external libraries like requests, you can achieve the same with urllib, which follows redirects by default:

import urllib.request
opener = urllib.request.build_opener()
opener.open('https://calibre-ebook.com/dist/win64').geturl()
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'

Then you can split the last URL by / and take the final segment as the file name, for example:

import urllib.request
opener = urllib.request.build_opener()
url = opener.open('https://calibre-ebook.com/dist/win64').geturl()
url.split('/')[-1]
# 'calibre-64bit-6.5.0.msi'
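Splitting on / works for this URL, but it would keep a query string or percent-escapes if the final URL had them. A slightly more defensive sketch, using only the standard library, is shown here. The helper name filename_from_url and the "download" fallback are assumptions for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download"):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path  # drops "?query" and "#fragment"
    name = posixpath.basename(unquote(path))  # decode %20 and similar escapes
    # A URL ending in "/" (or with no path) has no usable last segment.
    return name or default


print(filename_from_url(
    "https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi"
))
# calibre-64bit-6.5.0.msi
```

The fallback matters for a download manager, because a URL such as https://example.com/dir/ has an empty last segment and would otherwise produce an empty file name.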

I was using urllib3==1.26.12, requests==2.28.1 and Python 3.8.9 in the examples. If you are using much older versions, they might behave differently and might need extra flags to ensure redirects are followed.
