65

I'm using the Python requests library to get a PDF file from the web. This works fine, but I now also want the original filename. When I open a PDF in Firefox and click download, it already proposes a filename for saving the PDF. How do I get this filename?

For example:

import requests
r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
print(r.headers['content-type'])  # prints 'application/pdf'

I checked r.headers for anything interesting, but there's no filename in there. I was actually hoping for something like r.filename.

Does anybody know how I can get the filename of a downloaded PDF file with the requests library?

funnydman
kramer65
  • Interesting – I was going to say, "well *obviously* `0c9605301e48beda0f000000.pdf`" (as that is in the request) but fortunately I decided to test it first. And FireFox wants to save it as "Mater Sci Eng B47 (1997) 33.pdf". – Jongware Aug 04 '15 at 09:04
  • 1
    How are you checking the headers? The filename _is_ there, `content-disposition : inline; filename="Mater Sci Eng B47 (1997) 33.pdf"`. FWIW, many PDFs have a [Title](http://stackoverflow.com/q/6367304/4014959) embedded in them, but not all, and it may not be easy to access if the PDF is in binary form. – PM 2Ring Aug 04 '15 at 09:18

8 Answers

98

The filename is specified in the Content-Disposition HTTP header. To extract the name, you would do:

import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]

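The name is extracted from the header value with a regular expression (the re module). Note that this pattern keeps any quotes around the name and captures everything after filename=, including later parameters such as filename*=.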
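A regex-free alternative, as a minimal sketch using only the standard library: the legacy email.message.Message class can parse parameterised headers, so it strips surrounding quotes and decodes the RFC 2231 filename*= form. The helper name filename_from_disposition is hypothetical, and r is assumed to be the response object from the question.

```python
from email.message import Message


def filename_from_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() unquotes the value and decodes filename*=charset''... forms
    return msg.get_filename()


# Usage with requests (assumes r = requests.get(url)):
# fname = filename_from_disposition(r.headers.get("content-disposition"))
```

When a header carries both filename= and filename*=, this sketch returns the plain filename= value, because that parameter is looked up first.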

Nilpo
Eugene V
  • 1
    This wouldn't work if the file name is encoded as utf8. Any suggestion there? – Tony Abou-Assaleh Feb 21 '17 at 04:14
  • 7
    findall returns a list of matches. You would need an index like this `fname = re.findall("filename=(.+)", d)[0]`. – Nilpo Nov 14 '18 at 11:45
  • 1
    This one is incomplete: a filename can be enclosed in quotes. – Michael-O May 18 '20 at 23:16
  • 5
    @Michael-O try using `"filename=\"(.+)\""` to remove quotes – sheunglaili Oct 15 '20 at 01:42
  • 1
    Just a side case that sometimes expected filenames are not provided within headers, especially with social media CDN links. In that case, you can formulate your own base name (maybe parse the url for the root filename that you would like to use), and then ascertain the correct extension to use as a suffix with something like `resp.headers['Content-Type'].split('/')[-1]`. – weezilla Jun 17 '21 at 17:17
  • In my case, the regex did not work because my `'content-disposition'` also contains `filename*=UTF-8`: `'Content-Disposition': "attachment; filename=NameOfTheFile.zip; filename*=UTF-8''NameOfTheFile.zip"` – vvvvv Oct 03 '21 at 10:17
  • You can use `cgi.parse_header` and `email.header.decode_header` to parse the file name properly – sshilovsky Mar 16 '23 at 07:18
  • @vvvvv @tony-abou-assaleh, I use `unquote(header.split("filename*=")[1].replace('UTF-8\'\'',""))` for Unicode – matan h Jun 29 '23 at 08:35
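
The Content-Type suggestion in the comments above can be sketched with the standard library's mimetypes module, which avoids returning odd suffixes for types whose subtype is not a usable extension. This is only an illustration; the helper name extension_from_content_type is hypothetical.

```python
import mimetypes


def extension_from_content_type(content_type):
    """Map a Content-Type value to a file extension, or '' if unknown."""
    # Drop parameters such as "; charset=utf-8" before the lookup
    mime = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime) or ""


# Usage with requests (assumes resp = requests.get(url)):
# suffix = extension_from_content_type(resp.headers.get("Content-Type", ""))
```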
20

Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I parse it from the download URL:

import re
import requests
from requests.exceptions import RequestException


url = 'http://www.example.com/downloads/sample.pdf'

try:
    with requests.get(url) as r:

        fname = ''
        if "Content-Disposition" in r.headers.keys():
            fname = re.findall("filename=(.+)", r.headers["Content-Disposition"])[0]
        else:
            fname = url.split("/")[-1]

        print(fname)
except RequestException as e:
    print(e)

There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.
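
One such way, as a sketch that still uses only the standard library: urllib.parse drops the query string and decodes percent-escapes such as %20, which url.split("/")[-1] does not. The function name filename_from_url is hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, percent-decoded."""
    path = urlsplit(url).path  # drops ?query and #fragment
    return unquote(posixpath.basename(path))
```

It returns an empty string when the URL path ends in a slash, so the caller may still need a default name.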

ruohola
Nilpo
  • 1
    I suggest calling `urllib.parse.unquote` in the else clause so you don't get `%20`s in the filename. – Noumenon Jun 24 '21 at 00:04
11

Apparently, for this particular resource it is in:

r.headers['content-disposition']

Don't know if it is always the case, though.

Maksim Solovjov
  • Not all responses contain the 'content-disposition' header, but as per one of the comments, it seems they are available in this case. – Abhinav Sood Jun 23 '18 at 22:20
9

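An easy Python 3 implementation to get the filename from Content-Disposition:

import requests
response = requests.get("<your-url>")  # placeholder: substitute the real URL
disposition = response.headers.get("Content-Disposition", "")  # "" if the header is absent
if "filename=" in disposition:
    print(disposition.split("filename=")[1].strip('"'))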
Akhilesh Joshi
5

You can use werkzeug to parse options headers: https://werkzeug.palletsprojects.com/en/0.15.x/http/#werkzeug.http.parse_options_header

>>> import werkzeug


>>> werkzeug.http.parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})
funnydman
myildirim
2

According to the documentation, neither the Content-Disposition header nor its filename parameter is required. I also checked dozens of links on the internet and found no responses with a Content-Disposition header. So in most cases I wouldn't rely on it much and would instead retrieve the name from the request URL (note: I take it from req.url because there may have been a redirect, and we want the real filename). I used werkzeug because it looks more robust and handles both quoted and unquoted filenames. Eventually I came up with this solution (it requires Python 3.8 or later, because it uses the := operator):

from urllib.parse import urlparse

import requests
import werkzeug


def get_filename(url: str):
    try:
        with requests.get(url) as req:
            if content_disposition := req.headers.get("Content-Disposition"):
                param, options = werkzeug.http.parse_options_header(content_disposition)
                if param == 'attachment' and (filename := options.get('filename')):
                    return filename

            path = urlparse(req.url).path
            name = path[path.rfind('/') + 1:]
            return name
    except requests.exceptions.RequestException as e:
        raise e

I wrote some tests using pytest and requests_mock:

import pytest
import requests
import requests_mock

from main import get_filename

TEST_URL = 'https://pwrk.us/report.pdf'


@pytest.mark.parametrize(
    'headers,expected_filename',
    [
        (
                {'Content-Disposition': 'attachment; filename="filename.pdf"'},
                "filename.pdf"
        ),
        (
                # The string following filename should always be put into quotes;
                # but, for compatibility reasons, many browsers try to parse unquoted names that contain spaces.
                # https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition#directives
                {'Content-Disposition': 'attachment; filename=filename with spaces.pdf'},
                "filename with spaces.pdf"
        ),
        (
                {'Content-Disposition': 'attachment;'},
                "report.pdf"
        ),
        (
                {'Content-Disposition': 'inline;'},
                "report.pdf"
        ),
        (
                {},
                "report.pdf"
        )
    ]
)
def test_get_filename(headers, expected_filename):
    with requests_mock.Mocker() as m:
        m.get(TEST_URL, text='resp', headers=headers)
        assert get_filename(TEST_URL) == expected_filename


def test_get_filename_exception():
    with requests_mock.Mocker() as m:
        m.get(TEST_URL, exc=requests.exceptions.RequestException)
        with pytest.raises(requests.exceptions.RequestException):
            get_filename(TEST_URL)
funnydman
1

Use urllib.request instead of requests, because then you can call urllib.request.urlopen(...).headers.get_filename(), which is safer than some of the other answers for the following reason:

If the [Content-Disposition] header does not have a filename parameter, this method falls back to looking for the name parameter on the Content-Type header.

After that, even safer would be to additionally fall back to the filename in the URL, as another answer does.
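
A minimal sketch of that fallback chain, assuming the headers object is an email.message.Message (the type that urlopen(...).headers derives from). The function name pick_filename and the demo values are hypothetical.

```python
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit


def pick_filename(headers, final_url):
    """Prefer the header-supplied name, then the last URL path segment."""
    # get_filename() checks Content-Disposition's filename, then Content-Type's name
    name = headers.get_filename()
    if name:
        return name
    return unquote(posixpath.basename(urlsplit(final_url).path))


# Offline demonstration: a hand-built header set standing in for resp.headers
demo = Message()
demo["Content-Type"] = 'application/pdf; name="from-type.pdf"'
print(pick_filename(demo, "http://www.example.com/files/0c96.pdf"))

# Usage against a real server (performs a network request):
# import urllib.request
# with urllib.request.urlopen(url) as resp:
#     print(pick_filename(resp.headers, resp.url))
```

Passing resp.url rather than the original URL means the name reflects any redirect that was followed.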

root
0

This is an interesting challenge, as it raises more questions than answers. Here is the OP's link as seen in my Firefox, clearly as a PDF. If I accept the given name, it saves as MaterSciEngB47199733.pdf

enter image description here

The name that Firefox uses may differ from the one Chrome uses, so I tested the exact same link in Edge and got a very similar response.

However, both Firefox and MS Edge show the tab title as PII: S0921-5107(96)02041-7 and, when saving, do not offer the known filename Mater-Sci-Eng-B47-1997-33.pdf but a much shorter one, MaterSciEngB47199733.pdf

And since the user wants the "real name", they can manually edit it back to Mater-Sci-Eng-B47-1997-33.pdf or Mater Sci Eng B47 (1997) 33.pdf, since a Curl by any other name is just as good.

K J