How can I download a file from a URL that does not contain a file name, and how can I find out its file extension or file type?

Question

Here is an example URL: "https://procurement-notices.undp.org/view_file.cfm?doc_id=257280". If you open it in a browser, a file starts downloading to your system.

I want to download this file using Python and store it somewhere on my computer.

This is what I tried:

import requests
# first_url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
second_url="https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"

myfile = requests.get(second_url , allow_redirects=True)

# this works for the first URL
# open('example.pdf' , 'wb').write(myfile.content)

# this didn't work for either of them
# open('example.txt' , 'wb').write(myfile.content)

# this works for the second URL
open('example.doc' , 'wb').write(myfile.content)

First: if I put first_url in the browser, it downloads a PDF file, while second_url downloads a .doc file. How can I know what type of file a URL will return, so that I can pass the correct file name and extension to open(...)?

Second: if I open the second URL in the browser, a file named "T__proc_notices_notices_080_k_notice_doc_79545_770020123.docx" starts downloading. How can I get this file name when I download the file from Python?

If you know a better solution, kindly let me know how to implement it.

Kindly also have a quick look at the question "Downloading Files from URLs and zip downloaded files in python".

gaurav · Answer 1 · 2021-06-15T07:57:20.150

You can use the Content-Type response header: for the first URL it is application/pdf and for the second URL it is application/msword, so you can save the file accordingly. You can build an extension dictionary that maps the possible MIME types to their file extensions and look up the header value in it. Your second question is much the same as this one, so I am taking the two URLs from that question, and for the file names I am just using integers.

import requests

all_Urls = ['https://omextemplates.content.office.net/support/templates/en-us/tf16402488.dotx',
            'https://procurement-notices.undp.org/view_file.cfm?doc_id=257280']

# map each expected MIME type to its file extension
extension_dict = {'application/vnd.openxmlformats-officedocument.wordprocessingml.document': '.docx',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.template': '.dotx',
    'application/vnd.ms-word.document.macroEnabled.12': '.docm',
    'application/vnd.ms-word.template.macroEnabled.12': '.dotm',
    'application/pdf': '.pdf',
    'application/msword': '.doc'}

for i, url in enumerate(all_Urls):
    resp = requests.get(url)
    response_headers = resp.headers

    # drop any parameters such as "; charset=..." before the lookup
    mime_type = response_headers['Content-Type'].split(';')[0].strip()
    file_extension = extension_dict[mime_type]

    # the extension already starts with a dot, so don't add another one
    with open(f"{i}{file_extension}", 'wb') as f:
        f.write(resp.content)

For the list of MIME types, see this answer.
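As an alternative to maintaining the dictionary by hand, the standard library's mimetypes module can often map a Content-Type value to an extension. The following is only a minimal sketch: the helper name extension_from_content_type and the ".bin" fallback are assumptions made for illustration, and the extension returned for a given MIME type can differ between Python versions and systems.

```python
import mimetypes


def extension_from_content_type(content_type, default=".bin"):
    """Guess a file extension (with leading dot) from a Content-Type value.

    Hypothetical helper: the name and the ".bin" fallback are illustrative.
    """
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    # guess_extension returns None when the MIME type is unknown.
    return mimetypes.guess_extension(mime_type) or default
```

With requests, it could be called as extension_from_content_type(resp.headers.get('Content-Type', '')), which falls back to the default when the header is missing.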

Ole · Answer 2 · 2021-06-15T07:41:49.903

myfile.headers['content-type'] gives you the MIME type of the URL's content, and myfile.headers['content-disposition'] gives you information such as the file name (if the response contains this header at all).
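One possible way to pull the file name out of a Content-Disposition value is to let the standard library's email package parse the header parameters. This is a sketch under assumptions, not a definitive implementation: the helper name filename_from_content_disposition is made up for illustration, and it keeps only the last path component because a server may put a full path into the header.

```python
import os
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the file name from a Content-Disposition value, or None.

    Hypothetical helper: the name is illustrative, not part of requests.
    """
    if not header_value:
        return None
    # Reuse the email package's parser for header parameters.
    msg = Message()
    msg["Content-Disposition"] = header_value
    filename = msg.get_filename()
    if not filename:
        return None
    # Keep only the last path component, treating "\" like "/".
    return os.path.basename(filename.replace("\\", "/")) or None
```

With requests, it could be called as filename_from_content_disposition(myfile.headers.get('content-disposition')), so a missing header simply yields None and the caller can choose its own fallback name.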


Actually, `myfile.headers['content-disposition']` gives you more information than it should for `second_url`, it contains the full path to the file on the host computer... – Ole Jun 15 '21 at 07:15
