
I want to download a PDF file using Python. I am aware that there are several SO Q&As about this. However, I was only able to find cases where the URL points directly at the file, as in http://www.example.com/example.pdf.

The URL I use to download the file is the following: http://dof.gob.mx/nota_to_pdf.php?fecha=25/07/2018&edicion=MAT. If I open a browser and paste the URL into the address bar, I am taken to a blank page where I am prompted to save the file.

When I try the methods shown on several tutorial sites, or follow the advice I find in other SO questions, I am only able to download the HTML. The same thing happens when I use curl in the terminal.

Any help will be deeply appreciated.

Nerdrigo

2 Answers


Hi there and welcome to Stack Overflow!

If you want to use Python, use the requests library to fetch the initial page and inspect what it returns (you will need to install it first via pip or pipenv):

>>> import requests
>>> r = requests.get('http://dof.gob.mx/nota_to_pdf.php?fecha=25/07/2018&edicion=MAT')
>>> r.status_code
200
>>> r.headers['content-type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'
>>> r.text
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html 
xmlns="http://www.w3.org/1999/xhtml">\r\n<head>\r\n<meta http-equiv="Content- 
Type" content="text/html; charset=utf-8" />\r\n<title>Diario Oficial de la 
Federación</title>\r\n</head>\r\n\r\n<body>\r\n<script>\r\n  
(function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function() . 
{\r\n  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new
Date();a=s.createElement(o),\r\n  m=s.getElementsByTagName(o
[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\r\n  })
(window,document,\'script\',\'//www.google-
analytics.com/analytics.js\',\'ga\');\r\n\r\n  ga(\'create\', \'UA-32467343-1\', \'auto\');\r\n
ga(\'send\', \'pageview\');\r\n\r\n</script>\r\n</body>\r\n</html><script> 
self.location=(\'abrirPDF.php?archivo=25072018-MAT.pdf&anio=2018&repo=\'); 
</script><html>\n<head>\n<meta http-equiv="Content-Type" content="text/html; 
charset=iso-8859-1">\n\t<script>\n\tfunction BorrarPDF()
\n\t{\n\t\tdocument.getElementById(\'cerrar\').src=\'cerrar_doc_imagen.php
archivo=\'+document.getElementById(\'pdf\').value;\n\t}\n\t
</script>\n</head>\n<body onUnload="BorrarPDF()">\n\n
<input type="hidden" value="25072018-MAT.pdf" id="pdf" name="pdf">\n\n
<iframe id="cerrar" width="1px" height="1px" scrolling="no" 
frameborder="0" marginwidth="0px" marginheight="0px">
</iframe>\n\n</body>\n</html>\n'

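If you dig through that HTML, you will see that the page uses self.location to redirect to the PDF file when the page loads. A browser runs that script and follows the redirect; requests and curl do not execute JavaScript, so they stop at the HTML.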

The actual URL for the PDF is:

http://dof.gob.mx/abrirPDF.php?archivo=25072018-MAT.pdf&anio=2018&repo=
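
If you would rather not read the HTML by eye, the redirect target can be pulled out programmatically. The following is only a minimal sketch: the function name extract_js_redirect and the regular expression are my own illustration, and it assumes the page keeps using the self.location=('...') form shown above. It resolves the relative path against the page URL with urljoin:

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Assumption: the redirect is written as self.location=('...') or
# self.location="..." inside an inline script, as in the HTML above.
_REDIRECT_RE = re.compile(r'self\.location\s*=\s*\(?\s*["\']([^"\']+)["\']')


def extract_js_redirect(html: str, base_url: str) -> Optional[str]:
    """Return the absolute URL assigned to self.location, or None if absent."""
    match = _REDIRECT_RE.search(html)
    if match is None:
        return None
    # The target is relative (abrirPDF.php?...), so resolve it against the page URL.
    return urljoin(base_url, match.group(1))


# A trimmed-down copy of the relevant part of the response body.
sample_html = (
    "<script>self.location=('abrirPDF.php?archivo=25072018-MAT.pdf"
    "&anio=2018&repo=');</script>"
)
page_url = "http://dof.gob.mx/nota_to_pdf.php?fecha=25/07/2018&edicion=MAT"
print(extract_js_redirect(sample_html, page_url))
```

In practice you would pass r.text and r.url from the first request instead of the sample strings. If the site changes how it redirects, the pattern will need adjusting.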

So, repeat the same process with the requests library, this time requesting the actual PDF URL:

>>> import requests
>>> r = requests.get('http://dof.gob.mx/abrirPDF.php?archivo=25072018-MAT.pdf&anio=2018&repo=')
>>> r.status_code
200
>>> r.headers['content-type']
'application/pdf'

Now you have the PDF in the body of the response (r.content holds the raw bytes), and you can write it to a file opened in binary mode.

You can do the same with cURL. You just need to make sure you request the right URL (which admittedly is obscured by the page's JavaScript, probably by design).

I hope that helps!

tatlar
  • Thank you so much for your help, I'll definitely keep that in mind next time I try to download something. After your observation it was easier (for me) to download the file using `curl`; now I'm going to try to do it with Python. – Nerdrigo Jul 25 '18 at 23:20

A function doing the job with a progress bar:

from tqdm import tqdm
import requests

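def download_file(url, filename, chunk_size=8192):
    # Stream the body so the whole file is never held in memory at once.
    response = requests.get(url, stream=True)
    response.raise_for_status()  # fail early on HTTP errors

    with open(filename, "wb") as handle:
        # iter_content() defaults to 1-byte chunks, so pass an explicit size.
        for data in tqdm(response.iter_content(chunk_size=chunk_size)):
            handle.write(data)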
Learning is a mess