Hi there and welcome to Stack Overflow!
If you want to use Python, use the requests library to fetch the initial page and inspect its content (you will need to install it first via pip or pipenv):
>>> import requests
>>> r = requests.get('http://dof.gob.mx/nota_to_pdf.php?fecha=25/07/2018&edicion=MAT')
>>> r.status_code
200
>>> r.headers['content-type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'
>>> r.text
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html
xmlns="http://www.w3.org/1999/xhtml">\r\n<head>\r\n<meta http-equiv="Content-
Type" content="text/html; charset=utf-8" />\r\n<title>Diario Oficial de la
Federación</title>\r\n</head>\r\n\r\n<body>\r\n<script>\r\n
(function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function()
{\r\n (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new
Date();a=s.createElement(o),\r\n m=s.getElementsByTagName(o)
[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\r\n })
(window,document,\'script\',\'//www.google-
analytics.com/analytics.js\',\'ga\');\r\n\r\n ga(\'create\', \'UA-32467343-1\', \'auto\');\r\n
ga(\'send\', \'pageview\');\r\n\r\n</script>\r\n</body>\r\n</html><script>
self.location=(\'abrirPDF.php?archivo=25072018-MAT.pdf&anio=2018&repo=\');
</script><html>\n<head>\n<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">\n\t<script>\n\tfunction BorrarPDF()
\n\t{\n\t\tdocument.getElementById(\'cerrar\').src=\'cerrar_doc_imagen.php
archivo=\'+document.getElementById(\'pdf\').value;\n\t}\n\t
</script>\n</head>\n<body onUnload="BorrarPDF()">\n\n
<input type="hidden" value="25072018-MAT.pdf" id="pdf" name="pdf">\n\n
<iframe id="cerrar" width="1px" height="1px" scrolling="no"
frameborder="0" marginwidth="0px" marginheight="0px">
</iframe>\n\n</body>\n</html>\n'
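If you dig through that HTML, you will see that the page uses a JavaScript self.location assignment to redirect to the PDF file when the page loads. requests does not execute JavaScript, so it returns this HTML wrapper instead of following the redirect.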
The actual URL for the PDF is:
http://dof.gob.mx/abrirPDF.php?archivo=25072018-MAT.pdf&anio=2018&repo=
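If you would rather not read the HTML by hand, here is a minimal sketch of pulling the redirect target out of the page text. The function name extract_redirect_target and the regular expression are my own, not part of requests, and they assume the page keeps using the same self.location=('...') pattern shown above.

```python
import re
from urllib.parse import urljoin

# Matches self.location=('...') or self.location = "..." and captures the target.
# Assumption: the site keeps emitting the redirect in this exact style.
_SELF_LOCATION = re.compile(r"""self\.location\s*=\s*\(?\s*['"]([^'"]+)['"]""")


def extract_redirect_target(html, base_url):
    """Return the absolute URL assigned to self.location in html, or None."""
    match = _SELF_LOCATION.search(html)
    if match is None:
        return None
    # The target is relative, so resolve it against the URL of the page itself.
    return urljoin(base_url, match.group(1))


# A trimmed copy of the relevant part of r.text from the session above.
sample = (
    "<script>\n"
    "self.location=('abrirPDF.php?archivo=25072018-MAT.pdf&anio=2018&repo=');\n"
    "</script>"
)
page_url = "http://dof.gob.mx/nota_to_pdf.php?fecha=25/07/2018&edicion=MAT"
print(extract_redirect_target(sample, page_url))
```

In a real session you would pass r.text and r.url instead of the sample strings. A regex is fragile against markup changes, so treat this as a convenience and not a robust parser.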
So, if you do the same process again with the requests library, this time specifying the actual PDF file:
>>> import requests
>>> r = requests.get('http://dof.gob.mx/abrirPDF.php?archivo=25072018-MAT.pdf&anio=2018&repo=')
>>> r.status_code
200
>>> r.headers['content-type']
'application/pdf'
Now you have the PDF in the body of the response (r.content holds the raw bytes), which you can write to a file opened in binary mode.
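To save it to disk, something along these lines should work. The helper names (looks_like_pdf, save_pdf, download_pdf) are hypothetical ones I made up for this sketch, and the signature check assumes the server sends a normal PDF that starts with the %PDF- header, which guards against accidentally saving the HTML wrapper page.

```python
from pathlib import Path


def looks_like_pdf(data):
    """Cheap sanity check: PDF files normally begin with the bytes %PDF-."""
    return data.startswith(b"%PDF-")


def save_pdf(data, path):
    """Write raw PDF bytes to path and return the number of bytes written."""
    if not looks_like_pdf(data):
        raise ValueError("response body does not start with the PDF signature")
    # write_bytes opens the file in binary mode, so nothing gets re-encoded.
    return Path(path).write_bytes(data)


def download_pdf(url, path):
    """Fetch url with requests and save the body as a PDF file."""
    import requests  # third-party; imported here so the helpers above work without it

    r = requests.get(url, timeout=30)
    r.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    return save_pdf(r.content, path)
```

You would call it as download_pdf('http://dof.gob.mx/abrirPDF.php?archivo=25072018-MAT.pdf&anio=2018&repo=', '25072018-MAT.pdf'). Use r.content and not r.text here, since r.text would try to decode the binary data as a string.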
You can do the same with cURL; you just need to ensure that you are requesting the right URL (which admittedly is obscured by the page's JavaScript redirect, probably by design).
I hope that helps!