I have a link to a PDF file that I would like to download. I tried the following:

import requests

class Scraper:

    def __init__(self):
        """Init the class"""

    @staticmethod
    def download(full_url):
        """Download full url pdf"""
        with requests.Session() as req:

            # Init
            r = req.get(full_url, allow_redirects=True)
            localname = 'test.pdf'

            # Download
            if r.status_code == 200: #and r.headers['Content-Type'] == "application/pdf;charset=UTF-8":
                with open(f"{localname}", 'wb') as f:
                    f.write(r.content)
            else:
                pass

However, after downloading, when I try to open it on my computer I receive the message:

"Could not open [FILENAME].pdf because it is either not a supported file type or because the file has been damaged (...)"

  • What is the reason for this? Is it because the first time you visit this page you get redirected and you need to select some preferences?
  • How can we resolve this?
WJA

1 Answer

You haven't passed the parameters required to start the download. If you navigate to the URL in a browser, you will see that you need to click Continue before the download starts. What happens in the background is a GET request to the back end with the query parameters ?switchLocale=y&siteEntryPassthrough=true, and that request is what starts the download.

You can see this request in your browser's developer tools, under the Network tab.

import requests


# Query parameters the site sends when "Continue" is clicked
params = {
    'switchLocale': 'y',
    'siteEntryPassthrough': 'true'
}


def main(url, params):
    r = requests.get(url, params=params)
    # Stop on an HTTP error instead of saving an error page as test.pdf
    r.raise_for_status()
    with open("test.pdf", 'wb') as f:
        f.write(r.content)


main("https://www.blackrock.com/uk/individual/literature/annual-report/blackrock-index-selection-fund-en-gb-annual-report-2019.pdf", params)
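To compare what the script sends against what the Network tab shows, it can help to look at the final URL. The sketch below builds the query string with the standard library only, so it runs without a network call. `requests` appends the same key/value pairs when you pass `params=`, and you can inspect the result there through `r.url`. The names `BASE_URL` and `full_url` are illustrative and not part of the original answer.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Same URL and parameters as in the answer above.
BASE_URL = (
    "https://www.blackrock.com/uk/individual/literature/annual-report/"
    "blackrock-index-selection-fund-en-gb-annual-report-2019.pdf"
)
params = {
    "switchLocale": "y",
    "siteEntryPassthrough": "true",
}

# urlencode() turns the dict into "key=value&key=value" in insertion order.
full_url = f"{BASE_URL}?{urlencode(params)}"
print(full_url)

# Round trip: parse the query string back to confirm nothing was lost.
print(parse_qs(urlsplit(full_url).query))
```

If the printed URL matches the request listed in the Network tab, the script is asking the server for the same resource as the browser does after Continue is clicked.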
  • Ok, so you need to know these types of params beforehand? – WJA Apr 03 '20 at 15:41
  • @JohnAndrews Indeed. Those `parameters` are `pre-defined` on the server side. Otherwise you would have to implement a `machine learning` model to deal with all cases, such as following `click buttons` or locating the `frame` that contains `download` words. A long discussion :P – αԋɱҽԃ αмєяιcαη Apr 03 '20 at 15:43
  • Would love to have that discussion someday :) – WJA Apr 03 '20 at 15:43
  • @JohnAndrews [scikit-learn](https://scikit-learn.org/stable/) will be your best friend for that task – αԋɱҽԃ αмєяιcαη Apr 03 '20 at 15:46
  • That is a good reference. But how do you define whether the pdf has actually been downloaded correctly? That would be a good input to your model, to be able to know if it failed or succeeded. – WJA Apr 03 '20 at 15:47
  • @JohnAndrews something like `print(r.headers.get("Content-Type"))`, or using `magic` to check the file [content](https://stackoverflow.com/questions/43580/how-to-find-the-mime-type-of-a-file-in-python) – αԋɱҽԃ αмєяιcαη Apr 03 '20 at 15:55
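
The check suggested in the last comment can be sketched without extra dependencies. PDF files start with the signature `%PDF-`, so inspecting the first bytes of the response body is a cheap way to tell a real PDF from an HTML page saved under a `.pdf` name. The helper name `looks_like_pdf` is hypothetical. Requiring the header to contain `application/pdf` is an assumption that may be too strict for servers that send a generic type, and some readers also accept a signature that does not sit at offset 0.

```python
from typing import Optional

PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(content: bytes, content_type: Optional[str] = None) -> bool:
    """Return True if the payload plausibly is a PDF document."""
    # A PDF body begins with the "%PDF-" signature; an HTML page does not.
    if not content.startswith(PDF_SIGNATURE):
        return False
    # Optional second check: if a Content-Type was given, it must mention PDF.
    if content_type is not None and "application/pdf" not in content_type.lower():
        return False
    return True


print(looks_like_pdf(b"%PDF-1.7\n", "application/pdf;charset=UTF-8"))
print(looks_like_pdf(b"<!DOCTYPE html><html>", "text/html"))
```

With `requests`, the call would be `looks_like_pdf(r.content, r.headers.get("Content-Type"))` before writing the file, so that a failed download is reported instead of being saved as a damaged `test.pdf`.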