Was hoping someone could help me understand what's going on:

I'm using Selenium with the Firefox browser to download a PDF (I need Selenium to log in to the corresponding website):

    le = browser.find_elements_by_xpath('//*[@title="Download PDF"]')
    time.sleep(5)
    if le:
        pdf_link = le[0].get_attribute("href")
        browser.get(pdf_link)

The code does download the PDF, but after that it just stays idle. This seems to be related to the following browser settings:

    fp.set_preference("pdfjs.disabled", True)
    fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")

If I disable the first, it doesn't hang, but it opens the PDF instead of downloading it. If I disable the second, a "Save As" pop-up window shows up. Could someone explain how to handle this?

LazyCat

1 Answer

For me, the best way to solve this was to let Firefox render the PDF in the browser via pdf.js and then fetch the same URL again with the Python requests library, with Selenium's cookies attached. More explanation below:

There are several ways Firefox can handle a PDF under Selenium. If you're using a recent version of Firefox, it'll most likely render the PDF inline via pdf.js. On its own that isn't ideal, because the file is displayed rather than downloaded.

You can disable pdf.js via Selenium options, but this will likely lead to the issue in this question, where the browser gets stuck. This might be because of an unknown MIME type, but I'm not totally sure. (There's another Stack Overflow answer that says it also depends on the Firefox version.)

However, we can bypass this by copying Selenium's session cookies into a requests.Session().

Here's a toy example:

    import requests
    from selenium import webdriver

    pdf_url = "/url/to/some/file.pdf"

    # set up the driver with whatever options you need
    options = webdriver.FirefoxOptions()
    driver = webdriver.Firefox(options=options)

    # do whatever you need to do to auth/login/click/etc.

    # navigate to the PDF URL first, in case the PDF link issues a
    # redirect; the new requests session has none of the browser's cookies yet
    driver.get(pdf_url)

    # get the final URL from Selenium
    current_pdf_url = driver.current_url

    # create a requests session
    session = requests.Session()

    # add Selenium's cookies to requests
    selenium_cookies = driver.get_cookies()
    for cookie in selenium_cookies:
        session.cookies.set(cookie["name"], cookie["value"])

    # Note: If headers are also important, you'll need to use
    # something like seleniumwire to get the headers from Selenium

    # Finally, re-send the request with the requests session
    pdf_response = session.get(current_pdf_url)

    # access the raw bytes of the response
    pdf_bytes = pdf_response.content
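If you want to go one step further than the toy example, here is a rough sketch of two helpers. The names `copy_selenium_cookies` and `save_pdf` are made up for illustration. The first also carries over each cookie's domain and path (assuming the dicts look like what `driver.get_cookies()` returns), and the second only writes the file if the bytes start with `%PDF-`, which is a cheap way to notice that you got a login page back instead of the PDF.

```python
from pathlib import Path

import requests


def copy_selenium_cookies(selenium_cookies, session):
    """Copy cookie dicts (shaped like driver.get_cookies() output) into a requests session."""
    for cookie in selenium_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            # Keep domain/path so the cookie is only sent where the browser would send it.
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


def save_pdf(pdf_bytes, destination):
    """Write the bytes to disk, but only if they start with the PDF magic number."""
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("response does not look like a PDF (maybe a login page?)")
    path = Path(destination)
    path.write_bytes(pdf_bytes)
    return path
```

With the variables from the example above, usage would be roughly `copy_selenium_cookies(driver.get_cookies(), session)` followed by `save_pdf(session.get(current_pdf_url).content, "file.pdf")`.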

I highly recommend using seleniumwire over regular selenium because it extends Python Selenium to let you inspect request headers, wait for requests to finish, use proxies, and much more.
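For the headers part, here's a small sketch of how you might replay a few browser headers on the requests session. It assumes you already have the browser's request headers as a plain dict, however you captured them. The helper name `apply_browser_headers` and the `FORWARDED_HEADERS` list are my own choices, not part of any library.

```python
import requests

# Headers worth replaying; the Cookie header is skipped on purpose because
# cookies are handled separately through session.cookies.
FORWARDED_HEADERS = ("User-Agent", "Referer", "Accept", "Accept-Language")


def apply_browser_headers(session, browser_headers):
    """Copy a whitelist of captured browser headers onto a requests session."""
    # Header names are case-insensitive, so normalise the captured keys first.
    lowered = {name.lower(): value for name, value in browser_headers.items()}
    for name in FORWARDED_HEADERS:
        value = lowered.get(name.lower())
        if value is not None:
            session.headers[name] = value
    return session
```

Only a short whitelist is copied so that things like `Host` or `Content-Length` from the browser request don't end up on the new request.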

aboutaaron