
I have a Selenium script that, as part of its execution, needs to download a PDF; the download is necessary because the PDF is used later on. I used the profile-preferences method to get the file to download, and this has worked fine on the virtual machine I use for development. However, after moving the script to the live server, it does not download the required PDF at all. Here are the lines I use to set up the Firefox profile:

fxProfile = webdriver.FirefoxProfile()
fxProfile.set_preference("browser.download.folderList",2)
fxProfile.set_preference("browser.download.manager.showWhenStarting",False)
fxProfile.set_preference("browser.download.dir",foldername)
fxProfile.set_preference("browser.helperApps.neverAsk.saveToDisk","application/pdf")
fxProfile.set_preference("pdfjs.disabled",True)
fxProfile.set_preference("plugin.scan.Acrobat", "99.0")
fxProfile.set_preference("plugin.scan.plid.all", False)
fxProfile.set_preference("plugin.disable_full_page_plugin_for_types", "application/pdf")
fxProfile.set_preference("browser.helperApps.alwaysAsk.force", False)
driver = webdriver.Firefox(firefox_profile=fxProfile)

On the virtual machine, the preference lines ended at disabling pdfjs, and this worked fine; the lines after that are extras I added while trying to solve the problem on the live machine.

The variable foldername is correct, as the same variable is used to open and write to a log file, which works fine. As far as I can tell, no OS-level window asking to confirm the download is being opened, since I can still direct the script to click on other parts of the site after the download link has been clicked. I am also giving the script enough time to download the file (30+ seconds for a sub-1 MB PDF on a wired connection should be more than enough).
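Rather than relying on a fixed pause, one way to confirm whether anything arrives in the download folder is to poll it. The following is a minimal sketch, not part of the original script: the helper name `wait_for_download` and its parameters are hypothetical, and it assumes the browser keeps a `.part` (Firefox) or `.crdownload` (Chrome) temporary file next to the target while the download is in progress.

```python
import os
import time


def wait_for_download(folder, timeout=30.0, poll_interval=0.5):
    """Return the path of a finished PDF in `folder`, or None on timeout.

    Hypothetical helper: a PDF counts as finished once the folder holds
    at least one *.pdf file and no *.part / *.crdownload temp files.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(folder)
        in_progress = [n for n in names if n.endswith((".part", ".crdownload"))]
        pdfs = sorted(n for n in names if n.lower().endswith(".pdf"))
        if pdfs and not in_progress:
            return os.path.join(folder, pdfs[0])
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

Calling `wait_for_download(foldername)` right after clicking the download link returns `None` if no PDF ever shows up, which at least distinguishes "never started" from "still in progress".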

The problem is the live machine is a server and as such has no physical screen for me to see exactly what's happening, making this much harder to fix. Again, it works fine on my virtual machine where I can see what's happening, but fails to download the PDF every single time on the live server, without throwing any sort of error.
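Even without a physical screen, the browser can still report what it is showing. As a rough sketch (the helper name `capture_debug_state` and the `label` argument are hypothetical), a screenshot and the page HTML can be written to disk right after the download link is clicked, using only the standard WebDriver members `save_screenshot`, `page_source` and `current_url`:

```python
import os


def capture_debug_state(driver, out_dir, label):
    """Write a screenshot and the current page HTML into `out_dir`.

    `driver` is any Selenium WebDriver instance; `label` is a
    hypothetical tag used to name the two output files.
    """
    os.makedirs(out_dir, exist_ok=True)
    png_path = os.path.join(out_dir, label + ".png")
    html_path = os.path.join(out_dir, label + ".html")
    driver.save_screenshot(png_path)
    with open(html_path, "w", encoding="utf-8") as fh:
        # Record the URL first so the dump shows where the browser ended up.
        fh.write("<!-- " + driver.current_url + " -->\n")
        fh.write(driver.page_source)
    return png_path, html_path
```

For example, `capture_debug_state(driver, foldername, "after_click")` leaves two files that can be copied off the server and inspected. Note that a screenshot captures only the page content, so a native dialog outside the page would not appear in it.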

AntlerFox
  • Could you put a breakpoint around the point of downloading and look in the console+network tab of the browser? I'm suspecting it *can* be downloaded, but it doesn't know how to open it. – Len Oct 27 '16 at 08:52
  • I can't look at it directly since the script is running on a server with no physical screen. I'm using pyvirtualdisplay; if there's still a way to do this I can certainly try, but I don't know how – AntlerFox Oct 27 '16 at 08:55
  • I think it's worth investing in a setup where you can run your tests either locally or somewhere else with a screen. Sorry for not being able to help much more. – Len Oct 27 '16 at 09:01
  • @AntlerFox, Are you sure that request should return file as `application/pdf`? You can try to use different `MIME` type: `application/x-pdf, application/acrobat, applications/vnd.pdf, text/pdf, text/x-pdf, application/vnd.cups-pdf` – Andersson Oct 27 '16 at 09:03
  • Unfortunately it's just not feasible. I have my local machine with an Ubuntu VM for development, but the end product is required to work headless on a server; unfortunately, making that swap is where the error has occurred – AntlerFox Oct 27 '16 at 09:03
  • @Andersson no such luck, thank you for the tip though, will remember to check mime types more thoroughly in the future in general – AntlerFox Oct 27 '16 at 09:08
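
The `browser.helperApps.neverAsk.saveToDisk` preference accepts a comma-separated list, so several candidate MIME types can be tried at once. A minimal sketch of building that value follows. The helper name and the candidate list are illustrative assumptions, and `application/octet-stream` is an extra guess that does not come from the comment above:

```python
def build_never_ask_pref(mime_types):
    """Join MIME types into one comma-separated preference value.

    Hypothetical helper: normalises case and whitespace and drops
    duplicates while keeping the original order.
    """
    seen = []
    for mime in mime_types:
        mime = mime.strip().lower()
        if mime and mime not in seen:
            seen.append(mime)
    return ",".join(seen)


# Illustrative candidates only; adjust to what the server actually sends.
PDF_MIME_CANDIDATES = [
    "application/pdf",
    "application/x-pdf",
    "application/acrobat",
    "application/octet-stream",  # assumption: some servers use a generic type
]

# Usage with the profile from the question:
# fxProfile.set_preference("browser.helperApps.neverAsk.saveToDisk",
#                          build_never_ask_pref(PDF_MIME_CANDIDATES))
```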

2 Answers


I solved this problem by passing the Selenium session's cookies to the Python requests library and then fetching the PDF from there. I have a longer writeup in this StackOverflow answer, but here's a quick example:

import requests
from selenium import webdriver

pdf_url = "/url/to/some/file.pdf"

# set up the webdriver (pass your own options/profile here as needed)
driver = webdriver.Firefox()

# do whatever you need to do to auth/login/click/etc.

# navigate to the PDF URL in Selenium first, in case the PDF link issues a
# redirect; a fresh requests session has none of the browser's cookies yet
driver.get(pdf_url)

# get the URL from Selenium 
current_pdf_url = driver.current_url

# create a requests session
session = requests.session()

# add Selenium's cookies to requests
selenium_cookies = driver.get_cookies()
for cookie in selenium_cookies:
    session.cookies.set(cookie["name"], cookie["value"])

# Note: If headers are also important, you'll need to use 
# something like seleniumwire to get the headers from Selenium 

# Finally, re-send the request with requests.session
pdf_response = session.get(current_pdf_url)

# access the bytes response from the session
pdf_bytes = pdf_response.content
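
Since the PDF is needed on disk later, the bytes still have to be written out, and it is worth checking that the response really is a PDF rather than, say, an HTML login page. A minimal sketch, assuming a hypothetical helper `save_pdf_bytes` and relying on the fact that PDF files begin with the `%PDF-` marker:

```python
import os


def save_pdf_bytes(pdf_bytes, dest_path):
    """Write `pdf_bytes` to `dest_path` if it looks like a PDF.

    Hypothetical helper: returns True when the file was written and
    False when the content does not start with the %PDF- marker.
    """
    if not pdf_bytes.startswith(b"%PDF-"):
        return False
    folder = os.path.dirname(dest_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(dest_path, "wb") as fh:
        fh.write(pdf_bytes)
    return True
```

A `False` result from `save_pdf_bytes(pdf_bytes, dest_path)` usually means the cookies did not carry the login over and the server answered with something other than the file.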
aboutaaron

Do yourself a tremendous favor and just use Chrome instead of Firefox. It works like a charm and completely circumvents the problem of Firefox going idle after pdfjs.disabled is set.

In my experience, different browsers have varying pros and cons for different Selenium use cases.

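from time import sleep

from selenium import webdriver
from selenium.webdriver import ChromeOptions

# out_path (download directory) and url (link to the PDF) must be defined by the caller
# Set prefs so Chrome saves PDFs to a local file instead of opening them in its viewer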
options = ChromeOptions()
options.headless = False
options.add_experimental_option('prefs', {
    "download.default_directory": out_path,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True,
})

driver = webdriver.Chrome(options=options)

# load the PDF URL so Chrome downloads it; wait 5 s before and 10 s after (15 s total)
sleep(5)
driver.get(url)
print('okay, probably downloaded...')
sleep(10)
print('okay, done sleeping...')
# quit() ends the whole session, not just the current window
driver.quit()
Etienne Jacquot