11

I am new to Selenium and I am writing a scraper to download PDF files automatically from a given site.

Below is my code:

from selenium import webdriver

fp = webdriver.FirefoxProfile()

fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")

browser = webdriver.Firefox(firefox_profile=fp)
browser.get("http://epaper.dinamalar.com/PUBLICATIONS/DM/MADHURAI/2015/05/26/PagePrint//26_05_2015_001_b2b69fda315301809dda359a6d3d9689.pdf")
webobj = browser.find_element_by_id("download").click();

I followed the steps described in the Selenium documentation and in this link, but the download dialog box is still shown every time and I am not sure why.

Is there any way to fix this? Otherwise, as a workaround, is there a way to specify something like "application/all" so that all files are downloaded without prompting?

alecxe
Gaara

3 Answers

20

Disable the built-in pdfjs plugin and navigate to the URL; the PDF file should then be downloaded automatically. The code:

from selenium import webdriver

fp = webdriver.FirefoxProfile()

fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf")

fp.set_preference("pdfjs.disabled", True)  # < KEY PART HERE (a boolean, not the string "true")

browser = webdriver.Firefox(firefox_profile=fp)
browser.get("http://epaper.dinamalar.com/PUBLICATIONS/DM/MADHURAI/2015/05/26/PagePrint//26_05_2015_001_b2b69fda315301809dda359a6d3d9689.pdf")

Update (the complete code that worked for me):

from selenium import webdriver

mime_types = "application/pdf,application/vnd.adobe.xfdf,application/vnd.fdf,application/vnd.adobe.xdp+xml"

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", "/home/aafanasiev/Downloads")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", mime_types)
fp.set_preference("plugin.disable_full_page_plugin_for_types", mime_types)
fp.set_preference("pdfjs.disabled", True)

browser = webdriver.Firefox(firefox_profile=fp)
browser.get("http://epaper.dinamalar.com/")

webobj_get_link = browser.find_element_by_id("liSavePdf")
webobj_get_object = webobj_get_link.find_element_by_tag_name("a")
webobj_get_object.click()
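
For reference, the same preferences can be collected in a small helper. This is only a sketch: build_download_prefs and make_browser are hypothetical names, not part of Selenium, and make_browser assumes a newer Selenium release (4.x) in which preferences are set on an Options object rather than on a FirefoxProfile.

```python
def build_download_prefs(download_dir, mime_types):
    """Return the Firefox preferences used above as a plain dict."""
    joined = ",".join(mime_types)
    return {
        "browser.download.folderList": 2,  # 2 = use the custom directory below
        "browser.download.manager.showWhenStarting": False,
        "browser.download.dir": download_dir,
        "browser.helperApps.neverAsk.saveToDisk": joined,
        "plugin.disable_full_page_plugin_for_types": joined,
        "pdfjs.disabled": True,  # a real boolean, not the string "true"
    }


def make_browser(prefs):
    """Start Firefox with the given preferences (assumes Selenium 4.x)."""
    # Imported lazily so build_download_prefs can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in prefs.items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

Usage would be along the lines of make_browser(build_download_prefs("/home/jill/Downloads/Dinamalar", ["application/pdf", "application/x-pdf"])).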
alecxe
  • I still face the issue even with the code above. Is there any chance the OS plays a part in this? I use Ubuntu 14.04. – Gaara May 26 '15 at 11:02
  • @Gaara interesting, it works for me: selenium 2.45 + firefox 35.0.1 on Mac. – alecxe May 26 '15 at 11:04
  • Mine is Selenium 2.45.0 on Ubuntu 14.04 with Firefox 38.0. I am trying every possibility. The download pop-up window does not show up among the window handles, and it is not treated as an alert either. Any ideas on what more can be done? I can post a link to my script if you want. – Gaara May 26 '15 at 11:13
  • @Gaara yes, please share the current code you are executing. Thanks. – alecxe May 26 '15 at 11:14
  • Thanks a lot. Here is the link: http://www.codeskulptor.org/#user40_loV03Asao9_0.py The function "download_page_from_child_link()" is responsible for clicking the "download" button and invoking the download dialog box. Please let me know if you need any more information. – Gaara May 26 '15 at 11:21
  • Coming to this late, but it seems Firefox may have added new options, so this no longer works. In the Firefox preferences, under Applications, Portable Document Format is set to "Preview in Firefox". I have confirmed that the download works properly if it is set to "Save File", but I'm not sure how to find out which profile option I can use in code to do that. – UltraBob Nov 15 '18 at 03:02
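
If the dialog still appears with these settings, one thing worth checking (a guess, not a confirmed cause in this thread) is whether the server labels the file with a MIME type that is missing from browser.helperApps.neverAsk.saveToDisk, since that preference is a comma-separated list of MIME types. A minimal standard-library sketch, assuming the server answers HEAD requests; content_type_of and is_covered are hypothetical helper names:

```python
import urllib.request


def content_type_of(url):
    """Return the MIME type the server reports for `url` (parameters stripped)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get_content_type()


def is_covered(mime_type, never_ask_value):
    """Check whether `mime_type` appears in a comma-separated neverAsk list."""
    listed = {item.strip().lower() for item in never_ask_value.split(",")}
    return mime_type.lower() in listed
```

If is_covered(content_type_of(pdf_url), "application/pdf,application/x-pdf") turns out to be False, adding the reported type to the preference may be worth a try.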
1

I tested the following code and successfully downloaded your PDF on Windows 7:

from selenium import webdriver

# download_location must be set beforehand to the path of the target directory
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", download_location)
fp.set_preference("plugin.disable_full_page_plugin_for_types", "application/pdf")
fp.set_preference("pdfjs.disabled", True)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")

driver = webdriver.Firefox(firefox_profile=fp)
driver.implicitly_wait(10)
driver.maximize_window()
driver.get("http://epaper.dinamalar.com/")
element = driver.find_element_by_css_selector("li#liSavePdf>a>img")
element.click()
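
The click only starts the download, so a script that exits right away may cut it off. A rough sketch of waiting for it to finish, assuming the download directory starts out empty and that Firefox keeps a ".part" file next to the target while the download is in progress; wait_for_download is a hypothetical helper, not a Selenium call:

```python
import os
import time


def wait_for_download(directory, timeout=30.0, poll_interval=0.5):
    """Wait until `directory` holds a finished file; return its path or None."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(directory)
        in_progress = [n for n in names if n.endswith(".part")]
        finished = [n for n in names if not n.endswith(".part")]
        # Done once at least one file exists and no partial file remains.
        if finished and not in_progress:
            return os.path.join(directory, sorted(finished)[0])
        time.sleep(poll_interval)
    return None
```

It could be called right after element.click(), for example wait_for_download(download_location).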
esoleco
0

Since no HTML is available, my guess is that this line

webobj = browser.find_element_by_id("download").click();

actually triggers the onclick event, but the result is not handled properly. In other words, what is missing is the location where the .pdf file should be stored. I have very little experience with Python, but one solution could be to use an HTTP client library that lets you download files directly, something like C#'s WebClient.DownloadFile Method (String, String). Used properly, it lets you skip Selenium entirely for this step.

Maybe something like this post will be a good start.
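
In Python, the idea could look roughly like the sketch below, which uses only the standard library. It assumes the PDF URL is already known and can be fetched without cookies or a logged-in browser session, which may not hold for this site; filename_from_url and download_file are hypothetical helper names.

```python
import os
import shutil
import urllib.request
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.pdf"):
    """Return the last path component of the URL, or a default name."""
    name = os.path.basename(unquote(urlsplit(url).path))
    return name or default


def download_file(url, dest_dir):
    """Stream the resource at `url` into `dest_dir` and return the saved path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, filename_from_url(url))
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

For the file in the question, this would be called as download_file(pdf_url, "/home/jill/Downloads/Dinamalar"), with pdf_url set to the address passed to browser.get().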

ekostadinov