2

I want to download a PDF from an online magazine. To open it, I must log in first; then I can open the PDF and download it.

The following is my code. It can log in to the page and open the PDF, but it cannot download the PDF because I am not sure how to simulate the click on Save. I use Firefox.

import os, time
from selenium import webdriver
from bs4 import BeautifulSoup

# Use the Firefox downloader to get the file
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir", 'D:/eBooks/Stocks_andCommodities/2008/Jul/')
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
fp.set_preference("pdfjs.disabled", True)  # boolean preference, not the string "true"

# Disable the Adobe Acrobat PDF preview plugin
fp.set_preference("plugin.scan.plid.all", False)
fp.set_preference("plugin.scan.Acrobat", "99.0")

browser = webdriver.Firefox(firefox_profile=fp)

# Get the login web page
web_url = 'http://technical.traders.com/sub/sublogin2.asp'
browser.get(web_url)

# Simulate the authentication
user_name = browser.find_element_by_css_selector('#SubID > input[type="text"]')
user_name.send_keys("thomas2003@test.net")
password = browser.find_element_by_css_selector('#SubName > input[type="text"]')
password.send_keys("LastName")
time.sleep(2)
submit = browser.find_element_by_css_selector('#SubButton > input[type="submit"]')
submit.click()
time.sleep(2)

# Open the PDF for downloading
url = r'http://technical.traders.com/archive/articlefinal.asp?file=\V26\C07\131INTR.pdf'  # raw string keeps the backslashes literal
browser.get(url)
time.sleep(10)

# How do I simulate clicking Save/Download for the PDF here?
thomas2013ch

3 Answers

6

You should not open the file in the browser. Once you have the file URL, create a requests session that carries all of the browser's cookies:

def get_request_session(driver):
    import requests
    session = requests.Session()
    for cookie in driver.get_cookies():
        session.cookies.set(cookie['name'], cookie['value'])

    return session

Once you have the session, you can use it to download the file:

url = r'http://technical.traders.com/archive/articlefinal.asp?file=\V26\C07\131INTR.pdf'
session = get_request_session(browser)  # `browser` is the logged-in WebDriver from the question
r = session.get(url, stream=True)
chunk_size = 2000
# Stream the response to disk in chunks instead of loading it all into memory
with open('/tmp/mypdf.pdf', 'wb') as file:
    for chunk in r.iter_content(chunk_size):
        file.write(chunk)
Tarun Lalwani
  • I've appended your code to mine after the # How to ... comment. A PDF file is indeed downloaded, but it is not a valid PDF: it cannot be opened, and it is much smaller than the real file. I have posted my whole code again below. Could you please have a look? – thomas2013ch Sep 04 '17 at 18:52
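
A download that is much smaller than expected is often an HTML login or error page saved under a .pdf name rather than the PDF itself. One way to check, as a minimal sketch: real PDF files begin with the bytes %PDF-, so looking at the start of the response tells the two cases apart. The helper name looks_like_pdf is hypothetical and not part of any answer here.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data starts with the PDF magic bytes."""
    # Ignore leading whitespace that some servers send before the header.
    return data.lstrip()[:5] == b"%PDF-"
```

For example, calling looks_like_pdf(r.content) on the requests response before writing it to disk would show whether the server returned the document or a page asking to log in.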
1

Apart from Tarun's solution, you can also download the file through JavaScript and store it as a blob. Then you can extract the data into Python via Selenium's execute_script, as shown in this answer.

In your case:

import json

url = r'http://technical.traders.com/archive/articlefinal.asp?file=\V26\C07\131INTR.pdf'
browser.execute_script("""
    window.file_contents = null;
    var xhr = new XMLHttpRequest();
    xhr.responseType = 'blob';
    xhr.onload = function() {
        var reader  = new FileReader();
        reader.onloadend = function() {
            window.file_contents = reader.result;
        };
        reader.readAsDataURL(xhr.response);
    };
    xhr.open('GET', %(download_url)s);
    xhr.send();
""".replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') % {
    'download_url': json.dumps(url),
})

Now the data is stored on the window object as a base64 data URL, so you can easily extract it into Python:

import base64

time.sleep(3)  # give the request time to finish; increase this if file_contents is still null
downloaded_file = browser.execute_script("return (window.file_contents !== null ? window.file_contents.split(',')[1] : null);")
with open('/Users/Chetan/Desktop/dummy.pdf', 'wb') as f:
    f.write(base64.b64decode(downloaded_file))
TheChetan
  • I appended your code to mine after the # How to... comment. But when I run the program I get this error: TypeError: argument should be a bytes-like object or ASCII string, not 'NoneType'. I have posted my code again below. Could you have a look? – thomas2013ch Sep 04 '17 at 19:04
  • Try adding a wait before the second part. I think this is happening because you are trying to read the variable before the onload function has completed. – TheChetan Sep 05 '17 at 01:55
  • Hi TheChetan, I set a longer pause and indeed, the PDF is downloaded. Thanks a lot! – thomas2013ch Sep 05 '17 at 02:48
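
A fixed pause works, but it has to be guessed for each file size. As a rough sketch of an alternative, the Python side can poll window.file_contents until the onload handler has filled it, and give up after a timeout. The function name wait_for_file_contents and its timeout and poll_interval parameters are hypothetical; the only driver method assumed is execute_script, and the data URL format is the one produced by readAsDataURL in the script above.

```python
import base64
import time


def wait_for_file_contents(driver, timeout=30.0, poll_interval=0.5):
    """Poll window.file_contents until it is set, then return the decoded bytes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # null or undefined in the page comes back to Python as None
        data_url = driver.execute_script("return window.file_contents;")
        if data_url:
            # A data URL has the form "data:<mime>;base64,<payload>"
            payload = data_url.split(",", 1)[1]
            return base64.b64decode(payload)
        time.sleep(poll_interval)
    raise TimeoutError("window.file_contents was not set within %s seconds" % timeout)
```

The returned bytes can then be written to a file opened in 'wb' mode, replacing the time.sleep(3) and the manual split in the second snippet.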
0

Try

  from urllib.request import urlretrieve  # Python 3; in Python 2 this was urllib.urlretrieve
  file_path = "<FILE PATH TO SAVE>"
  urlretrieve("<pdf_link>", file_path)
Anurag Meena