
I need to download a set of individual PDF files from a webpage. They are published by the government (the Ministry of Education in Turkey), so downloading them is entirely legal.

However, my Selenium browser only displays the PDF file. How can I download it and name it as I wish?

(This code is also from the web.)

# Import your newly installed selenium package
from selenium import webdriver
from bs4 import BeautifulSoup


# Now create an 'instance' of your driver
# This path should be to wherever you downloaded the driver
driver = webdriver.Chrome(executable_path="/Users/ugur/Downloads/chromedriver")
# A new Chrome (or other browser) window should open up
download_dir = "/Users/ugur/Downloads/" # for linux/*nix, download_dir="/usr/Public"
options = webdriver.ChromeOptions()

profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}], # Disable Chrome's PDF Viewer
               "download.default_directory": download_dir , "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)



# Now just tell it wherever you want it to go
driver.get("https://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid=5&ders=29")
driver.find_element_by_id("ContentPlaceHolder1_dtYillikPlanlar_lnkIndir_2").click()
driver.get("https://odsgm.meb.gov.tr/kurslar/PDFFile.aspx?name=kazanimtestleri.pdf")

Thanks in advance

Extra information:

I had Python 2 code that used to do this perfectly. But it now creates empty files, and I couldn't convert it to Python 3. Maybe this helps (no offense, but I never liked Selenium).

import urllib
import urllib2
from bs4 import BeautifulSoup
import os


sinifId=5
maxOrd = 1
fileNames=[]
directory = '/Users/ugur/Downloads/Hasan'
print 'List of current files in directory '+ directory+'\n---------------------------------\n\n'
for current_file in os.listdir(directory):
    if (current_file.find('pdf')>-1 and current_file.find(' ')>-1):
        print current_file
        order = int(current_file.split(' ',1)[0])
        if order>maxOrd: maxOrd=order
        fileNames.append(current_file.split(' ',2)[1])

print '\n\nStarting download \n---------------------------------\n'
ctA=int(maxOrd+1)
for ders in [29]:
    urlSinif='http://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid='+str(sinifId)+'&ders='+str(ders)

    page = urllib2.urlopen(urlSinif)
    soup = BeautifulSoup(page,"lxml")
    st = soup.prettify()
    count=st.count('ctl00')-1
    dersAdi = soup.find('a', href='/kurslar/CevapAnahtarlari.aspx?sinifid='+str(sinifId)+'&ders='+str(ders)).getText().strip()

    for testNo in range(count):

        if(str(sinifId)+str(ders)+str(testNo+1) in fileNames):
            print str(ctA)+' '+str(sinifId)+str(ders)+str(testNo+1)+' '+dersAdi+str(testNo+1)+'.pdf'+' skipped'    
        else:

            annex=""
            if(testNo%2==1): annex="2"

            eiha_url = u'http://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid='+str(sinifId)+'&ders='+str(ders)
            data = ('__EVENTTARGET','ctl00$ContentPlaceHolder1$dtYillikPlanlar$ctl'+format(testNo, '02')+'$lnkIndir'+annex), ('__EVENTARGUMENT', '39')

            print 'ctl00$ContentPlaceHolder1$dtYillikPlanlar$ctl'+format(testNo, '02')+'$lnkIndir'+annex

            new_data = urllib.urlencode(data)
            response = urllib2.urlopen(eiha_url, new_data)


            urllib.urlretrieve (str(response.url), directory+'/{0:0>3}'.format(ctA)+' '+str(sinifId)+str(ders)+str(testNo+1)+' '+dersAdi+str(testNo+1)+'.pdf')
            print str(ctA)+' '+str(sinifId)+str(ders)+str(testNo+1)+' '+dersAdi+str(testNo+1)+'.pdf'+' downloaded'
            ctA=ctA+1
Uğur Dinç
    Not all sites use `application/pdf` to send a PDF. You have to actually use your browser inspector to check which content type the server is actually sending in your case. – nosklo Sep 24 '18 at 20:11

3 Answers


Build your options before launching Chrome, then pass them through the chrome_options parameter. In your code the driver is created first, so the preferences are never applied.

download_dir = "/Users/ugur/Downloads/"
options = webdriver.ChromeOptions()

profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
           "download.default_directory": download_dir,
           "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)

driver = webdriver.Chrome(
    executable_path="/Users/ugur/Downloads/chromedriver",
    chrome_options=options
)

To answer your second question:

May I ask how to specify the filename as well?

I found this: Selenium give file name when downloading

What I do is:

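import os
import time

file_name = ''
# Poll until the most recently created file in download_dir is a finished PDF
while not file_name.lower().endswith('.pdf'):
    time.sleep(.25)
    try:
        file_name = max([os.path.join(download_dir, f) for f in os.listdir(download_dir)], key=os.path.getctime)
    except ValueError:
        # download_dir is still empty
        pass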
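
To actually give the file your own name, one option is to note which files exist before the click and then rename whatever new PDF shows up. This is only a sketch: `wait_and_rename`, `existing`, and the timeout values are names I made up for illustration, and it assumes Chrome keeps unfinished downloads under a `.crdownload` name so that a `.pdf` suffix means the file is complete.

```python
import os
import time


def wait_and_rename(download_dir, existing, new_name, timeout=30.0, poll=0.25):
    """Wait for a new finished .pdf in download_dir and rename it to new_name.

    `existing` is the set of file names that were present before the
    download was triggered, so older PDFs are never touched.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # In-progress Chrome downloads end in .crdownload, so anything new
        # that ends in .pdf is treated as finished (an assumption).
        new_pdfs = [
            name for name in os.listdir(download_dir)
            if name not in existing and name.lower().endswith(".pdf")
        ]
        if new_pdfs:
            newest = max(
                (os.path.join(download_dir, name) for name in new_pdfs),
                key=os.path.getmtime,
            )
            target = os.path.join(download_dir, new_name)
            os.replace(newest, target)  # overwrites target if it already exists
            return target
        time.sleep(poll)
    raise TimeoutError("no new PDF appeared in " + download_dir)
```

You would take the snapshot with `existing = set(os.listdir(download_dir))` right before the `.click()` call, and call `wait_and_rename(download_dir, existing, "001 Test.pdf")` right after it.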
Michael Cox

Here is the code sample I used to download a PDF with a specific file name. First, configure the Chrome webdriver with the required options. Then, after clicking the button that opens the PDF popup window, call a function that waits for the download to finish and renames the downloaded file.

import os
import time
import shutil

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# function to wait for download to finish and then rename the latest downloaded file
def wait_for_download_and_rename(newFilename):
    # function to wait for all chrome downloads to finish
    def chrome_downloads(drv):
        if not "chrome://downloads" in drv.current_url: # if 'chrome downloads' is not current tab
            drv.execute_script("window.open('');") # open a new tab
            drv.switch_to.window(drv.window_handles[-1]) # switch to the newly opened (last) tab
            drv.get("chrome://downloads/") # navigate to chrome downloads
        return drv.execute_script("""
            return document.querySelector('downloads-manager')
            .shadowRoot.querySelector('#downloadsList')
            .items.filter(e => e.state === 'COMPLETE')
            .map(e => e.filePath || e.file_path || e.fileUrl || e.file_url);
            """)
    # wait for all the downloads to be completed
    dld_file_paths = WebDriverWait(driver, 120, 1).until(chrome_downloads) # returns list of downloaded file paths
    # Close the current tab (chrome downloads)
    if "chrome://downloads" in driver.current_url:
        driver.close()
    # Switch back to original tab
    driver.switch_to.window(driver.window_handles[0]) 
    # get latest downloaded file name and path
    dlFilename = dld_file_paths[0] # latest downloaded file from the list
    # wait till downloaded file appears in download directory
    time_to_wait = 20 # adjust timeout as per your needs
    time_counter = 0
    while not os.path.isfile(dlFilename):
        time.sleep(1)
        time_counter += 1
        if time_counter > time_to_wait:
            break
    # rename the downloaded file
    shutil.move(dlFilename, os.path.join(download_dir,newFilename))
    return

# specify custom download directory
download_dir = r'c:\Downloads\pdf_reports'

# for configuring chrome pdf viewer for downloading pdf popup reports
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', {
    "download.default_directory": download_dir, # Set own Download path
    "download.prompt_for_download": False, # Do not ask for download at runtime
    "download.directory_upgrade": True, # Also needed to suppress download prompt
    "plugins.plugins_disabled": ["Chrome PDF Viewer"], # Disable the built-in PDF viewer plugin
    "plugins.always_open_pdf_externally": True, # Download PDFs instead of opening them in Chrome
    })

# get webdriver with options for configuring chrome pdf viewer
driver = webdriver.Chrome(options = chrome_options)

# open desired webpage
driver.get('https://mywebsite.com/mywebpage')

# click the button to open pdf popup
driver.find_element('id', 'someid').click() # find_element_by_id was removed in Selenium 4.3

# call the function to wait for download to finish and rename the downloaded file
wait_for_download_and_rename('My file.pdf')

# close the browser windows
driver.quit()

Adjust the timeout (120 seconds) passed to WebDriverWait to suit your needs.

ePandit

A non-Selenium solution: you can do something like this:

import requests
pdf_resp = requests.get("https://odsgm.meb.gov.tr/kurslar/PDFFile.aspx?name=kazanimtestleri.pdf")
with open("save.pdf", "wb") as f:
    f.write(pdf_resp.content)

You might want to check the content type before saving, though, to make sure the response really is a PDF.
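
A rough sketch of such a check follows. Since not every server labels PDFs as `application/pdf`, it also looks at the first bytes of the body, which in a PDF file start with `%PDF-`. The helper names `looks_like_pdf` and `save_pdf` are mine, not part of `requests`.

```python
def looks_like_pdf(content_type, body):
    """Return True if the Content-Type header or the leading bytes say PDF."""
    # Drop parameters such as "; charset=..." before comparing.
    declared = (content_type or "").split(";")[0].strip().lower()
    return declared == "application/pdf" or body.lstrip()[:5] == b"%PDF-"


def save_pdf(url, path):
    """Download url and write it to path, refusing anything that is not a PDF."""
    import requests  # third-party; imported here so the check above needs nothing extra

    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    if not looks_like_pdf(resp.headers.get("Content-Type"), resp.content):
        raise ValueError("response does not look like a PDF: " + url)
    with open(path, "wb") as f:
        f.write(resp.content)
```

With this in place, an HTML error page fails loudly instead of being saved under a `.pdf` name.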

Sven Harris
  • Thanks. However, this file changes with each click. (I am not sure why; there is a dopostback function in the page source.) So we should start from https://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid=5&ders=29, choose each PDF file, and download them :/ – Uğur Dinç Sep 25 '18 at 07:44
  • Yeah, it looks like it's probably session-related rather than a well-behaved static URL. You might be able to mimic the behaviour using a POST request or a requests `Session` (you can try tracking the network traffic in your browser and reverse engineering that) – Sven Harris Sep 25 '18 at 08:05
  • That sounds like it's above my head :) But thanks a lot for your help. I will keep it in mind for the future. – Uğur Dinç Sep 25 '18 at 14:39
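
For anyone who wants to try the POST / `Session` route from the comments, here is a minimal sketch. It assumes the page is a standard ASP.NET WebForms page, where a postback must send back the hidden form fields (such as `__VIEWSTATE`) together with `__EVENTTARGET`. I have not verified this against the site, and `build_postback` and `download_via_postback` are hypothetical helper names.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, event_argument=""):
    """Build the form data a postback link would submit for this page."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)  # hidden fields the server expects back
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


def download_via_postback(page_url, event_target, path):
    """Fetch the listing page, replay one postback, and save the response."""
    import requests  # third-party; imported lazily

    with requests.Session() as session:  # keeps cookies between the two requests
        page = session.get(page_url, timeout=30)
        page.raise_for_status()
        resp = session.post(
            page_url, data=build_postback(page.text, event_target), timeout=30
        )
        resp.raise_for_status()
        with open(path, "wb") as f:
            f.write(resp.content)
```

The event target would be a control name like the ones the Python 2 script in the question builds, for example `ctl00$ContentPlaceHolder1$dtYillikPlanlar$ctl00$lnkIndir`. Whether the server answers with the PDF directly or via a redirect is something to confirm in the browser's network tab.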