Python - Downloading PDFs from ASPX page

Question

I am trying to download the PDF grant reports from a state's department of education public database here: https://mdoe.state.mi.us/cms/grantauditorreport.aspx

I'd like to write a Python script that downloads all reports as PDFs for a particular range of dates. The page's interface has no option to download reports for all recipients (schools) at once, so I'm hoping a Python script could loop through all the available selections and download each report individually.

I am very new to Python. I have tried some of the resources here from people asking similar things, but without success, so I do not have any starter code. If someone could give me a start on this, I would greatly appreciate it.

Thanks!

Adam · Answer 1 · 2021-04-10T17:18:01.450

I would recommend Selenium, which can be used for web scraping in Python.

You will need to install selenium using the instructions provided in the above link. You will also need to install pyautogui (pip install should work).

Note that I ran into many issues getting Selenium to work with Internet Explorer; if you have problems, check out here and here. Because of these issues I had to add a number of capabilities and specify the location of the IEDriver when initializing the Selenium webdriver. You will need to change these to match your system. I had originally hoped to use Chrome or Firefox, but the website being scraped only generated the report in Internet Explorer. As has been noted on other Stack Exchange boards, Selenium executes commands much more slowly in Internet Explorer.

Here is code that works with my system and selenium versions:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
import time

url = "https://mdoe.state.mi.us/cms/grantauditorreport.aspx"

# LOAD WEBPAGE
capabilities = DesiredCapabilities.INTERNETEXPLORER
capabilities['ignoreProtectedModeSettings'] = True
capabilities['ignoreZoomSetting'] = True
capabilities.setdefault("nativeEvents", False)
driver = webdriver.Ie(capabilities=capabilities,executable_path="C:\\Users\\amcmu\\Downloads\\IEDriverServer_Win32_2.42.0\\IEDriverServer.exe") #include path to IEdriver
driver.get(url)


# SEARCH
search_key = 'public school'
search = driver.find_element_by_name("ctl00$cphMain$txtAgency")
search.send_keys(search_key) # this is the search term
driver.find_element_by_css_selector('.ButtonStandard').click()


# DROPDOWN LIST 
list_item = 'Alpena Public Schools'
select = Select(driver.find_element_by_id('ctl00_cphMain_ddlAgency'))
# select by visible text
select.select_by_visible_text(list_item)
# If you want to iterate through every item in the list use select_by_value
#select.select_by_value('1')


# DATES
start='4/2/2018'
end='4/9/2021'
start_date = driver.find_element_by_name("ctl00$cphMain$txtBeginDate")
driver.execute_script("arguments[0].value = ''", start_date)
start_date.send_keys(start)
end_date = driver.find_element_by_name("ctl00$cphMain$txtEndDate")
driver.execute_script("arguments[0].value = ''", end_date)
end_date.send_keys(end)

# PRESS ENTER TO GENERATE PDF
driver.find_element_by_name('ctl00$cphMain$btnSearch').click()
time.sleep(30) # wait while the server generates the file
print('I hope I waited long enough to generate the file.')

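# SAVE FILE
# These keystrokes go to whichever window has focus, so keep the IE window in front
import pyautogui
pyautogui.hotkey('alt','n')  # Alt+N moves focus to IE's notification bar (the open/save prompt)
pyautogui.press('tab')       # Tab to the next button in the prompt
pyautogui.press('enter')     # press it to save the file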


time.sleep(3)
driver.quit()
quit()

While the file is being generated you need to wait for their server to do its thing, and this seemed to take a while (on the order of 20 seconds). I added time.sleep(30) to give it 30 seconds to generate the file. You will need to play with this value, or figure out a way to detect when the file has been generated.
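As a rough sketch of one way to detect that a report has landed on disk, the helper below polls a folder for a new PDF whose size has stopped changing. The function name wait_for_new_pdf, its parameters, and the idea of watching the browser's download folder are all assumptions for illustration, not part of the script above.

```python
import time
from pathlib import Path


def wait_for_new_pdf(download_dir, known_names, timeout=60.0, poll_interval=0.5):
    """Poll download_dir until a PDF not listed in known_names appears and its
    size is the same on two consecutive polls. Returns its Path, or None if
    the timeout passes first. (Hypothetical helper, not part of Selenium.)"""
    download_dir = Path(download_dir)
    deadline = time.monotonic() + timeout
    last_sizes = {}  # path -> size seen on the previous poll
    while True:
        for path in sorted(download_dir.glob("*.pdf")):
            if path.name in known_names:
                continue  # file was already there before this download
            size = path.stat().st_size
            # Treat the file as complete once it is non-empty and stable
            if size > 0 and last_sizes.get(path) == size:
                return path
            last_sizes[path] = size
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

To use it, you would record the file names already in the folder before saving, then call the helper after the save keystrokes instead of a fixed sleep. Note that it only detects the saved file; it cannot tell you when IE's save prompt appears, so the pause before the pyautogui keystrokes would still be needed.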

I am not sure how you want to iterate through the schools (i.e., do you have a list of schools with their exact names?). If you don't have the list of schools, you might want to use something like this pseudocode:

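select = Select(driver.find_element_by_name("ctl00$cphMain$ddlAgency"))
num_options = len(select.options)
for index in range(num_options):  # range(n) already stops at n - 1, so no option is skipped
    # DROPDOWN LIST (re-locate it each pass in case the page reloaded)
    select = Select(driver.find_element_by_name("ctl00$cphMain$ddlAgency"))
    select.select_by_index(index)

    # DATES
    # (same as in the script above)
    # PRESS ENTER TO GENERATE PDF
    # (same as in the script above)
    # SAVE FILE
    # (same as in the script above)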
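If you would rather loop by value (as the select_by_value comment in the script suggests), one option is to read every option's value and visible text up front, so the loop does not depend on element references that may go stale if the page reloads. This is only a sketch: option_pairs is a hypothetical helper, and treating an empty value as a placeholder entry is an assumption about this page's dropdown.

```python
def option_pairs(options, skip_values=("",)):
    """Return (value, visible text) for each dropdown option.

    `options` is any iterable of objects with get_attribute("value") and
    .text, such as the list returned by Select(...).options.
    Values listed in skip_values (assumed placeholders) are left out.
    """
    pairs = []
    for option in options:
        value = option.get_attribute("value")
        if value is None or value in skip_values:
            continue  # skip placeholder entries
        pairs.append((value, option.text.strip()))
    return pairs
```

You would call it once on the dropdown's options, then for each pair re-locate the dropdown and call select_by_value with the stored value. The visible text can double as a label for the saved report.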

I had a question: are you able to download the PDF files? It could be my browser, but when I select search it opens another tab and looks like it is generating the file, but then produces a network error. What browser are you using, and can you provide a sample school and date range that worked for you? — Adam, Apr 09 '21 at 22:41
Same problem with firefox, where it appears to generate a report, asks if you would like to open or save it, then says the download failed. — Adam, Apr 09 '21 at 22:56
Thank you for your responses! It appears like it only wants to work in IE, none of the other browsers. Can something similar to this be deployed with IE? I seem to have gotten the script to open up the page in IE, but none of the select functions or anything seem to work. — Cheems, Apr 09 '21 at 23:32
Oh sorry, I forgot to add the import for the Select functions, which is probably why the select part didn't work. I've updated this answer now: from selenium.webdriver.support.ui import Select — Adam, Apr 10 '21 at 02:09
For anyone having errors with Internet Explorer and Selenium, these are two problems I encountered: [here](https://stackoverflow.com/questions/24925095/selenium-test-with-python-in-internet-explorer) and [here](https://stackoverflow.com/questions/31134408/unable-to-find-element-on-closed-window-on-ie-11-with-selenium). I also had to change the browser zoom to 100% in IE settings. — Adam, Apr 10 '21 at 03:07
