
I'm trying to scrape data about my band's upcoming shows from our agent's web service (such as venue capacity, venue address, set length, set start time ...).

With Python 3.6 and Selenium I've successfully logged in to the site, scraped a bunch of data from the main page, and opened the deal sheet, which is a PDF-like ASPX page. From there I'm unable to scrape the deal sheet. I've successfully switched the Selenium driver to the deal sheet. But when I inspect that page, none of the content is there, just a list of JavaScript scripts.

I tried...

innerHTML = driver.execute_script("return document.body.innerHTML") 

...but this yields the same list of scripts rather than the PDF content I can see in the browser.

I've tried the solution suggested here: Python scraping pdf from URL

But the HTML that solution returns is for the login page, not the deal sheet. My problem is different because the PDF is protected by a password.

Rivers Cuomo

2 Answers


Selenium's Python bindings can't read the contents of a PDF file, so the solution would be:

  1. Download the file from the web page using the requests library. Since you need to be logged in, you will probably have to fetch the cookies from the browser session with driver.get_cookies() and add them to the request that downloads the PDF file.
  2. Once you have downloaded the file, you can read its content using, for instance, PyPDF2.
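A minimal sketch of step 1, assuming driver is the already logged-in Selenium driver and that the site authenticates by cookies alone (some sites also check headers such as User-Agent, so this may not be enough). The names cookies_for_requests and download_pdf are made up for illustration:

```python
def cookies_for_requests(selenium_cookies):
    """Map the list of cookie dicts from driver.get_cookies() to {name: value}."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def download_pdf(driver, pdf_url, out_path):
    """Download pdf_url with the browser session's cookies (hypothetical helper)."""
    import requests  # third-party; imported here so the helper above stays dependency-free

    cookies = cookies_for_requests(driver.get_cookies())
    response = requests.get(pdf_url, cookies=cookies)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    with open(out_path, "wb") as pdf_file:
        pdf_file.write(response.content)
```

This keeps the login in Selenium and only hands the session cookies over to requests for the download.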
Dmitri T

This 3-part solution works for me:

Part 1 (Get the URL for the password protected PDF)

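from time import sleep
from selenium.webdriver.common.by import By

# with selenium (driver is the logged-in driver from before)
# find_element(By..., ...) replaces find_element_by_*, which recent Selenium 4 releases removed
driver.find_element(By.XPATH, 'xpath To The PDF Link').click()

# wait for the new window to load
sleep(6)

# switch to the new window that just popped up
driver.switch_to.window(driver.window_handles[1])

# get the URL to the PDF
plugin = driver.find_element(By.CSS_SELECTOR, "#plugin")
url = plugin.get_attribute("src")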

The element with the url might be different on your page. Michael Kennedy also suggested #embed and #content.

Part 2 (Create a persistent session with Python requests, as described here: How to "log in" to a website using Python's Requests module?, and download the PDF)

import requests

# Fill in your details here to be posted to the login form.
# Your parameter names are probably different. You can find them by inspecting the login page.
# username, password and logonURL are placeholders for your own values.
payload = {
    'logOnCode': username,
    'passWord': password
}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as session:
    session.post(logonURL, data=payload)

    # An authorized request.
    f = session.get(url) # this is the protected url

    # Write the PDF bytes; 'with' closes the file afterwards.
    with open('c:/yourFilename.pdf', 'wb') as pdf_file:
        pdf_file.write(f.content)
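If the login did not take, the request usually comes back with the HTML of the login page instead of the PDF (which is what happened to me with the other solution). A quick sanity check, as a sketch: PDF files begin with the bytes %PDF-, so you can test the response before saving it. The name looks_like_pdf is just for illustration:

```python
def looks_like_pdf(data):
    """Return True if the bytes start with the PDF header '%PDF-'."""
    return data.startswith(b"%PDF-")
```

For example, calling looks_like_pdf(f.content) before writing the file tells you whether you got the deal sheet or the login page.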

Part 3 (Scrape the PDF with PyPDF2 as suggested by Dmitri T)
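A rough sketch of Part 3, assuming a PyPDF2 version that provides PdfReader (older releases use PdfFileReader, getPage and extractText instead). The "Label: value" layout that find_field expects is purely an assumption about how a deal sheet might look; both function names are hypothetical, and your PDF will probably need its own parsing:

```python
import re


def pdf_text(path):
    """Return the text of every page in the PDF, joined by newlines."""
    from PyPDF2 import PdfReader  # third-party; assumes a release that has PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def find_field(text, label):
    """Return what follows 'label:' on its line, or None if the label is absent."""
    pattern = r"^[ \t]*" + re.escape(label) + r"[ \t]*:[ \t]*(.+)$"
    match = re.search(pattern, text, re.MULTILINE | re.IGNORECASE)
    return match.group(1).strip() if match else None
```

With that, something like find_field(pdf_text('c:/yourFilename.pdf'), 'Capacity') would pull out one value, provided the extracted text really contains a line in that form.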

Rivers Cuomo