2

What I'm trying to do: I want to scrape a web page to get the amount of a financial transaction from a PDF file that is loaded with javascript from a website. Example website: http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg=

When I click the 'View Document' button, the PDF file loads into my browser's window (I'm using Google Chrome). I can right-click on the PDF and save it to my computer, but I want to automate that process by either having Selenium (or similar package) download that file and then process it for OCR.

If I can get it saved, I will be able to do the OCR part (I hope). I just can't get the file saved.

From here, I found and modified this code:

def download_pdf(lnk):

    from selenium import webdriver
    from time import sleep

    options = webdriver.ChromeOptions()

    download_folder = "C:\\Users\\rickc\\Documents\\Scraper2\\screenshots\\"

    profile = {"plugins.plugins_list": [{"enabled": False,
                                         "name": "Chrome PDF Viewer"}],
               "download.default_directory": download_folder,
               "download.extensions_to_open": ""}

    options.add_experimental_option("prefs", profile)

    print("Downloading file from link: {}".format(lnk))

    driver = webdriver.Chrome(chrome_options = options)
    driver.get(lnk)

    filename = lnk.split("/")[3].split(".aspx")[0]+".pdf"
    print("File: {}".format(filename))

    print("Status: Download Complete.")
    print("Folder: {}".format(download_folder))

    driver.close()

download_pdf('http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9fVs5YdPg=')

But it isn't working. My old college professor once said, "If you've spent more than two hours on the problem and haven't made headway, it's time to look for help elsewhere." So I'm looking for help.

Other info: The link above will take you to a web page, but you can't access the PDF document until you click on the 'View Document' button. I've tried using Selenium's webdriver.find_element_by_ID('btnDocument').click() to make things happen, and it just loads the page but doesn't do anything with it.

Rick Colgan
  • 100
  • 1
  • 1
  • 9
  • It's worse than you think. That PDF is essentially a container for an image. To get any meaningful information from it, you're going to have to OCR it. – Robert Harvey Feb 20 '19 at 19:29
  • How to download a pdf file you can find [here](https://stackoverflow.com/questions/50677712/how-to-download-pdf-files-using-selenium-in-python) – Sers Feb 20 '19 at 19:38
  • @Sers -- based on the link you provided, I modified my code to look like this: `profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}], "plugins.always_open_pdf_externally": True, "download.default_directory": download_folder, "download.extensions_to_open": ""}` but that's not doing the trick either. Am I just missing something? – Rick Colgan Feb 20 '19 at 20:07

1 Answers1

3

You can download pdf using requests and BeautifulSoup libraries. In code below replace /Users/../aaa.pdf with full path where document will be downloaded:

import requests
from bs4 import BeautifulSoup

url = 'http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg='

response = requests.post(url)
page = BeautifulSoup(response.text, "html.parser")

VIEWSTATE = page.select_one("#__VIEWSTATE").attrs["value"]
VIEWSTATEGENERATOR = page.select_one("#__VIEWSTATEGENERATOR").attrs["value"]
EVENTVALIDATION = page.select_one("#__EVENTVALIDATION").attrs["value"]
btnDocument = page.select_one("[name=btnDocument]").attrs["value"]

data = {
  '__VIEWSTATE': VIEWSTATE,
  '__VIEWSTATEGENERATOR': VIEWSTATEGENERATOR,
  '__EVENTVALIDATION': EVENTVALIDATION,
  'btnDocument': btnDocument
}
response = requests.post(url, data=data)
with open('/Users/../aaa.pdf', 'wb') as f:
    f.write(response.content)
Sers
  • 12,047
  • 2
  • 12
  • 31
  • 1
    That was the magic! Thank you @Sers ! I'm not sure I fully understand the code you have there, but it works for me so that I can open the PDF now and use another module to scan for the dollar sign and get the amount I need. – Rick Colgan Feb 20 '19 at 21:54