PhantomJS (Selenium) Cannot Load PDFs from direct urls

Question

I was trying to save a PDF from a link via PhantomJS (Selenium). I referred to this code that turns webpages into PDFs, and it worked just fine when I ran exactly the same code.

So, I have this pdf I wanted to save from a direct url and I tried that script... it didn't work. It just saves a PDF with 1 white page. That's all...

My Code :

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


def execute(script, args):
    driver.execute('executePhantomScript', {'script': script, 'args': args})

driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')

try:
    WebDriverWait(driver, 40).until(EC.presence_of_element_located((By.ID, 'plugin')))
except TimeoutException:
    print("I waited for far too long and still couldn't find the view.")

# set page format
# inside the execution script, webpage is "this"
pageFormat = '''this.paperSize = {format: "A4", orientation: "portrait" };'''
execute(pageFormat, [])

# render current page
render = '''this.render("test2.pdf")'''
execute(render, [])

I'm not sure what's happening or why. I need some assistance.

EDIT: This is just the test PDF that I was trying to get via Selenium. There are some other PDFs which I need to get and that website is checking god-knows-what to decide whether it's a human or a bot. So, Selenium is the only way.

EDIT 2 : So, here's the website I was practicing on : http://services.ecourts.gov.in/ecourtindia/cases/case_no.php?state_cd=26&dist_cd=8&appFlag=web

Select "Cr Rev - Criminal Revision" from "Case Type" drop down and input any number in case number and year. Click on "Go".

This will show a little table, click on "view" and it should show a table on full page.

Scroll down to the "orders" table and you should see "Copy of order". That's the PDF I'm trying to get. I have tried requests as well and it did not work.

There are some libraries where you can control the mouse and click with python. You may want to check out some of them. You could do a secondary click to bring up the option "save as" and from there save your file. — whackamadoodle3000, Jul 06 '17 at 19:25
Could you please link some of the libraries that do what you mentioned? — Xonshiz, Jul 10 '17 at 12:40
Sure. There is pyautogui http://pyautogui.readthedocs.io/en/latest/. You can use pip to install it. It can move the mouse, click with the mouse, and press keys. You can click on coordinates on your screen or search for an image that you saved and click on it. — whackamadoodle3000, Jul 10 '17 at 23:39
The link to that documentation: http://pyautogui.readthedocs.io/en/latest/cheatsheet.html#screenshot-functions — whackamadoodle3000, Jul 10 '17 at 23:45
Also, http://services.ecourts.gov.in/ecourtindia/cases/case_no.php?state_cd=26&dist_cd=8&appFlag=web isn't a pdf — whackamadoodle3000, Jul 11 '17 at 16:24
Use `defaults write com.apple.screencapture type pdf` command in terminal to make screenshots pdfs (if you use mac) — whackamadoodle3000, Jul 11 '17 at 21:39
@Xonshiz Please note that PhantomJS doesn't support downloading files. Let me know if you can use another browser like Chrome; I have a solution for you. — Buaban, Jul 12 '17 at 06:36
hmm.. I can use headless chrome. Let me know if you have something, wouldn't be a waste to learn something new. — Xonshiz, Jul 13 '17 at 06:58
@Xonshiz Chrome headless doesn't support downloading yet. If you are OK with GUI browser, I can give you the answer. — Buaban, Jul 13 '17 at 08:59
@Xonshiz Please check my answer. By the way, it's quite difficult to find the PDF links on http://services.ecourts.gov.in... . You'd better give more details about the values you filled in the form, e.g. year and case number. — Buaban, Jul 13 '17 at 09:35

score 0 · Answer 1 · answered Jul 06 '17 at 13:11

If you're just looking at downloading PDFs which aren't protected behind some javascript or stuff (essentially straightforward stuff), I suggest using the requests library instead.

import requests
url ='http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf'
r = requests.get(url)

with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
    f.write(r.content)

# If large file
with requests.get(url, stream=True) as r:
    with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
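
One thing worth checking with this approach: a site that rejects the request may answer with an HTML error or login page, which then gets saved under a .pdf name. A minimal sketch of a sanity check, assuming the payload is already in memory as bytes (the helper names `looks_like_pdf` and `save_if_pdf` are made up for illustration and are not part of requests):

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF magic header."""
    # PDF files begin with "%PDF-"; leading whitespace is tolerated here.
    return data.lstrip()[:5] == b"%PDF-"


def save_if_pdf(data: bytes, path: str) -> bool:
    """Write data to path only when it looks like a PDF; report success."""
    if not looks_like_pdf(data):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True
```

With the code above this would be called as `save_if_pdf(r.content, 'The_Scarlet_Letter_T.pdf')`; a False result suggests the server sent something other than the PDF.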

answered Jul 06 '17 at 13:11

dbokers

That's the problem. If the pdfs weren't behind some crazy human/browser checking and stuff, I would've gone with requests. – Xonshiz Jul 06 '17 at 13:31
If you want to try requests I recommend the dryscrape library for the javascript. – whackamadoodle3000 Jul 06 '17 at 19:47
You could also try to reverse engineer the site - perhaps it's not as hard as you think to grab it. Often you can replicate the calls the site makes behind the scenes with requests. You just need to use, for instance, Chrome's inspector tool, Firebug or something similar. What's the name of the site? Let's have a look? – jlaur Jul 06 '17 at 21:32
I tried to do so. I passed all the cookies and headers, inspected the website traffic, and couldn't really figure out where they are checking for human interaction. – Xonshiz Jul 07 '17 at 07:35
@jlaur I have updated my question with the link. Could you check now? – Xonshiz Jul 08 '17 at 15:15
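
For anyone who wants to try the "replicate the calls" idea from the comments, here is a rough sketch of reusing a browser session's cookies and user agent for a plain HTTP download. It uses only the standard library; `cookie_header` and `build_pdf_request` are hypothetical helper names, and the cookie dicts are assumed to have the shape returned by Selenium's `driver.get_cookies()`. As noted above, this may still fail if the site checks something other than cookies and headers.

```python
import urllib.request


def cookie_header(cookies):
    """Join Selenium-style cookie dicts into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def build_pdf_request(url, cookies, user_agent):
    """Build a request that carries the browser session's cookies and UA."""
    headers = {"Cookie": cookie_header(cookies), "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


# Hypothetical usage with a live Selenium session (not executed here):
#   cookies = driver.get_cookies()
#   ua = driver.execute_script("return navigator.userAgent")
#   req = build_pdf_request(pdf_url, cookies, ua)
#   with urllib.request.urlopen(req) as resp, open("order.pdf", "wb") as f:
#       f.write(resp.read())
```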

whackamadoodle3000 · Answer 2 · 2017-07-06T19:44:04.287

I recommend you look at the pdfkit library.

import pdfkit
pdfkit.from_url('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf', 'out.pdf')

It makes downloading pdfs very simple with python. You will also need to download this for the library to work.

You could also try the code from this link shown below

#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.support.ui import WebDriverWait

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')
    button = browser.find_element_by_name('button')
    button.click()
    # wait for the page to load
    WebDriverWait(browser, timeout=10).until(
        lambda x: x.find_element_by_id('someId_that_must_be_on_new_page'))
    # store it to string variable
    page_source = browser.page_source
print(page_source)

which you will need to edit to make work for your pdf.

edited Jul 06 '17 at 19:44

answered Jul 06 '17 at 19:29

whackamadoodle3000

The second one isn't working... it just prints ``. – Xonshiz Jul 07 '17 at 07:34
I've updated the question with the actual link. Could you take a look at that? – Xonshiz Jul 08 '17 at 15:15
I'll take a look – whackamadoodle3000 Jul 09 '17 at 00:19
"Cr Rev - Criminal Revision" wasn't in the dropdown when I went to the website. The fact that they are using a captcha means that they don't want bots getting into their website – whackamadoodle3000 Jul 09 '17 at 00:23
You should've mentioned the captcha thing. That makes a difference. To pass this - unless your OpenCV image enhancing skills are awesome so you can get tesseract to pull the text correctly - you need to use a service like 9kw, 2captcha, deathbycaptcha or similar to automate the entire process. But as mentioned above, this site clearly does not want bots inside... – jlaur Jul 09 '17 at 07:14
captcha isn't the problem. I am filling that in manually. I'm just practicing scraping different websites via selenium. This was one where I got stuck. – Xonshiz Jul 09 '17 at 12:18
Check this out: https://stackoverflow.com/questions/16927090/python-selenium-phantomjs-render-to-pdf?rq=1 – whackamadoodle3000 Jul 11 '17 at 00:51
Check my code please, that's the first thing I tried. Well, I guess I'll give up on this for a while and try again later with a fresh mindset. Thanks for your input! – Xonshiz Jul 11 '17 at 16:41
I tried your code and it didn't get the page correctly. When I got the page source, it came up with `` just like you said. – whackamadoodle3000 Jul 11 '17 at 16:47
I tried it with the chromedriver and the html loaded. – whackamadoodle3000 Jul 11 '17 at 16:55

score 0 · Accepted Answer · answered Jul 13 '17 at 09:30

Currently, PhantomJS and headless Chrome don't support downloading files. If you are OK with the regular Chrome browser, please see my example below. It finds the a (anchor) elements and adds a download attribute to each one. Finally, it clicks each link to download the file to the default Downloads folder.

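import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.planetpublish.com/free-ebooks/93/heart-of-darkness/')
# Collect the anchor elements that link to the PDFs
pdfLinks = driver.find_elements_by_css_selector(".entry-content ul > li > a")
for pdfLink in pdfLinks:
    # Add a download attribute so Chrome saves the file instead of opening it
    script = "arguments[0].setAttribute('download',arguments[1]);"
    driver.execute_script(script, pdfLink, pdfLink.text)
    time.sleep(1)
    pdfLink.click()
    # Crude pause to give the download time to proceed
    time.sleep(3)

driver.quit()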
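
A possible refinement, sketched under some assumptions: Chrome can be told through its preferences to save PDFs to a chosen folder instead of opening them in the built-in viewer, and the fixed sleeps can be replaced by polling that folder until Chrome's partial `.crdownload` files are gone. The helper names `pdf_download_prefs` and `wait_for_downloads` are made up for this sketch, and the polling assumes the folder is dedicated to these downloads and starts out empty.

```python
import os
import time


def pdf_download_prefs(download_dir):
    """Chrome prefs that save PDFs to download_dir instead of previewing them."""
    return {
        "download.default_directory": os.path.abspath(download_dir),
        "download.prompt_for_download": False,
        "plugins.always_open_pdf_externally": True,
    }


def wait_for_downloads(download_dir, timeout=30.0, poll=0.5):
    """Return True once files exist and none is a partial .crdownload file."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(download_dir)
        if names and not any(n.endswith(".crdownload") for n in names):
            return True
        time.sleep(poll)
    return False


# Hypothetical wiring into Selenium (not executed here):
#   options = webdriver.ChromeOptions()
#   options.add_experimental_option("prefs", pdf_download_prefs("downloads"))
#   driver = webdriver.Chrome(options=options)  # older releases: chrome_options=
#   ... click the link ...
#   wait_for_downloads("downloads")
```

Polling like this only tells you Chrome has finished writing; it does not confirm that what was saved is the PDF you expected.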
