Python + Selenium + PhantomJS render to PDF

Question

Is it possible to use PhantomJS's PDF rendering capabilities when PhantomJS is used in combination with Selenium and Python (i.e. mimic the page.render('file.pdf') behaviour inside Python via Selenium)?

I realize that this uses GhostDriver, and GhostDriver doesn't really support much in the way of printing.

If there is an alternative that doesn't involve Selenium, I'm all ears.

Have you looked at PyPDF2? http://www.blog.pythonlibrary.org/tag/python-pdf-series/ – Amit Mar 31 '14 at 01:51
@Amit: Rather extensively, as I use it all the time. Even Phaseit themselves have said that "PyPDF2 has no knowledge of HTML". It won't reliably render any HTML. – Rejected Mar 31 '14 at 17:46
@Rejected: Do you need the screenshot to occur at an exact state during testing? Or are you just looking to load a page & render to PDF? – Jacob Swartwood Apr 04 '14 at 18:04

score 11 · Answer 1 · edited Sep 08 '17 at 11:06

Here is a solution using Selenium and a special command for GhostDriver (it should work with GhostDriver 1.1.0 and PhantomJS 1.9.6 or later; tested with PhantomJS 1.9.8):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Download a webpage as a PDF."""


import json

from selenium import webdriver


def download(driver, target_path):
    """Download the currently displayed page to target_path."""
    def execute(script, args):
        driver.execute('executePhantomScript',
                       {'script': script, 'args': args})

    # hack while the python interface lags: register GhostDriver's
    # PhantomJS-specific command with the remote connection
    driver.command_executor._commands['executePhantomScript'] = (
        'POST', '/session/$sessionId/phantom/execute')
    # set page format
    # inside the execution script, webpage is "this"
    page_format = 'this.paperSize = {format: "A4", orientation: "portrait" };'
    execute(page_format, [])

    # render current page; json.dumps quotes and escapes the path so that
    # quotes or backslashes in it cannot break the script
    render = 'this.render({})'.format(json.dumps(target_path))
    execute(render, [])


if __name__ == '__main__':
    driver = webdriver.PhantomJS('phantomjs')
    try:
        driver.get('http://stackoverflow.com')
        download(driver, "save_me.pdf")
    finally:
        # always shut down the PhantomJS process, even if rendering fails
        driver.quit()

See also my answer to the same question here.
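The paper format in the script above is hard-coded. As a rough sketch, the paperSize script could be built from parameters instead. The helper name build_paper_size_script and its arguments are hypothetical, not part of Selenium or GhostDriver; the format, orientation and margin keys are the ones PhantomJS documents for paperSize.

```python
import json


def build_paper_size_script(paper_format="A4", orientation="portrait",
                            margin=None):
    """Return a PhantomJS script that sets paperSize on the current page.

    Hypothetical helper: inside GhostDriver's phantom/execute command the
    current web page is available as "this".
    """
    paper_size = {"format": paper_format, "orientation": orientation}
    if margin is not None:
        paper_size["margin"] = margin
    # json.dumps produces an object literal that is also valid JavaScript;
    # sort_keys keeps the generated script deterministic.
    return "this.paperSize = {};".format(json.dumps(paper_size, sort_keys=True))
```

The returned string could then be passed to the execute helper from the answer in place of the literal page_format string.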

Is there any way to do this with Chromedriver and Selenium? Thank you! — jim70, Aug 13 '20 at 20:36
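For the ChromeDriver case raised in the comment above, one possible approach (not taken from this thread) is the Chrome DevTools Protocol command Page.printToPDF, which Selenium's Chrome driver exposes through execute_cdp_cmd. A minimal sketch, assuming a Chrome driver that is already on the target page; the function name save_page_as_pdf is hypothetical.

```python
import base64


def save_page_as_pdf(driver, target_path, landscape=False):
    """Write the page currently loaded in a Chrome driver to target_path.

    Assumes `driver` offers execute_cdp_cmd, as Selenium's Chrome driver does.
    """
    # Page.printToPDF returns a dict whose "data" entry is the PDF, base64-encoded.
    result = driver.execute_cdp_cmd(
        "Page.printToPDF",
        {"landscape": landscape, "printBackground": True},
    )
    with open(target_path, "wb") as handle:
        handle.write(base64.b64decode(result["data"]))
```

It would be called as save_page_as_pdf(driver, "save_me.pdf") after driver.get(...). Printing to PDF may require Chrome to run headless, depending on the Chrome version.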

score 1 · Answer 2 · edited Apr 02 '14 at 10:08

You could use selenium.selenium.capture_screenshot('file.png'), but that will give you a screenshot as a PNG, not a PDF. There does not seem to be a way to get a screenshot as a PDF.

Here are the docs for capture_screenshot: http://selenium.googlecode.com/git/docs/api/py/selenium/selenium.selenium.html?highlight=screenshot#selenium.selenium.selenium.capture_screenshot

edited Apr 02 '14 at 10:08

Ajinkya

answered Mar 30 '14 at 22:29

KAtkinson

PDF is a key factor. I can't drop down to a simple image for a multitude of reasons, such as text searching, forms, embedded media, etc. – Rejected Mar 31 '14 at 17:31

score 1 · Answer 3 · answered Apr 02 '14 at 10:13

Have you tried pdfkit? It can render PDF files from HTML pages.

answered Apr 02 '14 at 10:13

moodh

I have looked into it as well. PDFKit converts HTML -> PDF, but has no further functionality. Content analysis to determine if a page contains the desired content prior to PDF'ing sadly isn't possible. – Rejected Apr 03 '14 at 15:50
Yeah, I'm having the same issues with PDFKit. I would want a bit more advanced rendering; using it with a JS framework is quite the hassle.. :( – moodh Apr 03 '14 at 16:36
"Content analysis to determine if a page contains the desired content" -> Well, can't you do the content analysis yourself and if it matches then you simply send it to render with pdfkit. That's how I would do it. – Jonathan Apr 04 '14 at 08:11
@Jonathan: Then I'm not rendering the same page, I'm rendering a second retrieval of it with PDFKit, which re-fetches and re-renders. If I go to PageA and it dynamically generates content, going back to it again means that content may have changed. If I just save the HTML and open it locally to convert, I open myself up to a lot of potential problems (relative links, hotlink protection, etc.) – Rejected Apr 04 '14 at 18:36
pdfkit can render from string.. @Rejected – EralpB Aug 18 '19 at 06:13
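The last comment could be combined with Selenium roughly as follows: take the DOM the driver currently holds and hand it to pdfkit.from_string, so the page itself is not fetched a second time. This is only a sketch. The function name render_current_page and the to_pdf parameter are hypothetical, and the injected <base> element is an assumption aimed at the relative-link problem mentioned above. wkhtmltopdf will still request images and stylesheets itself, so hotlink protection may remain an issue.

```python
from html import escape


def render_current_page(driver, target_path, to_pdf=None):
    """Render the DOM currently held by `driver` to a PDF without re-fetching it."""
    if to_pdf is None:
        # Imported lazily; pdfkit also needs the wkhtmltopdf binary installed.
        import pdfkit
        to_pdf = pdfkit.from_string
    html = driver.page_source
    # A <base> element lets relative links resolve against the original URL.
    # Only a bare "<head>" tag is handled; otherwise the HTML is left unchanged.
    base_tag = '<base href="{}">'.format(escape(driver.current_url, quote=True))
    if "<head>" in html:
        html = html.replace("<head>", "<head>" + base_tag, 1)
    return to_pdf(html, target_path)
```

Any content checks can run against the live driver first, and render_current_page(driver, "page.pdf") is called only if they pass.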

score 0 · Answer 4 · answered Apr 04 '14 at 18:20

@Rejected, I know you mentioned not wanting to use subprocesses, but...

You may actually be able to leverage subprocess communication more than you anticipated. Theoretically, you could take Ariya's stdin/stdout example and extend it to be a relatively generic wrapper script. It might first accept a page to load, then listen for (& execute) your test actions on that page. Eventually, you could kick off the .render or even make a generic capture for error handling:

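// page: the WebPage object from require('webpage').create()
// pageName: a filename-safe label for the page under test (hypothetical)
try {
  // load page & execute stdin commands
} catch (e) {
  page.render(pageName + '-error-state.pdf');
}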

answered Apr 04 '14 at 18:20

Jacob Swartwood

Executing the code received via stdin would need to be done via `eval`, and in my experience of trying to do this, it's both insecure and unreliable. Unless I'm mistaken? – Rejected Apr 04 '14 at 20:20
While you would want to be cautious with your input (from a reliability perspective), you probably wouldn't have to worry about security since (I'm assuming) you own the process. – Jacob Swartwood Apr 08 '14 at 16:49
You could also white-list specific commands, etc. for faster throws on unexpected errors. However, the best scenario I would envision is that you extract your tests (or other logic) that might occur before screen capture into a separate .js file and load that into the page (http://phantomjs.org/api/phantom/method/inject-js.html). At most, Python would pass an argument naming the specific JS file to load. – Jacob Swartwood Apr 08 '14 at 16:55
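On the Python side, the subprocess approach might look like the sketch below. It assumes the rasterize.js example script that ships with PhantomJS (arguments: URL, output file, paper format) is available at the given path; a custom wrapper script that runs test actions before rendering would take its place. The function names are hypothetical.

```python
import subprocess


def build_render_command(url, target_path, paper_format="A4",
                         phantomjs="phantomjs", script="rasterize.js"):
    """Build the argument list for a PhantomJS render-to-PDF run."""
    # The script path is an assumption; point it at wherever rasterize.js
    # (or a custom wrapper script) actually lives.
    return [phantomjs, script, url, target_path, paper_format]


def render_with_phantomjs(url, target_path, timeout=60, **kwargs):
    """Run PhantomJS as a subprocess and wait for the PDF to be written."""
    command = build_render_command(url, target_path, **kwargs)
    # check_call raises CalledProcessError on a non-zero exit status and
    # TimeoutExpired if PhantomJS does not finish in time.
    subprocess.check_call(command, timeout=timeout)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.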
