22

Is it possible to use PhantomJS's render-to-PDF capability when PhantomJS is driven from Python through Selenium (i.e., to mimic the behaviour of page.render('file.pdf') from inside Python via Selenium)?

I realize that this uses GhostDriver, and GhostDriver doesn't really support much in the way of printing.

If another alternative is possible that isn't Selenium, I'm all ears.

Mosam Mehta
Rejected
  • Have you looked at Pypdf2? http://www.blog.pythonlibrary.org/tag/python-pdf-series/ – Amit Mar 31 '14 at 01:51
  • @Amit: Rather extensively, as I use it all the time. Even Phaseit themselves have said that "PyPDF2 has no knowledge of HTML". It won't reliably render any HTML. – Rejected Mar 31 '14 at 17:46
  • @Rejected do you need the screenshot to occur at an exact state during testing? Or are you just looking to load a page & render to PDF? – Jacob Swartwood Apr 04 '14 at 18:04

4 Answers

11

Here is a solution using Selenium and a special GhostDriver command (it should work with GhostDriver 1.1.0 and PhantomJS 1.9.6 or later; tested with PhantomJS 1.9.8):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Download a webpage as a PDF."""


import json

from selenium import webdriver


def download(driver, target_path):
    """Download the currently displayed page to target_path."""
    def execute(script, args):
        driver.execute('executePhantomScript',
                       {'script': script, 'args': args})

    # hack while the python interface lags: register GhostDriver's
    # custom endpoint with the Selenium command executor
    driver.command_executor._commands['executePhantomScript'] = (
        'POST', '/session/$sessionId/phantom/execute')
    # set page format
    # inside the execution script, webpage is "this"
    page_format = 'this.paperSize = {format: "A4", orientation: "portrait" };'
    execute(page_format, [])

    # render current page; json.dumps quotes and escapes the path so that
    # backslashes or quotes in it cannot break the JavaScript string literal
    render = 'this.render({})'.format(json.dumps(target_path))
    execute(render, [])


if __name__ == '__main__':
    driver = webdriver.PhantomJS('phantomjs')
    try:
        driver.get('http://stackoverflow.com')
        download(driver, "save_me.pdf")
    finally:
        # always shut down the PhantomJS process, even if rendering fails
        driver.quit()

see also my answer to the same question here.
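
If a different paper size or margin is needed, the format script can be generated rather than hard-coded. The following is only a sketch: the helper name paper_size_script is made up for illustration, and it assumes the format, orientation and margin keys that PhantomJS documents for paperSize.

```python
import json


def paper_size_script(fmt='A4', orientation='portrait', margin='1cm'):
    """Build the JavaScript that sets paperSize on the PhantomJS page.

    Hypothetical helper: inside executePhantomScript the page is "this",
    so the returned string can be passed to the execute() helper above.
    """
    settings = {'format': fmt, 'orientation': orientation, 'margin': margin}
    # json.dumps produces a valid JavaScript object literal with quoted keys
    return 'this.paperSize = {};'.format(json.dumps(settings, sort_keys=True))


if __name__ == '__main__':
    print(paper_size_script('Letter', 'landscape', '2cm'))
```

Inside download(), the result would replace the page_format string, for example execute(paper_size_script('Letter', 'landscape'), []).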

Martin Thoma
MTuner
1

You could use selenium.selenium.capture_screenshot('file.png'), but that will give you a screenshot as a PNG, not a PDF. There does not seem to be a way to get a screenshot as a PDF.

Here are the docs for capture_screenshot: http://selenium.googlecode.com/git/docs/api/py/selenium/selenium.selenium.html?highlight=screenshot#selenium.selenium.selenium.capture_screenshot

Ajinkya
KAtkinson
  • PDF is a key factor. I can't drop down to a simple image for a multitude of reasons, such as text searching, forms, embedded media, etc. – Rejected Mar 31 '14 at 17:31
1

Have you tried pdfkit? It can render PDF files from HTML pages.

moodh
  • I have looked into it as well. PDFKit converts HTML -> PDF, but has no further functionality. Content analysis to determine if a page contains the desired content prior to PDF'ing sadly isn't possible. – Rejected Apr 03 '14 at 15:50
  • Yeah, I'm having the same issues with PDFKit. I would want a bit more advanced rendering; using it with a JS framework is quite the hassle. :( – moodh Apr 03 '14 at 16:36
  • "Content analysis to determine if a page contains the desired content" -> Well, can't you do the content analysis yourself and if it matches then you simply send it to render with pdfkit. That's how I would do it. – Jonathan Apr 04 '14 at 08:11
  • @Jonathan: Then I'm not rendering the same page, I'm rendering a second retrieval of it with PDFKit, which re-fetches and re-renders. If I go to PageA and it dynamically generates content, going back to it again means that content can be changed. If I just save the HTML and open it locally to convert, I open myself up to a lot of potential problems (relative links, hotlink protection, etc.) – Rejected Apr 04 '14 at 18:36
  • pdfkit can render from string.. @Rejected – EralpB Aug 18 '19 at 06:13
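
The render-from-string idea in the last comment could be combined with Selenium so that the HTML already loaded in the browser is converted, instead of fetching the page a second time. This is a rough sketch rather than a tested recipe: inject_base_href and save_rendered_page are made-up helper names, pdfkit and wkhtmltopdf must be installed, and adding a base tag is only an attempt at the relative-link problem (it does nothing for hotlink protection).

```python
import html
import re


def inject_base_href(page_html, base_url):
    """Insert a <base> tag so relative links resolve against base_url."""
    tag = '<base href="{}">'.format(html.escape(base_url, quote=True))
    head = re.search(r'<head\b[^>]*>', page_html, flags=re.IGNORECASE)
    if head is None:
        # no <head> element found: fall back to prepending the tag
        return tag + page_html
    return page_html[:head.end()] + tag + page_html[head.end():]


def save_rendered_page(driver, target_path):
    """Convert the DOM currently held by a Selenium driver to a PDF."""
    import pdfkit  # third-party; imported lazily so the helper above works without it

    page_html = inject_base_href(driver.page_source, driver.current_url)
    pdfkit.from_string(page_html, target_path)
```

Any content analysis can run against the driver first, and save_rendered_page(driver, 'page.pdf') is called only when the page qualifies. Resources such as images and stylesheets are still fetched by wkhtmltopdf at conversion time.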
0

@rejected, I know you mentioned not wanting to use subprocesses, but...

You may actually be able to leverage subprocess communication more than you anticipated. Theoretically, you could take Ariya's stdin/stdout example and extend it to be a relatively generic wrapper script. It might first accept a page to load, then listen for (& execute) your test actions on that page. Eventually, you could kick off the .render or even make a generic capture for error handling:

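try {
  // load page & execute stdin commands
} catch (e) {
  // `page` is the webpage object, so the file name is built from a string;
  // `pageName` is a placeholder label for the page under test
  page.render(pageName + '-error-state.pdf');
}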
Jacob Swartwood
  • Executing the code received via stdin would need to be done via `eval`, and from my experiences of trying to do this, it's both insecure and unreliable. Unless I'm mistaken? – Rejected Apr 04 '14 at 20:20
  • While you would want to be cautious with your input (from a reliability perspective), you probably wouldn't have to worry about security since (I'm assuming) you own the process. – Jacob Swartwood Apr 08 '14 at 16:49
  • You could also white-list specific commands, etc for faster throws on unexpected errors. However, the best scenario I would envision is that you extract your tests (or other logic) that might occur before screen capture into a separate .js file and load that into the page (http://phantomjs.org/api/phantom/method/inject-js.html). You could have Python at a maximum pass an arg for the specific file JS to load. – Jacob Swartwood Apr 08 '14 at 16:55
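
The approach from the last comment, with Python passing only the file names and PhantomJS doing the work, might look roughly like the following. It is a sketch under assumptions: render_pdf, build_command and the embedded script are invented for illustration, and a phantomjs binary is assumed to be on the PATH. The script opens the URL, optionally injects a separate JS file with page.injectJs, and renders the PDF.

```python
import os
import subprocess
import tempfile

# PhantomJS-side script; system.args[0] is the script name itself.
RENDER_JS = r"""
var system = require('system');
var page = require('webpage').create();
var url = system.args[1];
var target = system.args[2];
var testFile = system.args[3];  // optional JS file with tests/logic

page.paperSize = {format: 'A4', orientation: 'portrait'};
page.open(url, function (status) {
    if (status !== 'success') {
        phantom.exit(1);
        return;
    }
    if (testFile && !page.injectJs(testFile)) {
        phantom.exit(2);
        return;
    }
    page.render(target);
    phantom.exit(0);
});
"""


def build_command(script_path, url, target_path, inject_path=None,
                  binary='phantomjs'):
    """Assemble the argument list for the PhantomJS subprocess."""
    command = [binary, script_path, url, target_path]
    if inject_path is not None:
        command.append(inject_path)
    return command


def render_pdf(url, target_path, inject_path=None):
    """Run PhantomJS once and return its exit code (0 on success)."""
    with tempfile.NamedTemporaryFile('w', suffix='.js', delete=False) as handle:
        handle.write(RENDER_JS)
    try:
        return subprocess.call(
            build_command(handle.name, url, target_path, inject_path))
    finally:
        os.remove(handle.name)
```

Because only file paths cross the process boundary, nothing has to be passed through eval; the trade-off is that the page logic lives in a JS file rather than in Python.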