
There are several nice projects that generate PDF from HTML/CSS/JS files:

  1. http://wkhtmltopdf.org/ (open source)
  2. https://code.google.com/p/flying-saucer/ (open source)
  3. http://cssbox.sourceforge.net/ (not necessarily straight pdf generation)
  4. http://phantomjs.org/ (open source allows for pdf output)
  5. http://www.princexml.com/ (commercial, but hands down the best one out there)
  6. https://thepdfapi.com/ (a Chrome modification that produces PDF from HTML)

I want to programmatically control the Chrome or Firefox browser (because both are cross-platform) to make it load a web page, run the scripts, style the page, and generate a PDF file for printing.

But how do I start controlling the browser in an automated way, so that I can run something like:

render-to-pdf file-to-render.html out.pdf

I can easily do this job manually by browsing to the page and printing it to PDF, and I get an accurate, 100% spec-compliant rendering of the HTML/CSS/JS page in a PDF file. Even the URL headers can be omitted from the PDF through the browser's configuration options. But again, how do I start automating this process?

I want to automate, on the server side, opening the browser, navigating to a page, and generating the PDF from the browser-rendered page.

I have done a lot of research; I just don't know how to ask the right question. I want to programmatically control the browser, maybe like Selenium does, but to the point where I can export a web page as PDF (hence using the rendering capabilities of the browser to produce good PDFs).

David Hofmann
  • Have you looked at [ChromeDriver](https://code.google.com/p/selenium/wiki/ChromeDriver)? – Chris Haas Aug 29 '14 at 19:51
  • I can't see how to use selenium to tell the browser to export the current page as pdf – David Hofmann Aug 29 '14 at 20:05
  • You might be able to use a combination of the [Chromium command line args](http://peter.sh/experiments/chromium-command-line-switches/) `--kiosk --kiosk-printing` along with passing the default PDF printer in your [`prefs` capability](https://sites.google.com/a/chromium.org/chromedriver/capabilities#TOC-List-of-recognized-capabilities). I've never tried this but that's where I'd start. – Chris Haas Aug 29 '14 at 20:31
  • I would think you need to do some real research. IMHO a browser was not intended to do this, and you have many hurdles to overcome that you have not thought of (things like possibly running headers/footers, keeping content together over page breaks, differing table headers at page breaks, font handling/special character handling and embedding, understanding that browser dimensions are pixels at 96/inch and many other things are *not*). I could go on, but that is a start for you. – Kevin Brown Aug 29 '14 at 20:42
  • @ChrisHaas, `$ chrome --kiosk --kiosk-printing file.html`, and inside the HTML I call `window.print();`. It does exactly what I want, it's just that it still requires me to hit Enter to save the file... so sad... Thanks though – David Hofmann Aug 29 '14 at 20:46
  • Answers from this similar question could help you: http://stackoverflow.com/questions/18191893/generate-pdf-from-html-in-div-using-javascript – Kingxlayer Aug 29 '14 at 21:18
  • @KevinBrown, He's not talking about the browser, he's talking about the **rendering engine** the open source browsers use. He only wants the rendering engine, not the whole browser. – Pacerier Aug 14 '15 at 10:34
  • I think wkhtmltopdf is the closest to what you want. It is a forked version of WebKit built specifically for PDF generation. Alternatively, if you liked Prince, https://docraptor.com is a commercial SaaS API powered by the Prince engine. – jamespaden Aug 16 '15 at 14:36
  • "phantomjs.org (open source allows for pdf rasterization)". Instead of "rasterization" I would have written "output" since the PDFs do contain vectors for vector elements like text, borders, etc. – Michael Franzl Jan 21 '17 at 19:10

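The command-line route discussed in the comments above stalls on the save dialog. Chrome builds released after this thread added a headless mode with a `--print-to-pdf` switch, which may avoid that step. The following is a minimal Node.js sketch of a `render-to-pdf`-style wrapper under that assumption. The function names and the `chromeBinary` parameter are hypothetical, and the exact switches available depend on the installed Chrome/Chromium version.

```javascript
// Sketch only: assumes a Chrome/Chromium build that supports
// the --headless and --print-to-pdf switches.
const { execFile } = require('child_process');
const path = require('path');
const { pathToFileURL } = require('url');

// Build the argument list for one headless print job.
// Both paths are made absolute so the browser's working directory does not matter.
function buildChromeArgs(inputHtml, outputPdf) {
  return [
    '--headless',
    '--print-to-pdf=' + path.resolve(outputPdf),
    pathToFileURL(path.resolve(inputHtml)).href
  ];
}

// Launch the browser once per document.
// chromeBinary is a hypothetical parameter: the path or name of the
// Chrome/Chromium executable on the server.
function renderToPdf(chromeBinary, inputHtml, outputPdf, callback) {
  execFile(chromeBinary, buildChromeArgs(inputHtml, outputPdf), callback);
}
```

A call such as `renderToPdf('chromium', 'file-to-render.html', 'out.pdf', cb)` would then play the role of the command in the question; the executable name `chromium` is only an example and varies by platform.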
2 Answers


I'm not an expert, but PhantomJS seems to be the right tool for the job. I'm not sure which headless browser it uses underneath, though (I guess it is Chrome/Chromium).

var page = require('webpage').create();

page.open('http://github.com/', function (status) {
    // Stop early if the page could not be loaded.
    if (status !== 'success') {
        console.log('Unable to load the page');
        phantom.exit(1);
        return;
    }

    // Measure the full document size inside the page context.
    var s = page.evaluate(function () {
        var body = document.body,
            html = document.documentElement;

        var height = Math.max(body.scrollHeight, body.offsetHeight,
            html.clientHeight, html.scrollHeight, html.offsetHeight);
        var width = Math.max(body.scrollWidth, body.offsetWidth,
            html.clientWidth, html.scrollWidth, html.offsetWidth);
        return {width: width, height: height};
    });

    console.log(JSON.stringify(s));

    // Make the paper as tall as the document so it fits on a single page.
    page.paperSize = {
        width: '1980px',
        height: s.height + 'px',
        margin: {
            top: '50px',
            left: '20px'
        }
    };

    // The .pdf extension selects PDF output.
    page.render('github.pdf');
    phantom.exit();
});

Hope it helps.
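To get closer to the `render-to-pdf file-to-render.html out.pdf` command from the question, the same idea can be wrapped in a script that reads its input and output from the command line. This is only a sketch: the script name `render-to-pdf.js`, the `parseArgs` helper, and the A4 paper settings are assumptions, and I have not checked how a given PhantomJS version resolves relative local paths passed to `page.open`.

```javascript
// Sketch of a render-to-pdf script for PhantomJS (ES5 syntax only).
// Assumed usage: phantomjs render-to-pdf.js file-to-render.html out.pdf

// Pick input and output from the argument list; args[0] is the script name.
// Returns null when the argument count is wrong.
function parseArgs(args) {
    if (args.length !== 3) {
        return null;
    }
    return { input: args[1], output: args[2] };
}

function main() {
    var system = require('system');
    var opts = parseArgs(system.args);
    if (!opts) {
        console.log('Usage: render-to-pdf.js <input.html> <output.pdf>');
        phantom.exit(1);
        return;
    }

    var page = require('webpage').create();
    // A4 portrait with a 1cm margin is an assumed default, not a requirement.
    page.paperSize = { format: 'A4', orientation: 'portrait', margin: '1cm' };

    page.open(opts.input, function (status) {
        if (status !== 'success') {
            console.log('Unable to load ' + opts.input);
            phantom.exit(1);
            return;
        }
        // The .pdf extension of the output name selects PDF rendering.
        page.render(opts.output);
        phantom.exit();
    });
}

// Only run inside PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    main();
}
```

Keeping `parseArgs` separate from the PhantomJS-specific part makes the argument handling easy to check outside PhantomJS.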

crodas
  • CSS allows for page sizing when printing, so setting the paper size in the code example doesn't help. Besides, CSS print also has page breaks. That being said, I see that PhantomJS uses the WebKit rendering engine; it's not using a supported browser but a fork of WebKit (which is OK for this task anyway). But it still requires a lot of work to make it work like PrinceXML. I guess that is the reason they are not cheap. – David Hofmann Aug 30 '14 at 15:07

Firefox has an API method for that: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/tabs/saveAsPDF

browser.tabs.saveAsPDF({})
  .then((status) => {
    console.log('PDF file status: ' + status);
  });

However, it seems to be available only to browser extensions; it cannot be invoked from a web page.

I'm still looking for a public API for that...
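For anyone going the extension route, here is a slightly fuller sketch of how the page settings might be filled in, for example to drop the URL and title headers mentioned in the question. The helper names are hypothetical, and the `tabs` API is passed in as a parameter so the logic can be tried outside an extension. As far as I can tell the browser still shows a save dialog, so this alone may not give a fully unattended server-side job.

```javascript
// Sketch for a Firefox extension background script.
// Empty header/footer strings are meant to suppress the default
// title, URL, page number and date lines.
function buildPageSettings(fileName) {
  return {
    toFileName: fileName,
    headerLeft: '',
    headerCenter: '',
    headerRight: '',
    footerLeft: '',
    footerCenter: '',
    footerRight: '',
    showBackgroundColors: true,
    showBackgroundImages: true
  };
}

// tabsApi is a hypothetical parameter; inside an extension pass browser.tabs.
// Resolves to true when the status string reports a written file.
function saveActiveTabAsPdf(tabsApi, fileName) {
  return tabsApi.saveAsPDF(buildPageSettings(fileName)).then(function (status) {
    return status === 'saved' || status === 'replaced';
  });
}
```

In a background script this would be called as `saveActiveTabAsPdf(browser.tabs, 'out.pdf')`.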

Guillermo Gutiérrez