How can I automate HTML-to-PDF conversions?

Question

I've been using htmldoc for a while, but I've run into some fairly serious limitations. I need the end solution to work on a Linux box. I'll be calling this library/utility/application from a Perl app, so any Perl interfaces would be a bonus.

score 74 · Answer 1 · edited Dec 11 '15 at 03:38

Sorry to unearth this old post, but it came up first in my search for the best HTML-to-PDF conversion tool. On Linux, wkhtmltopdf is very good (it honors CSS, among other things) and is GPL-licensed.

edited Dec 11 '15 at 03:38

jamis

answered May 07 '10 at 09:29

Alexandre

To support your point: 1) it works like a charm; 2) it uses the WebKit rendering engine and Qt, which means it can benefit from updates. Though the last RC was released in Feb 2011. – kommradHomer Apr 11 '13 at 13:16
To update on @kommradHomer's comment, the project is still active; the latest stable version was released just last month. It is also available in Ubuntu official repositories, but at the time of writing a few versions behind. – Arild Mar 01 '14 at 12:22
phantomjs is another possibility, also based on webkit, if you want to fetch remote pages and convert to pdf. It can do many other things too, such as scraping using javascript and the DOM. – Sam Watkins May 20 '14 at 04:15
`wkhtmltopdf` is awesome, but it doesn't support flexbox styling. – phil294 Mar 31 '17 at 19:54
It fails on some platforms with this error: "Could not connect to display" – ierdna May 28 '17 at 16:46
Like the docs say, you need the static version with patched Qt to run it without an X server. – Lev Levitsky Jul 06 '17 at 21:02
You can use it with a virtual X server; see [this question](https://stackoverflow.com/questions/9604625/wkhtmltopdf-cannot-connect-to-x-server) – Nathan Chappell Nov 16 '20 at 11:54
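
The virtual X server workaround mentioned in the comments can be scripted. This is only a minimal sketch in Python: it assumes `xvfb-run` (shipped with Xvfb on most distributions) is installed, and that an unset DISPLAY variable is what triggers the "Could not connect to display" error. The helper name `with_virtual_display` is made up for illustration.

```python
import os


def with_virtual_display(cmd, env=None):
    """Prefix cmd with xvfb-run when no X display is configured.

    Hypothetical helper: it assumes a missing DISPLAY variable is the
    reason the tool cannot connect to a display.
    """
    env = os.environ if env is None else env
    if env.get("DISPLAY"):
        # A display is already available, so run the command unchanged.
        return list(cmd)
    # -a lets xvfb-run pick a free X server number automatically.
    return ["xvfb-run", "-a", *cmd]
```

The returned list can be handed to subprocess.run(..., check=True), for example with_virtual_display(["wkhtmltopdf", "input.html", "output.pdf"]).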

andrew-e · Answer 2 · 2017-11-05T23:56:01.860

23

WeasyPrint produces nice PDFs with selectable text and hyperlinks.

weasyprint input.html output.pdf

If you use wkhtmltopdf instead, try the following options:

wkhtmltopdf --margin-bottom 20mm --margin-top 20mm --minimum-font-size 16 ...
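
Since the question is about automation, the wkhtmltopdf call above could be wrapped in a small script. This is a sketch in Python rather than Perl, under stated assumptions: the function names are invented, and the input and output paths are placeholders because the original command elides them.

```python
import subprocess


def wkhtmltopdf_command(source, target, margin_mm=20, min_font_size=16):
    """Build the argument list using the options from the answer above."""
    return [
        "wkhtmltopdf",
        "--margin-bottom", f"{margin_mm}mm",
        "--margin-top", f"{margin_mm}mm",
        "--minimum-font-size", str(min_font_size),
        source,  # placeholder: path or URL of the HTML input
        target,  # placeholder: path of the PDF to write
    ]


def html_to_pdf(source, target):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    # Passing a list (not a shell string) avoids quoting problems
    # with filenames that contain spaces.
    subprocess.run(wkhtmltopdf_command(source, target), check=True)
```

Calling html_to_pdf("input.html", "output.pdf") should then behave like running the command by hand, provided wkhtmltopdf is on the PATH.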

edited Nov 05 '17 at 23:56

answered Apr 28 '16 at 20:24

andrew-e

This should be the selected answer, it is free, open-source, and yes, results are phenomenal! Highly recommended. – FlorianB Jul 09 '16 at 22:49
And to set tiny margins: `weasyprint docs.html docs.pdf -s <(echo '@page { margin: 0.5cm; }')` – Michał Jaroń Dec 22 '20 at 00:15

Roben · Answer 3 · 2020-02-04T08:58:39.643

Update 2019-05

The whole process has thankfully been packed into a docker image by TheCodingMachine: https://github.com/thecodingmachine/gotenberg

This makes maintaining and running Chrome-based PDF generation in production environments smooth and hassle-free.

There is a new headless mode since Chrome 59. As all the other solutions really struggle with newer (or not so new anymore) CSS features like flexbox, this was in my case the only solution to produce a proper PDF output.

To create a PDF from a local HTML file, use the following command: chrome --headless --disable-gpu --print-to-pdf file:///path/to/myfile.html

On macOS, substitute chrome with /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome.

The only downside I have noticed so far is that (currently) you cannot pass the HTML via stdin, but creating a temporary file is not much of an issue.
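
The temporary-file workaround could look roughly like this. It is a sketch in Python, not part of the original answer: the function names are invented, the binary name `chrome` is taken from the command above and may differ on your system, and it relies on the `--print-to-pdf=PATH` form of the flag to choose the output file.

```python
import os
import pathlib
import subprocess
import tempfile


def chrome_pdf_command(html_path, pdf_path, chrome="chrome"):
    """Build the headless Chrome command line for one local HTML file."""
    return [
        chrome,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={pdf_path}",
        # Chrome expects a URL, so turn the path into file:///...
        pathlib.Path(html_path).resolve().as_uri(),
    ]


def html_string_to_pdf(html, pdf_path, chrome="chrome", run=subprocess.run):
    """Write html to a temporary file, print it to pdf_path, then clean up.

    The run parameter exists only so the call can be stubbed out in tests.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(html)
    try:
        run(chrome_pdf_command(tmp.name, pdf_path, chrome), check=True)
    finally:
        # Remove the temporary file whether or not Chrome succeeded.
        os.unlink(tmp.name)
```

Keeping the command builder separate from the function that runs it makes it easy to swap in the macOS binary path or extra flags.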

For more information see https://developers.google.com/web/updates/2017/04/headless-chrome#create_a_pdf_dom

Update: As it turns out, the chrome guys will most likely provide some kind of node module for this task, which would eventually deprecate the headless mode (https://bugs.chromium.org/p/chromium/issues/detail?id=719921).

The best bet would be to use the node based approach using the puppeteer module as documented under https://developers.google.com/web/updates/2017/04/headless-chrome#node and print the page via the Page.printToPDF command, which enables some additional configuration, too.

Of course, you can also connect to the debug console websocket from environments other than Node (e.g. a PHP script).
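
For anyone talking to the debug websocket directly, the exchange is JSON in both directions. The sketch below, in Python, covers only the message framing and is not a complete client: the websocket connection itself is left out, the helper names are invented, and it assumes the DevTools Protocol behaviour that Page.printToPDF returns the PDF base64-encoded in result.data.

```python
import base64
import json


def print_to_pdf_request(message_id=1, landscape=False, print_background=True):
    """Serialize a DevTools Protocol Page.printToPDF call as JSON text."""
    return json.dumps({
        "id": message_id,  # echoed back in the matching reply
        "method": "Page.printToPDF",
        "params": {
            "landscape": landscape,
            "printBackground": print_background,
        },
    })


def pdf_bytes_from_response(raw_response):
    """Decode the base64 'data' field of a Page.printToPDF reply."""
    reply = json.loads(raw_response)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "printToPDF failed"))
    return base64.b64decode(reply["result"]["data"])
```

The bytes returned by pdf_bytes_from_response can be written straight to a file opened in binary mode.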

I’ve tried weasyprint (does not support display: grid, also a bit slow), then wkhtmltopdf (nearly a GB, couldn’t get it to work on Ubuntu/WSL), then puppeteer. Puppeteer works, is fast, allows Selenium-like automation, works on pages which load content via JS, etc. Thanks for your suggestion! – Felipe Cortez Mar 04 '19 at 12:08

Orion Edwards · Accepted Answer · 2020-04-08T01:22:35.287

8

NOTE: This answer is from 2008 and is probably now incorrect; please check the other answers

PrinceXML is the best one I've seen (it parses regular HTML as well as XML/XHTML). How is it the best? Well, it passes the Acid2 test, which I thought was pretty darn impressive.

It is, however, quite expensive.

edited Apr 08 '20 at 01:22

answered Oct 07 '08 at 01:54

Orion Edwards

I've had this same problem. I've recently evaluated Prince XML and can vouch for it being a SERIOUSLY awesome app. The speed and quality of the output is simply unbelievable. – cletus Jan 13 '09 at 01:41
Why pay many thousands of dollars when a free and open-source software that also passes the Acid 2 test is available? http://weasyprint.readthedocs.io WeasyPrint highly recommended. Phenomenal results. – FlorianB Jul 09 '16 at 22:51
Passing the acid2 test was seriously impressive in 2008 when I made that answer. In 2016, or today? Not so much, but I imagine prince has come a long way since then too – Orion Edwards Apr 09 '19 at 11:45
A fair answer at the time. But out of date now. – Daniel Winterstein Apr 07 '20 at 17:59

score 7 · Answer 5 · answered Oct 07 '08 at 01:38

I did a bit of googling for you and came up with two options. There may be more; my strategy was to search for "webkit command-line pdf" and "gecko command-line pdf", looking for command-line programs that embed the two popular open-source rendering engines. Here's what I found:

Firefox command-line printer - outputs to pdf and png

wkpdf - while this is for mac, it's probably pretty portable.

score 3 · Answer 6 · answered Oct 06 '08 at 22:40

I won't claim this is the "best" solution, but it is "a" solution I have used.

HTML Input --> HTML 2 PS --> PS 2 PDF --> PDF Output
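
The two-stage pipeline above can be chained without an intermediate file. This is a sketch in Python under assumptions the answer does not confirm: it guesses that "HTML 2 PS" and "PS 2 PDF" refer to the `html2ps` and `ps2pdf` (Ghostscript) command-line tools, and the function names are invented.

```python
import subprocess


def pipeline_commands(html_path, pdf_path):
    """Return the two stages: HTML -> PostScript, then PostScript -> PDF."""
    return [
        ["html2ps", html_path],     # writes PostScript to stdout
        ["ps2pdf", "-", pdf_path],  # "-" reads PostScript from stdin
    ]


def html_to_pdf_via_postscript(html_path, pdf_path):
    """Pipe the first stage into the second, like a shell pipeline."""
    first, second = pipeline_commands(html_path, pdf_path)
    producer = subprocess.Popen(first, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(second, stdin=producer.stdout)
    # Close our copy of the pipe so the producer notices if the
    # consumer exits early.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    if consumer_rc != 0 or producer_rc != 0:
        raise RuntimeError("HTML -> PS -> PDF pipeline failed")
```

This is the equivalent of running html2ps input.html | ps2pdf - output.pdf in a shell.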

answered Oct 06 '08 at 22:40

Declan Shanaghy

score 3 · Answer 7 · edited Jun 20 '20 at 09:12

You can install the free Calibre and use its ebook-convert command-line utility to convert many HTML documents into a single EPUB or PDF.

https://manual.calibre-ebook.com/generated/en/ebook-convert.html

Idea comes from here

I haven't used it, but this npm module wraps this process up like my following bash script, but probably better ;-)

For me, on my mac, I use the following bash script to convert a local html website to a PDF:

convert_html_to_pdf.sh

function show_help()
{
  ME=$(basename "$0")
  IT=$(cat <<EOF
  
  Converts an html file to pdf, epub, mobi or more if you look!

  usage: input.html output.{pdf|epub|mobi}
  
  e.g. 
  
  $ME index.html output.pdf 

  Note: Requires Calibre be installed. more info here: https://ebooks.stackexchange.com/a/6285
EOF
  )
  echo "$IT"
  exit
}

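# Show the usage text when asked for help.
if [ "$1" == "help" ] || [ "$1" == "--help" ]
then
  show_help
fi

# Quote the paths so filenames containing spaces survive word splitting.
/Applications/calibre.app/Contents/MacOS/ebook-convert "$1" "$2" --max-levels=1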

score 2 · Answer 8 · answered Oct 06 '08 at 22:45

This would be total overkill, but you could download and install Mirth. It is a message routing engine, but it can convert HTML to PDF, so you could set it up to pick up an HTML file in a folder, convert it to PDF, and drop the PDF in the same or another folder. Like I said, it's overkill with a bit of a learning curve, but it's free and written in Java, so you can run it on Linux if you like. All your Perl app would have to do is write the HTML to a file.

answered Oct 06 '08 at 22:45

Jeremy

The `mirth` project seems to be dead, this answer should probably be deleted. – Hashim Aziz Jun 05 '20 at 16:36
It's had several renames since, but certainly not dead. It's now NextGen connect integration engine. – Jeremy Jun 05 '20 at 20:28

score 1 · Answer 9 · answered Mar 29 '15 at 21:58

You should have a look at http://phantomjs.org/

Conversion can be done by a small script rasterize.js and then issuing

phantomjs rasterize.js 'http://en.wikipedia.org/w/index.php?title=Jakarta&printable=yes' jakarta.pdf

answered Mar 29 '15 at 21:58

MrTux

score 1 · Answer 10 · answered Sep 18 '18 at 22:02

I have found Electroshot to be supportive of modern CSS features, particularly layout. This was after struggling with wkhtmltopdf showing its age in not supporting things like CSS3.

From Electroshot's features description:

Electroshot uses Electron, which offers the most recent stable version of Chrome (rather than one from years ago); this means that pages render as they would in a browser...

I've been able to use Bootstrap 4 to design a page, and then use Electroshot to render a PDF very closely resembling the HTML/CSS.

score 1 · Answer 11 · answered Apr 09 '19 at 11:54

An alternative solution that hasn't been answered here is to use an API.

The advantage of an API is that you externalize the resources needed for the job and get an up-to-date service that implements recent features (no need to update code or install bug fixes).

For instance, with PDFShift, you can do that with a single POST request at:

POST https://api.pdfshift.io/v2/convert/

Pass the "source" (either a URL or raw HTML) and you'll get back a PDF in binary. (Disclaimer: I work at PDFShift.)

Here's a code sample in Python:

import requests

response = requests.post(
    'https://api.pdfshift.io/v2/convert/',
    auth=('user_api_key', ''),
    json={"source": "https://en.wikipedia.org/wiki/PDF", "landscape": False, "use_print": False}
)

response.raise_for_status()

with open('wikipedia.pdf', 'wb') as f:
    f.write(response.content)

And your PDF will be located at ./wikipedia.pdf

Daniel Winterstein · Answer 12 · 2020-04-08T19:12:28.713

Here is a nice easy-to-install version of headless Chrome:

https://www.npmjs.com/package/chrome-headless-render-pdf

Unlike "standard" headless chrome, this does not show the annoying auto-generated headers and footers!

Alternatively, unoconv (which uses LibreOffice behind the scenes) can make PDFs from HTML:

unoconv -f pdf mypage.html

You can install it on most Linux flavours via the package manager, e.g. apt-get install unoconv

That's nice and easy for simple files. If you need JavaScript or CSS support, use headless Chrome.

Lance · Answer 13 · 2020-10-08T02:05:58.623

0

I have started to put together a tool to provide a simplified interface to common actions.

You can convert an HTML to a PDF like this:

$ npm install @lancejpollard/act -g
$ act convert tmp/index.html -o tmp/index.pdf -w 2000px -h 3000px

This will create a new PDF for the HTML file.

If nothing else check out the source and see how to write your own script to do this in JavaScript.

edited Oct 08 '20 at 02:05

answered Jul 23 '20 at 02:41

Lance

score 0 · Answer 14 · answered Dec 27 '20 at 07:31

wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely "headless" and do not require a display or display service.

How to use it?

1. Download a precompiled binary or build from source: https://wkhtmltopdf.org/downloads.html or https://github.com/wkhtmltopdf/wkhtmltopdf
2. Create the HTML document that you want to turn into a PDF (or image).
3. Run your HTML document through the tool.

Usage: wkhtmltopdf input.html output_name.pdf

score 0 · Answer 15 · answered Apr 10 '21 at 13:25

I often get very good results when using the ebook-convert command line tool that ships with Calibre.

ebook-convert <input.html> <output.pdf>

Check the numerous options for tweaking in the manual. For example, it is possible to automatically generate a table of contents based on H1/H2/... headlines (or anything using XPath expressions, basically).
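
As one illustration of the table-of-contents tweak, the call could be assembled like this. It is a sketch in Python with assumptions labeled: the function name is invented, and it relies on ebook-convert's `--level1-toc` through `--level3-toc` options taking XPath expressions in Calibre's `h:` namespace prefix (for example `//h:h1`); check the manual for your version.

```python
import subprocess


def ebook_convert_command(source, target, toc_levels=("//h:h1", "//h:h2")):
    """Build an ebook-convert call that derives the TOC from heading tags."""
    cmd = ["ebook-convert", source, target]
    # Only three TOC levels are used here: --level1-toc to --level3-toc.
    for level, xpath in enumerate(toc_levels[:3], start=1):
        cmd += [f"--level{level}-toc", xpath]
    return cmd


def convert(source, target):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    subprocess.run(ebook_convert_command(source, target), check=True)
```

With the defaults, H1 headings become top-level TOC entries and H2 headings are nested beneath them.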

Please note: Calibre focuses on digital documents and I don't know how well ebook-convert works for very complicated HTML. Worth a try though. :-)

score -1 · Answer 16 · answered Mar 04 '14 at 17:30

You might want to check out 'Document Conversion Service' by Peernet (at http://www.peernet.com/conversion-software/batch-document-converter/). This runs as a service on a Windows Desktop or Windows Server machine. It opens HTML documents in a web browser, then prints them through a print driver to create PDF documents, so that the PDF document produced looks exactly as if you had printed the HTML document from the browser.
