140

I need to automatically generate a PDF file from an existing (X)HTML document. The input files (reports) use a rather simple, table-based layout, so support for really fancy JavaScript/CSS stuff is probably not needed.

As I am used to working in Java, a solution that can easily be used in a Java project is preferable. It only needs to work on Windows systems, though.

One feasible approach, which does not produce good-quality output (at least out of the box), is to use CSS2XSLFO and Apache FOP to create the PDF files. The problem I encountered was that, while CSS attributes are converted nicely, the table layout is pretty messed up, with text flowing out of the table cells.

I also took a quick look at JRex, a Java API for using the Gecko rendering engine.

Is there maybe a way to grab the rendered page from the Internet Explorer rendering engine and send it to a PDF printer tool automatically? I have no experience in OLE programming on Windows, so I have no clue what's possible and what is not.

Do you have an idea?

Willi Mentzel
panschk
  • 3
    I've recently created a Java library, [docbag](http://docbag.org), that can convert XHTML to PDF documents. The current version is not very advanced, but if your XHTML templates are simple this library may come in handy. – Jakub Torbicki Oct 08 '12 at 14:05
  • I think the way to go is to use the browser's capabilities to do the translation. See http://stackoverflow.com/q/25574082/39998 – David Hofmann Aug 29 '14 at 19:46
  • I am stuck with generating pdf from a html that contains Cyrillic letters. Everything's fine except Cyrillic letters which are omitted. Anyone who got this kinda problem? – Kristijan Iliev Jan 02 '15 at 20:00
  • @krisiliev: I had similar issues, and as far as I can remember, the font used was very important. Most fonts do not cover the full Unicode character range, but the following should: 'font-family: Arial Unicode MS;' (CSS). Also make sure to use the correct encoding (I would advise always using UTF-8). – panschk Jan 28 '15 at 14:23
  • 2
    This link helped me: http://hmkcode.com/itext-html-to-pdf-using-java/ – Mateen Oct 06 '16 at 18:43
  • This question is off-topic at SO, but on-topic in softwarerecs.SE. See [How can I convert HTML with CSS to PDF?](https://softwarerecs.stackexchange.com/q/45903/1834). – Martin Thoma Sep 21 '17 at 15:21
  • @Jakub Torbicki you posted a broken link; it does not work for me! – Menai Ala Eddine - Aladdin Mar 09 '18 at 14:51
  • What would the answer be today, in 2020? I suggest using print CSS and then a modern HTML-to-PDF engine to produce the binary PDF output to be sent to the client's browser. – basZero Jul 16 '20 at 09:07

8 Answers

79

The Flying Saucer XHTML renderer project has support for outputting XHTML to PDF. Have a look at an example here.

davidlj95
Mark
  • 26
    The real problem with Flying Saucer is that it uses iText to render PDF, which is an AGPL v3-licensed library – David Hofmann Nov 27 '12 at 14:29
  • 14
    The version of itext used by Flying Saucer is 2.0.8 which was available under LGPL. Only version numbers 5 or above are on the more restrictive license. http://stackoverflow.com/questions/2692000/can-i-use-a-previous-version-of-itextsharp-under-the-lgpl – Gary - Stand with Ukraine Feb 13 '13 at 14:53
  • 9
    I'd say the real problem with Flying Saucer is that it requires a well-formed and valid XML document. It's easy to unwittingly break the PDF rendering by including something like an ampersand in your HTML, or some javascript code that makes your rendered HTML not strict XHTML. Though this can be mitigated with automated tests or some process that involves XML validation. – SteveT Jun 19 '13 at 13:43
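
The XML validation step mentioned in the comment above could be sketched roughly as follows. This is only a minimal well-formedness check using the JDK's built-in XML parser, not anything taken from Flying Saucer itself; the class name `XhtmlWellFormedCheck` is made up for illustration, and documents carrying a DOCTYPE (which may make the parser fetch an external DTD) are deliberately not handled here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks that markup is well-formed XML before it is
// handed to a renderer that insists on strict XHTML.
final class XhtmlWellFormedCheck {

    static boolean isWellFormed(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xhtml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, e.g. a bare '&' or an unclosed tag.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser unavailable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>Fish &amp; chips</p>"));
        System.out.println(isWellFormed("<p>Fish & chips</p>"));
    }
}
```

A check like this could run in an automated test over each report template, so that a stray ampersand is caught before PDF rendering fails.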
53

Did you try WKHTMLTOPDF?

It's a simple shell utility built on WebKit, an open source rendering engine. Both are free.

We've set up a small tutorial here
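
Since the question asks for something usable from Java, here is a minimal sketch of driving the tool through `ProcessBuilder`. It assumes a `wkhtmltopdf` executable is available on the `PATH` and relies only on the basic `wkhtmltopdf <input> <output>` invocation; the class name `WkhtmltopdfRunner`, the file names, and the 60-second timeout are illustrative choices, not part of the tool.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around the wkhtmltopdf command line tool.
final class WkhtmltopdfRunner {

    // Builds the argument list "wkhtmltopdf <input> <output>".
    // Assumption: the executable is on the PATH under this name.
    static List<String> command(Path inputHtml, Path outputPdf) {
        return List.of("wkhtmltopdf", inputHtml.toString(), outputPdf.toString());
    }

    // Runs the converter; returns true only if it exits with status 0 in time.
    static boolean convert(Path inputHtml, Path outputPdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(inputHtml, outputPdf))
                .redirectErrorStream(true)
                // Discard the tool's console output so the child never blocks on a full pipe.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            boolean ok = convert(Path.of("report.html"), Path.of("report.pdf"), 60);
            System.out.println(ok ? "PDF written" : "Conversion failed or timed out");
        } catch (IOException e) {
            System.out.println("Could not start wkhtmltopdf: " + e.getMessage());
        }
    }
}
```

Each call starts a separate operating system process, so conversions do not share state with the JVM or with each other.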

EDIT (2017):

If I were to build something today, I wouldn't go that route anymore.
I would use http://pdfkit.org/ instead,
probably stripping it of all its Node.js dependencies so that it runs in the browser.

Mic
  • 17
    For a straight html-page-to-pdf conversion, this is better than anything else I've seen, free or commercial. – MGOwen Nov 01 '09 at 23:08
  • Does it work on a non Mac OS? – Eran Medan Mar 26 '11 at 01:55
  • 1
    @Eran, we use it on linux. I think there's a windows version too – Mic Mar 28 '11 at 09:39
  • 1
    @Mic Yes, there is a Windows version too. – Viccari Mar 14 '12 at 16:30
  • tested on windows XP (version 0.9.9) and works very well. Also, does not require admin rights on the machine to install. – Christopher Mahan May 23 '13 at 23:28
  • Why can't we use the real browser for that instead of a fork of the (now unmaintained) rendering engine? See http://stackoverflow.com/q/25574082/39998 – David Hofmann Aug 29 '14 at 19:47
  • @DavidHofmann, probably because this question dates back to 2009. As of the last check I did a few months ago, there was still no comparable solution in JS – Mic Sep 02 '14 at 12:16
  • How would this work in a threaded Enterprise environment that would be generating several hundred pdf files a minute? – IcedDante Nov 07 '14 at 18:51
  • @IcedDante, what makes you think there would be a problem? – Mic Nov 08 '14 at 21:13
  • I guess what I am wondering is if this shell utility creates its own memory space for each invocation or if it operates like a utility in headless mode where each thread would be using a shared resource – IcedDante Nov 08 '14 at 22:10
  • @IcedDante, we have a PDF load similar to yours, but we queue the jobs in a background process to preserve server performance, and run them one by one. However, if I remember correctly, we ran some tests in the beginning and there were no collisions on concurrent calls. – Mic Nov 11 '14 at 12:59
  • i love you for this reference. great utility – Jossef Harush Kadouri Jan 04 '16 at 11:46
  • It's JavaScript, not Java.... – Cardinal System Sep 28 '18 at 21:14
  • @CardinalSystem it's neither JS nor Java, just a command line tool over the library of WKHTMLTOPDF written in c – Mic Oct 01 '18 at 15:14
  • For many simple cases , I still do recommend using a wkhtmltopdf binary – kommradHomer Aug 23 '19 at 06:44
  • Can confirm wkhtmltopdf is a great tool, and easy to use. I've been using it for years and still use it frequently. – Kenny Cason Nov 25 '20 at 17:40
  • From Java, you can use https://github.com/wooio/htmltopdf-java which is a wrapper around wkhtmltopdf – Daniel Jun 27 '21 at 16:45
  • @Danielany may I ask, if you have any experience using it in a web server environment? I mean I think, it won't play nicely with a web server spawning new process for each client request. – ayan ahmedov Mar 20 '22 at 10:28
  • @ayanahmedov, yes, we have been doing that for about 13 years now, on an Ubuntu server with nginx – Mic Mar 21 '22 at 11:05
47

Check out iText; it is a pure Java PDF toolkit which has support for reading data from HTML. I used it recently in a project when I needed to pull content from our CMS and export as PDF files, and it was all rather straightforward. The support for CSS and style tags is pretty limited, but it does render tables without any problems (I never managed to set column width though).

Creating a PDF from HTML goes something like this:

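// Imports assume iText 2.x (com.lowagie); iText 5 moved to com.itextpdf.text.
import java.io.StringReader;
import com.lowagie.text.Document;
import com.lowagie.text.PageSize;
import com.lowagie.text.html.simpleparser.HTMLWorker;
import com.lowagie.text.pdf.PdfWriter;
Document doc = new Document(PageSize.A4);
PdfWriter.getInstance(doc, out); // out: the OutputStream receiving the PDF
doc.open();
HTMLWorker hw = new HTMLWorker(doc);
hw.parse(new StringReader(html)); // html: the markup to convert, as a String
doc.close();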
dStulle
fred-o
  • 9
    It's AGPL, seems even worse than GPL, you need to be open source even if you just serve the PDF and iText is server side. – Eran Medan Mar 26 '11 at 01:54
  • 10
    @Eran, Just use the last non-AGPL version (com.lowagie:itext:2.1.7 in Maven). – Nowaker Apr 20 '11 at 15:11
  • 1
    HTMLWorker is deprecated in newer versions of iText in favor of XMLWorker; however, CSS support is poor in both cases (see http://demo.itextsupport.com/xmlworker/itextdoc/CSS-conformance-list.htm) and was not adequate for my needs. Flying Saucer, by contrast, was perfect. – Pino Nov 12 '13 at 09:35
  • You may use LGPL version which could be found at https://github.com/albfernandez/itext2 – Vova Rozhkov Sep 12 '16 at 11:58
  • HTMLWorker supports very simple HTML documents, with basic elements and no CSS. It is too limited to be useful. But the more recent iText html2pdf works really great https://kb.itextpdf.com/home/it7kb/ebooks/itext-7-converting-html-to-pdf-with-pdfhtml/chapter-1-hello-html-to-pdf – Emmanuel Bourg May 06 '21 at 08:07
4

If you have the funding, nothing beats Prince XML as this video shows

Ólafur Waage
  • 1
    If you're looking for a cheaper alternative for Prince, try DocRaptor.com. It uses Prince as the engine. – Julie Jan 19 '11 at 01:49
  • And if you want something cheaper, but with more options, try http://www.htm2pdf.co.uk - it uses WebKit and offers real WYSIWYG – user1914292 Apr 29 '13 at 07:30
3

Is there maybe a way to grab the rendered page from the Internet Explorer rendering engine and send it to a PDF printer tool automatically?

This is how ActivePDF works, which is good because it means you know what you'll get, and it actually has reasonable styling support.

It is also one of the few packages I found (when looking a few years back) that actually supports the various page-break CSS commands.


Unfortunately, the ActivePDF software is very frustrating - since it has to launch the IE browser in the background for conversions it can be quite slow, and it is not particularly stable either.

There is a new version currently in Beta which is supposed to be much better, but I've not actually had a chance to try it out, so don't know how much of an improvement it is.

Peter Boughton
  • 1
    Thanks for the helpful answer. I don't think ActivePDF is really suitable because of the price, but it's good to know something like that exists. – panschk Mar 11 '09 at 11:05
  • GrabzIt's HTML to PDF API (https://grabz.it/html-to-pdf-image-api.aspx) works the same way: it renders the HTML in a browser and then creates the PDF, which makes for a much more accurate conversion. – user1474090 Jan 13 '17 at 15:38
2

You can use a headless Firefox with an extension. It's pretty annoying to get running, but it does produce good results.

Check out this answer for more info.

Community
rojoca
  • Doesn't sound like a very scalable solution if one needs to convert pages to PDF on the fly, in parallel. If a few requests come through that each trigger a conversion using Firefox, your server will have lost a few gigabytes of memory just to serve a few converted pages. This would open your server to a DoS attack. – mP. Apr 12 '11 at 00:09
  • Better but similar: https://github.com/ariya/phantomjs/wiki/Screen-Capture (according to http://we-love-php.blogspot.com/2012/12/create-pdf-invoices-with-html5-and-phantomjs.html the pdf has real text, not rasterized) – nafg Oct 25 '13 at 02:05
0

Amyuni WebkitPDF could be used with JNI for a Windows-only solution. This is an HTML to PDF/XAML conversion library, free for commercial and non-commercial use.

If the output files are not needed immediately, it may scale better to have a queue and a few background processes that take items from it, convert them, and store the results in the database or on the file system.
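
A rough sketch of such a queue with a few background workers, using only `java.util.concurrent`. The class name `ConversionQueue` is hypothetical, and the converter passed in is a placeholder function: a real setup would call whichever HTML-to-PDF library is chosen and would store the result instead of returning it.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical queue: a fixed pool of workers pulls conversion jobs
// from the executor's internal queue, so callers are never blocked.
final class ConversionQueue implements AutoCloseable {

    private final ExecutorService workers;
    private final Function<String, byte[]> converter;

    ConversionQueue(int workerCount, Function<String, byte[]> converter) {
        this.workers = Executors.newFixedThreadPool(workerCount);
        this.converter = converter;
    }

    // Enqueues one document; the Future completes once a worker has converted it.
    Future<byte[]> submit(String html) {
        Callable<byte[]> job = () -> converter.apply(html);
        return workers.submit(job);
    }

    // Stops accepting new jobs; jobs already queued are still processed.
    @Override
    public void close() {
        workers.shutdown();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in converter (an assumption): it only returns the markup's bytes.
        Function<String, byte[]> fakeConverter =
                html -> html.getBytes(StandardCharsets.UTF_8);
        try (ConversionQueue queue = new ConversionQueue(2, fakeConverter)) {
            Future<byte[]> result = queue.submit("<p>report</p>");
            System.out.println("Converted bytes: " + result.get().length);
        }
    }
}
```

Capping the worker count bounds how many conversions run at once, which keeps memory use predictable under load.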

usual disclaimer applies

yms
0

If you look at the side bar of your question, you will see many related questions...

In your context, the simplest method might be to install a PDF print driver like PDFCreator and just print the page to that output.

PhiLho
  • How is this a Java solution? This is a Windows print driver. – Gray Mar 07 '16 at 14:51
  • The OP explicitly mentioned Windows. And I suppose there are similar drivers for other systems. The OP only mentioned Java as a possible solution... – PhiLho Mar 07 '16 at 15:05