6

Is there any way to use Python to create PDF documents from HTML/CSS/Javascript, without introducing any OS-level dependencies?

It seems every existing solution requires special supplemental software, but upon reviewing PDF formatting specifications and HTML/CSS/Javascript rendering, there doesn't appear to be a reason why a Python solution can't exist without them. Some solutions come close, such as pyppeteer, but it still leans on a headless Chrome installation locally. These dependencies mean that microservices can't be leveraged, even though PDF generation would otherwise seem to be a viable use case for them.
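For reference, the pyppeteer route mentioned above looks roughly like the sketch below. It is only an illustration under assumptions: the function name `html_to_pdf` and the A4 page format are made up for this example, and it still depends on a Chromium binary being available to pyppeteer, which is exactly the dependency in question.

```python
import asyncio


async def html_to_pdf(html: str, output_path: str) -> None:
    """Render an HTML string (inline scripts included) to a PDF file."""
    # Imported lazily so this module still loads where pyppeteer is absent.
    from pyppeteer import launch

    browser = await launch()  # starts the headless Chromium pyppeteer manages
    try:
        page = await browser.newPage()
        await page.setContent(html)  # inline <script> tags run inside the page
        await page.pdf({"path": output_path, "format": "A4"})  # A4 is an assumed choice
    finally:
        await browser.close()
```

It could be called with something like `asyncio.run(html_to_pdf("<h1>Hello</h1>", "out.pdf"))`.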

While similar questions have come up many times over on SO, there doesn't appear to have been a viable technique shown without having to install specialized dependencies on the OS.

Some similar questions which routinely recommend wkhtmltopdf or are otherwise out of date (e.g., moving PDF printing support outside of Chrome is dead now):

If I've somehow missed a viable approach, please feel free to mark this as a duplicate with my thanks!

Edit February 2021: It appears that the cefpython project may meet these demands - PDF printing support seems like it could be implemented in the near future.

bsplosion
  • Can ship to an external webservice? – user2864740 Aug 05 '19 at 17:27
  • Anything external wouldn't really be in Python at that point - I'd say code golf rules in this case, where fetching the solution from a remote service would be invalid. – bsplosion Aug 05 '19 at 17:30
  • What if you use ctypes to call a C/C++ function from a DLL that does all the magic, like numpy and tensorflow do? If that's valid, this is a library: https://github.com/galkahana/PDF-Writer/wiki I can add how to do that, with a sample, as an answer if it's still valid. – Nosvan Aug 05 '19 at 17:50
  • @Nosvan, that'd be interesting! If it's possible to do this exclusively from code and imports (no OS interaction at all), then it's a contender. The [how-to site](https://pdfhummus.com/How-To) doesn't appear to mention rendering - how would you propose we get the objects into a format which PDF-Writer could then use? – bsplosion Aug 05 '19 at 20:07
  • It's tricky to actually render HTML pages. You need a renderer, but most of them are headless or full-fledged browsers. The simplest one I found was https://github.com/litehtml/litehtml. It is only an HTML parser with an interface, and it accepts only CSS and HTML. Together with the PDF creator, the way to do this would be to implement the render interface using the PDF creator's methods, and finally write a Python wrapper. – Nosvan Aug 06 '19 at 13:16
  • This basically means that I couldn't find a ready-to-use solution that meets your requirements, and this would be the way to implement one: connect two C++ libraries (a renderer and a PDF creator) and write a wrapper in Python. I hope you find an easier, ready-to-use solution. – Nosvan Aug 06 '19 at 13:18
  • Thanks for looking into it! Sounds like the best option remains [pyppeteer](https://github.com/miyakogi/pyppeteer) for now, which comes very close since, after `pip install pyppeteer`, it downloads the version of Chromium it requires by itself the first time it is run. – bsplosion Aug 06 '19 at 16:00
  • To clarify a bit further what others have said: if you want it to render javascript, it must include a javascript engine. There is no ECMAscript compliant javascript engine written in pure python that is well-maintained (that would be a huge project)... So you will always need native code as most HTML renderers and javascript engines are usually developed in C++. These two things are the major part of a browser, so a headless browser is a good solution to this requirement. – reverse_engineer Nov 29 '20 at 07:30
  • @reverse_engineer that's all fair, and perhaps that's really the answer - there is no JavaScript engine written in Python, and there doesn't appear to be much incentive to do so either judging from some abandoned projects from years past. If you want to rephrase your comment as an answer, that seems acceptable. – bsplosion Nov 29 '20 at 19:25
  • OK, I wrote it as an answer! – reverse_engineer Nov 30 '20 at 10:06

2 Answers

3

Try this library: xhtml2pdf

It worked for me. Here is the documentation: doc

Some sample code:

from xhtml2pdf import pisa

def convert_html_to_pdf(source_html, output_filename):
    # open the output file for writing (binary, truncating existing content);
    # the with-block closes it even if the conversion raises
    with open(output_filename, "w+b") as result_file:
        # convert HTML to PDF
        pisa_status = pisa.CreatePDF(
                source_html,                # the HTML to convert (string or open file)
                dest=result_file)           # file handle to receive the result

    # falsy (0) on success, truthy on errors
    return pisa_status.err

# Define your data
output_filename = "test.pdf"
with open('2020-06.html') as source_html:
    convert_html_to_pdf(source_html, output_filename)
riffraff
Acaelesto
  • Thanks for the suggestion! Unfortunately, `xhtml2pdf` does not support any form of Javascript. I've updated the title of the question to make that requirement more clear - it was only mentioned in the question body previously. – bsplosion Jul 15 '20 at 19:54
3

So to clarify and formalize what others have said:

  • If you want to create PDF documents from HTML/CSS/JavaScript content, you will necessarily need a JavaScript engine, because the JavaScript has to be executed whenever it affects the visuals of the document. This is the most complex component you need.

  • As of now, there is no well-maintained ECMAScript-compliant engine written in pure Python (that would be a huge project). There will probably never be one, since compilers and VMs for languages need to be performant and are thus usually written in a performant low-level language.

  • So you will always need compiled binaries for the engine, and also for the HTML renderer. Renderers are less complex, but they too need to be performant when used in browsers, so they are usually written in C++ or the like.

  • The JavaScript engine and the HTML renderer make up the major part of a browser, so a headless browser is a good solution to this requirement.
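
To make the first point concrete, here is a minimal standard-library sketch of what a converter without a JavaScript engine gets to see. The sample page and the names `PAGE`, `VisibleText` and `static_text` are invented for this illustration; the point is only that text produced by a script never shows up in the static markup.

```python
from html.parser import HTMLParser

# Hypothetical page: the paragraph is empty until the script fills it in.
PAGE = """
<html><body>
  <p id="total"></p>
  <script>
    document.getElementById("total").textContent = "Total: " + (2 + 3);
  </script>
</body></html>
"""


class VisibleText(HTMLParser):
    """Collect text outside <script>/<style>, as a static converter would."""

    def __init__(self):
        super().__init__()
        self._skip = 0      # depth of script/style elements we are inside
        self.chunks = []    # visible text fragments found so far

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not script or style source.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text visible without executing any JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(repr(static_text(PAGE)))  # prints '' because the script was never run
```

A browser would show "Total: 5" for the same page, which is the gap that only a JavaScript engine can close.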

reverse_engineer