How to convert webpage into PDF by using Python

Question

I was looking for a way to save a webpage as a local PDF file using Python. One good solution is to use Qt, as described here: https://bharatikunal.wordpress.com/2010/01/.

It didn't work at first because PyQt4 was not installed properly, and I got error messages such as 'ImportError: No module named PyQt4.QtCore'.

My other libraries live under C:\Python27\Lib, but PyQt4 does not get installed that way.

Instead, simply download the installer from http://www.riverbankcomputing.com/software/pyqt/download (pick the build that matches your Python version) and install it to C:\Python27 (in my case). That's it.

Now the script runs fine, so I want to share it. For more QPrinter options, please refer to http://qt-project.org/doc/qt-4.8/qprinter.html#Orientation-enum.

Note that you can post a Q&A simultaneously if you're self-answering, and the usual quality rules still apply to both parts. — jonrsharpe, Jan 29 '18 at 22:24

score 187 · Answer 1 · edited Jul 31 '20 at 10:10

You can also use pdfkit:

Usage

import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')

Install

MacOS: brew install Caskroom/cask/wkhtmltopdf

Debian/Ubuntu: apt-get install wkhtmltopdf

Windows: choco install wkhtmltopdf

See official documentation for MacOS/Ubuntu/other OS: https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf
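
If the HTML is built in Python rather than fetched from a URL, pdfkit's from_string can take it directly. The following is only a minimal sketch, assuming pdfkit and the wkhtmltopdf binary are both installed; rows_to_html_table and table_to_pdf are hypothetical helper names, not part of pdfkit.

```python
from html import escape


def rows_to_html_table(header, rows):
    """Build a minimal HTML document containing a single table."""
    head = "".join("<th>{}</th>".format(escape(str(h))) for h in header)
    body = "".join(
        "<tr>{}</tr>".format(
            "".join("<td>{}</td>".format(escape(str(cell))) for cell in row)
        )
        for row in rows
    )
    return (
        "<html><head><meta charset='utf-8'></head><body>"
        "<table border='1'><tr>{}</tr>{}</table>"
        "</body></html>".format(head, body)
    )


def table_to_pdf(header, rows, out_path):
    """Render the table to a PDF file (hypothetical helper)."""
    # Imported lazily: pdfkit is third-party and needs wkhtmltopdf installed.
    import pdfkit

    pdfkit.from_string(rows_to_html_table(header, rows), out_path)
```

A call such as table_to_pdf(["name", "qty"], [["apple", 3]], "table.pdf") would then go through the same wkhtmltopdf backend as from_url.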

answered May 20 '14 at 13:24

NorthCat

This is awesome, way easier than messing around with reportlab or using a print drive to convert. Thanks so much. – Dowlers May 27 '15 at 18:56
@NorthCat can u give another example about converting html tables with pdfkit ? – Babel Aug 12 '15 at 14:44
I want to print it out as pdf in downloadable format. – Hariharan Srinivasan May 09 '17 at 10:19
PDFKit requires a running X Server (or "virtual" X Server). :( See here: https://github.com/JazzCore/python-pdfkit/wiki/Using-wkhtmltopdf-without-X-server – Tim Ludwinski Sep 05 '17 at 19:57
It seems like windows does not support pdfkit. Is that true? – Kane Chew Nov 15 '17 at 16:47
Perfect !! Even download the embeded images, don't bother use that ! You'll have to `apt-get install wkhtmltopdf` – Tinmarino Jan 28 '18 at 21:16
pdfkit depends on non-python package wkhtmltopdf, which in turn requires a running X server. So while nice in some environments, this is not an answer that works generally in python. – Rasmus Kaj Feb 19 '18 at 16:33
is there an analogous method to convert from local html files? – 3pitt Dec 14 '18 at 15:32
@Kane Chew - I ran into same problem. – Nguai al Jan 24 '19 at 20:45
@NorthCat pdfkit works fine for small HTML pages. If HTML page have more then 15 pages it will print half of pdf. Any solution that you can suggest ? – Binit Singh Dec 02 '19 at 10:04
for me it only prints the images, but no text at all. – Fábio Feb 12 '20 at 16:52
@NorthCat I am using pdfkit but not able to get the pdf when there is JS in the html file. Can you help with that? – sanyam May 08 '20 at 09:12
@Kane Chew ( tested in 2020 ) wkhtmltopdf can be installed in windows Vista and later without a problem, so PDFkit works as well. – Charalamm Aug 08 '20 at 09:47
on a mac (at least), wkhtmltopdf drops all the html formatting. Also, the procedure to install this tool with homebrew is kind of shaky. – Colin Bernet May 29 '21 at 10:42
very nice! but can we add a table of contents. maybe you can answer this question: https://stackoverflow.com/questions/69146224/create-table-of-contents-with-pdfkit-in-python – yishairasowsky Sep 23 '21 at 12:36
Is there a way to do the same without having to install wkhtmltopdf? Is there a way to use python libraries only? – abc Nov 22 '21 at 08:54
I failed to install wkhtmltopdf for my ubuntu 18, so pdfkit not for me to use. – Nam G VU Feb 22 '22 at 20:07
This package seems to not be maintained anymore... https://github.com/JazzCore/python-pdfkit/issues/242 – Salem Jun 28 '23 at 12:52

score 68 · Answer 2 · edited Aug 12 '20 at 08:48

WeasyPrint

pip install weasyprint  # No longer supports Python 2.x.

python
>>> import weasyprint
>>> pdf = weasyprint.HTML('http://www.google.com').write_pdf()
>>> len(pdf)
92059
>>> open('google.pdf', 'wb').write(pdf)
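
weasyprint.HTML also accepts a local file through its filename argument, and write_pdf can write straight to a target path instead of returning bytes. A small sketch under those assumptions; weasyprint_source and render_pdf are hypothetical helper names, and WeasyPrint itself must be installed for render_pdf to run:

```python
from pathlib import Path
from urllib.parse import urlparse


def weasyprint_source(src):
    """Return keyword arguments for weasyprint.HTML: a URL or a local file."""
    if urlparse(src).scheme in ("http", "https"):
        return {"url": src}
    return {"filename": str(Path(src))}


def render_pdf(src, out_path):
    """Render a URL or a local HTML file to out_path (hypothetical helper)."""
    # Imported lazily: weasyprint is third-party and has system dependencies.
    import weasyprint

    weasyprint.HTML(**weasyprint_source(src)).write_pdf(out_path)
```

For example, render_pdf('page.html', 'page.pdf') would read the local file, while render_pdf('http://www.google.com', 'google.pdf') would fetch the URL.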

edited Aug 12 '20 at 08:48

Sunit Gautam

answered Dec 23 '15 at 15:04

JohnMudd

Can I provide file path instead of url? – Piyush S. Wanare Sep 29 '17 at 06:57
I think I will prefer this project as it's dependencies are python packages rather than a system package. As of Jan 2018 it seems to have more frequent updates and better documentation. – stvsmth Jan 04 '18 at 23:13
Is there any way to have more complex headers and footers with this? `position: absolute` doesn't duplicate across pages, and `@top-center` only allows plain text with a few gimmicks. – weltensturm Jan 16 '18 at 08:41
There are too many things to install. I stopped at libpango and went for the pdfkit. Nasty for system wide wkhtmltopdf but weasyprint also require some system wide installs. – visoft Jul 17 '18 at 08:39
this won't convert `javascripts` in the html file. for that you need to use `pdfkit` – suhailvs May 22 '19 at 11:16
I would believe the option should be `'wb'`, not `'w'`, because `pdf` is a `bytes` object. – Anatoly Scherbakov Aug 13 '19 at 07:01
for me it only downloads the first page and ignore the rest – Fábio Feb 12 '20 at 16:53
Are you able to get the pdf if there is JS in the html document. – sanyam May 08 '20 at 09:12
This printed raw MathJax on the page. Is there a way to render first and then print the page to a PDF? The defualt print page function (Ctrl+P) of my browser (Firefox) indeed rendered it correctly but facing an issue with WeasyPrint – Sunit Gautam Aug 12 '20 at 07:36
WeasyPrint works slowly on large HTML pages. In my situation, it was 50s vs 7s (pdfkit) – Artem Bernatskyi Oct 29 '20 at 18:57

score 25 · Accepted Answer · edited May 23 '17 at 12:18

Thanks to the posts below, I was able to add the webpage's URL and the current time to every page of the generated PDF, no matter how many pages it has.

Add text to Existing PDF using Python

https://github.com/disflux/django-mtr/blob/master/pdfgen/doc_overlay.py

Here is the script:

import time
from pyPdf import PdfFileWriter, PdfFileReader
import StringIO
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from xhtml2pdf import pisa
import sys 
from PyQt4.QtCore import *
from PyQt4.QtGui import * 
from PyQt4.QtWebKit import * 

url = 'http://www.yahoo.com'
tem_pdf = "c:\\tem_pdf.pdf"
final_file = "c:\\younameit.pdf"

app = QApplication(sys.argv)
web = QWebView()
#Read the URL given
web.load(QUrl(url))
printer = QPrinter()
#setting format
printer.setPageSize(QPrinter.A4)
printer.setOrientation(QPrinter.Landscape)
printer.setOutputFormat(QPrinter.PdfFormat)
#export file as c:\tem_pdf.pdf
printer.setOutputFileName(tem_pdf)

def convertIt():
    web.print_(printer)
    QApplication.exit()

QObject.connect(web, SIGNAL("loadFinished(bool)"), convertIt)

# Blocks until convertIt() calls QApplication.exit(); the script then continues below
app.exec_()

# Below: stamp the URL and the current date/time onto every page of the generated PDF

packet = StringIO.StringIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.setFont("Helvetica", 9)
# Writing the new line
oknow = time.strftime("%a, %d %b %Y %H:%M")
can.drawString(5, 2, url)
can.drawString(605, 2, oknow)
can.save()

#move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(file(tem_pdf, "rb"))
pages = existing_pdf.getNumPages()
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
for x in range(0,pages):
    page = existing_pdf.getPage(x)
    page.mergePage(new_pdf.getPage(0))
    output.addPage(page)
# finally, write "output" to a real file
outputStream = file(final_file, "wb")
output.write(outputStream)
outputStream.close()

print final_file, 'is ready.'
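
The script above is Python 2 (pyPdf, StringIO, the file() builtin and the print statement). A rough Python 3 sketch of the stamping step only is given here, assuming the third-party packages pypdf and reportlab are installed. The names footer_stamp, make_footer_overlay and stamp_pdf are hypothetical, and unlike the original the overlay is sized to each page instead of being fixed to letter.

```python
import io
import time

STAMP_FORMAT = "%a, %d %b %Y %H:%M"  # same format string as the script above


def footer_stamp(t=None):
    """Format a struct_time (default: now) for the page footer."""
    return time.strftime(STAMP_FORMAT, time.localtime() if t is None else t)


def make_footer_overlay(url, stamp, width, height):
    """Return an in-memory one-page PDF with the URL and timestamp at the bottom."""
    from reportlab.pdfgen import canvas  # third-party, imported lazily

    packet = io.BytesIO()
    can = canvas.Canvas(packet, pagesize=(width, height))
    can.setFont("Helvetica", 9)
    can.drawString(5, 2, url)                 # bottom-left corner
    can.drawRightString(width - 5, 2, stamp)  # bottom-right corner
    can.save()
    packet.seek(0)
    return packet


def stamp_pdf(src_path, dst_path, url):
    """Merge the footer overlay onto every page of src_path and write dst_path."""
    from pypdf import PdfReader, PdfWriter  # third-party, imported lazily

    stamp = footer_stamp()
    writer = PdfWriter()
    for page in PdfReader(src_path).pages:
        width = float(page.mediabox.width)
        height = float(page.mediabox.height)
        overlay = PdfReader(make_footer_overlay(url, stamp, width, height)).pages[0]
        page.merge_page(overlay)
        writer.add_page(page)
    with open(dst_path, "wb") as f:
        writer.write(f)
```

This covers only the overlay part; producing the temporary PDF would still be done by whichever renderer you use.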

Thanks for sharing your code! Any advice for making this work for local pdf files? Or is it as easy as prepending "file:///" to the url? I'm not very familiar with these libraries... thanks — sam-6174, Oct 31 '14 at 18:02
@user2426679, you mean convert online PDF into local PDF files? — Mark K, Nov 25 '14 at 01:48
thanks for your reply... sorry for my tardiness. I ended up using wkhtmltopdf since it was able to handle what I was throwing at it. But I was asking how to load a pdf that was local to my hdd. Cheers — sam-6174, Dec 28 '14 at 23:15
@user2426679 sorry I still don't get you. maybe because I am a newbie to Python too. You meant read local PDF files in Python? — Mark K, Jan 22 '15 at 08:05
There were some issues with `html5lib`, which is used by xhtml2pdf. This solution fixed the problem: https://github.com/xhtml2pdf/xhtml2pdf/issues/318 — Blairg23, Oct 14 '16 at 21:22
i dont think this works anymore, `No module named 'pdf'`. iirc PyPdf2 has been deprecated in favor of PyPdf2 or something which is recommended by the authors — 3pitt, Dec 14 '18 at 15:07
@MikePalmice, thank you for the comment. I just tried again the full lines and seems it still works, except 1 of the picture's shape enlarged. (anyway as long as there're better solutions above to this question, why not try them) :) — Mark K, Dec 17 '18 at 01:19

score 15 · Answer 4 · edited Mar 14 '21 at 20:38

Per this answer: How to convert webpage into PDF by using Python, the advice was to use pdfkit. You also have to install wkhtmltopdf.

If you have a local .html file, you then need to use this command:

pdfkit.from_file('test.html', 'out.pdf')

But this will throw an error if you haven't added the wkhtmltopdf executables to your system path. This was the part that tripped me up and I wanted to share.

On Windows, open your environment variables and add them to your System variables > Path like below. In my case, these .exe files were located here after I installed the wkhtmltopdf from an exe:

C:\Program Files\wkhtmltopdf\bin
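
To confirm that the Path change took effect before calling pdfkit, the standard library's shutil.which can look the executable up the same way the shell would. A minimal sketch; locate_wkhtmltopdf and html_file_to_pdf are hypothetical helper names, and pdfkit is assumed to be installed:

```python
import shutil


def locate_wkhtmltopdf(name="wkhtmltopdf"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def html_file_to_pdf(html_path, pdf_path):
    """Convert a local HTML file, failing early with a readable message."""
    if locate_wkhtmltopdf() is None:
        raise RuntimeError(
            "wkhtmltopdf was not found on PATH; add its bin folder to Path "
            "and open a new terminal"
        )
    # Imported lazily: pdfkit is third-party.
    import pdfkit

    pdfkit.from_file(html_path, pdf_path)
```

Note that an already-open terminal or IDE usually needs a restart before it sees the updated Path.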

I was facing the same issue on Win10, this helped, thanks a ton. — kudo_shinichi, Mar 06 '22 at 12:18

score 14 · Answer 5 · edited Oct 26 '19 at 17:32

Here is a version that works:

import sys 
from PyQt4.QtCore import *
from PyQt4.QtGui import * 
from PyQt4.QtWebKit import * 

app = QApplication(sys.argv)
web = QWebView()
web.load(QUrl("http://www.yahoo.com"))
printer = QPrinter()
printer.setPageSize(QPrinter.A4)
printer.setOutputFormat(QPrinter.PdfFormat)
printer.setOutputFileName("fileOK.pdf")

def convertIt():
    web.print_(printer)
    print("Pdf generated")
    QApplication.exit()

QObject.connect(web, SIGNAL("loadFinished(bool)"), convertIt)
sys.exit(app.exec_())

edited Oct 26 '19 at 17:32

FractalSpace

answered Apr 29 '14 at 08:11

Mark K

Interestingly, the web page links are generated as text rather than links in the generated PDF. – amergin Nov 24 '14 at 17:16
Anyone know why this would be generating blank pdfs for me? – boson Oct 04 '16 at 14:33

Jim Paul · Answer 6 · 2015-03-12T13:31:32.973

Here is a simple solution using Qt. I found it as part of an answer to a different question on Stack Overflow, and I tested it on Windows.

import sys

from PyQt4.QtGui import QTextDocument, QPrinter, QApplication

app = QApplication(sys.argv)

# Load the local HTML file into a QTextDocument (supports only a subset of HTML)
doc = QTextDocument()
location = "c://apython//Jim//html//notes.html"
with open(location) as f:
    html = f.read()
doc.setHtml(html)

# Configure the printer to write an A4 PDF with 15 mm margins
printer = QPrinter()
printer.setOutputFileName("foo.pdf")
printer.setOutputFormat(QPrinter.PdfFormat)
printer.setPageSize(QPrinter.A4)
printer.setPageMargins(15, 15, 15, 15, QPrinter.Millimeter)

doc.print_(printer)
print("done!")

score 9 · Answer 7 · answered Oct 18 '19 at 02:09

I tried @NorthCat's answer using pdfkit.

It requires wkhtmltopdf to be installed. The installer can be downloaded from https://wkhtmltopdf.org/downloads.html

Run the installer, then add a line telling pdfkit where wkhtmltopdf is, as below (taken from: Can't create pdf using python PDFKIT Error : " No wkhtmltopdf executable found:").

import pdfkit

# Full path to the wkhtmltopdf executable on this machine
path_wkhtmltopdf = "C:\\Folder\\where\\wkhtmltopdf.exe"
config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)

pdfkit.from_url("http://google.com", "out.pdf", configuration=config)
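
When the same script has to run on machines where wkhtmltopdf lives in different folders, one option is to try a list of candidate locations and build the configuration from the first one that exists. This is only a sketch: first_existing and url_to_pdf are hypothetical helper names, the candidate paths are supplied by the caller, and pdfkit is assumed to be installed.

```python
import os


def first_existing(paths):
    """Return the first path in the list that is an existing file, else None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


def url_to_pdf(url, out_path, candidates):
    """Convert url to out_path using the first wkhtmltopdf found in candidates."""
    exe = first_existing(candidates)
    if exe is None:
        raise RuntimeError("no wkhtmltopdf executable found in: %r" % (candidates,))
    # Imported lazily: pdfkit is third-party.
    import pdfkit

    config = pdfkit.configuration(wkhtmltopdf=exe)
    pdfkit.from_url(url, out_path, configuration=config)
```

The candidates list might hold, for instance, the install folder you chose on Windows plus a Linux location; which paths to include is up to you.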

where did it go after I clicked .deb and installed on software centre? — mLstudent33, Nov 07 '20 at 23:55

score 6 · Answer 8 · answered Aug 06 '20 at 19:39

This solution worked for me using PyQt5 version 5.15.0

import sys
from PyQt5 import QtWidgets, QtWebEngineWidgets
from PyQt5.QtCore import QUrl
from PyQt5.QtGui import QPageLayout, QPageSize
from PyQt5.QtWidgets import QApplication

if __name__ == '__main__':
    app = QtWidgets.QApplication(sys.argv)
    loader = QtWebEngineWidgets.QWebEngineView()
    loader.setZoomFactor(1)
    layout = QPageLayout()
    layout.setPageSize(QPageSize(QPageSize.A4Extra))
    layout.setOrientation(QPageLayout.Portrait)
    loader.load(QUrl('https://stackoverflow.com/questions/23359083/how-to-convert-webpage-into-pdf-by-using-python'))
    loader.page().pdfPrintingFinished.connect(lambda *args: QApplication.exit())

    def emit_pdf(finished):
        loader.page().printToPdf("test.pdf", pageLayout=layout)

    loader.loadFinished.connect(emit_pdf)
    sys.exit(app.exec_())

Y.kh

I tried this and get this error: Traceback (most recent call last): File "C:/Users/brentond/Documents/Python/PdfWebsite.py", line 2, in from PyQt5 import QtWidgets, QtWebEngineWidgets ImportError: DLL load failed: The specified module could not be found. – Dan Jan 20 '21 at 17:03
You have to install the PyQt5 package first: pip install PyQt5 – Y.kh Jan 21 '21 at 16:17
I do have it installed... But as far as I can see there is no PyQt5 method called QtwebEngineWidgets... At least not in 5.15.2 that I have installed in PyCharm. – Dan Jan 29 '21 at 14:43
You _also_ need to `pip install PyQtWebEngine` for this to work – Dániel Kis-Nagy Apr 29 '21 at 15:52

score 5 · Answer 9 · edited Aug 16 '23 at 09:37

If you use Selenium with Chromium, you do not need to manage cookies yourself, and you can generate the PDF with Chromium's print-to-PDF feature. This project shows how to do it: https://github.com/maxvst/python-selenium-chrome-html-to-pdf-converter

Modified from: https://github.com/maxvst/python-selenium-chrome-html-to-pdf-converter/blob/master/sample/html_to_pdf_converter.py

import sys
import json, base64


def send_devtools(driver, cmd, params={}):
    resource = "/session/%s/chromium/send_command_and_get_result" % driver.session_id
    url = driver.command_executor._url + resource
    body = json.dumps({'cmd': cmd, 'params': params})
    response = driver.command_executor._request('POST', url, body)
    return response.get('value')


def get_pdf_from_html(driver, url, print_options={}, output_file_path="example.pdf"):
    driver.get(url)

    calculated_print_options = {
        'landscape': False,
        'displayHeaderFooter': False,
        'printBackground': True,
        'preferCSSPageSize': True,
    }
    calculated_print_options.update(print_options)
    result = send_devtools(driver, "Page.printToPDF", calculated_print_options)
    data = base64.b64decode(result['data'])
    with open(output_file_path, "wb") as f:
        f.write(data)



# example
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import shutil

# Check for the existence of the chromedriver executable
chromedriver = shutil.which("chromedriver")
assert chromedriver is not None, "chromedriver not on PATH"

url = "https://stackoverflow.com/questions/23359083/how-to-convert-webpage-into-pdf-by-using-python#"
webdriver_options = Options()
webdriver_options.add_argument("--no-sandbox")
webdriver_options.add_argument('--headless')
webdriver_options.add_argument('--disable-gpu')

driver = webdriver.Chrome(chromedriver, options=webdriver_options)
get_pdf_from_html(driver, url)
driver.quit()
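
The send_devtools helper above relies on private attributes of command_executor (_url and _request). With Selenium 4, the Chrome driver exposes execute_cdp_cmd, which should be able to send the same Page.printToPDF command without them. The following is a hedged sketch rather than a drop-in replacement: build_print_options, decode_pdf and url_to_pdf are hypothetical names, and it assumes Selenium 4 with a Chrome/chromedriver pair that Selenium can locate on its own.

```python
import base64

# Same defaults as in the code above
DEFAULT_PRINT_OPTIONS = {
    "landscape": False,
    "displayHeaderFooter": False,
    "printBackground": True,
    "preferCSSPageSize": True,
}


def build_print_options(overrides=None):
    """Merge caller overrides into a fresh copy of the default options."""
    options = dict(DEFAULT_PRINT_OPTIONS)
    options.update(overrides or {})
    return options


def decode_pdf(result):
    """Decode the base64 'data' field returned by Page.printToPDF."""
    return base64.b64decode(result["data"])


def url_to_pdf(url, output_file_path="example.pdf", overrides=None):
    """Load url in headless Chrome and save it as a PDF."""
    # Imported lazily: selenium is third-party and needs Chrome installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get(url)
        result = driver.execute_cdp_cmd(
            "Page.printToPDF", build_print_options(overrides)
        )
    finally:
        driver.quit()  # always close the browser, even on errors
    with open(output_file_path, "wb") as f:
        f.write(decode_pdf(result))
```

Copying the defaults on every call avoids the shared mutable default argument used in the original helper signatures.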

edited Aug 16 '23 at 09:37

David Golembiowski

answered Jul 26 '20 at 13:31

Yuanmeng Xiao

Firstly i use weasyprint but it do not support cookies even you can write your own `default_url_fetcher` to handle cookies but later i occur issue when install it in Ubuntu16.Then i use wkhtmltopdf it suport cookie setting but it caused many OSERROR like -15 -11 when handle some page. – Yuanmeng Xiao Jul 26 '20 at 13:35
Thank you for sharing Mr. @Yuanmeng Xiao. – Mark K Jul 27 '20 at 01:16
Hi @YuanmengXiao I copied your code above and I get this error: Traceback (most recent call last): File "C:/Users/brentond/Documents/Python/PdfWebsite.py", line 39, in driver = webdriver.Chrome(chromedriver, options=webdriver_options) NameError: name 'chromedriver' is not defined – Dan Jan 20 '21 at 16:51
I then installed a module called chromedriver and imported it to the above code and now get this error Traceback (most recent call last): File "C:/Users/brentond/Documents/Python/PdfWebsite.py", line 33, in import chromedriver File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\chromedriver\__init__.py", line 16, in raise RuntimeError('This package supports only Linux, MacOSX or Windows platforms') RuntimeError: This package supports only Linux, MacOSX or Windows platforms – Dan Jan 20 '21 at 16:56
you should download chromedrver from https://chromedriver.chromium.org/ And you would better learn how to use selenium to driver chrome browser. – Yuanmeng Xiao Jan 21 '21 at 21:59
Thanks Yuanmeng Xiao. I mostly do Python stuff for work and we aren't allowed to download and install extra stuff like this so I was hoping to be able to pdf websites just with a python module within Pycharm. – Dan Jan 29 '21 at 14:48
Verify that either `webdriver_options.binary_location` is assigned the correct path to `chromedriver` or ensure `chromedriver` is the string literal of the path to it. – David Golembiowski Aug 14 '23 at 19:32

score 0 · Answer 10 · answered Sep 27 '22 at 05:21

As explained in another answer, if you have .html files locally you can use the following:

pdfkit.from_file('abc.html', 'abc.pdf')

Additionally, if your source HTML file has img tags, each src should be a relative path, and you have to pass this option to allow local file access:

pdfkit.from_file('abc.html', 'abc.pdf', options={"enable-local-file-access": ""})

Otherwise you may run into the following error:

OSError: wkhtmltopdf reported an error: Exit with code 1 due to network error: ProtocolUnknownError

Source: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2660#issuecomment-663063752

pdfkit error: Exit with code 1 due to network error: ProtocolUnknownError
