I want to convert the Shiny for Python documentation into a PDF. Jumping to each section and then printing it to PDF is possible, but I am wondering if there is a more compact way to print all sections in one go.
- Do you want to print one time? Do you want to do something like add a button and be able to print on demand? If one time, do you want to use the console in developer tools? – Kat Dec 30 '22 at 01:42
- I'm not sure I understand what you're going for. In any case, you could run `document.querySelectorAll("a[href*=https]").forEach(x => console.log(x.href))` to get all URLs (not recursively, obviously), then go to every URL with Selenium, wait for the element to load (however you want), and [screenshot](https://www.geeksforgeeks.org/screenshot-element-method-selenium-python/) or [save as pdf](https://stackoverflow.com/questions/56897041/how-to-save-opened-page-as-pdf-in-selenium-python). Perhaps remove unwanted elements before doing that. – Yarin_007 Dec 30 '22 at 17:49
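For reference, a minimal sketch of the Selenium route described in that comment, assuming Selenium 4 with headless Chrome, and borrowing the `a.sidebar-link` selector from the answer below (both assumptions; adjust the selector and the waiting strategy to taste):

import base64
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.print_page_options import PrintOptions

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# collect the section URLs from the docs sidebar
driver.get("https://shiny.rstudio.com/py/docs/get-started.html")
urls = [a.get_attribute("href")
        for a in driver.find_elements(By.CSS_SELECTOR, "a.sidebar-link")]

# visit each page and print it to PDF via the DevTools print endpoint
for i, url in enumerate(urls):
    driver.get(url)
    pdf_base64 = driver.print_page(PrintOptions())  # base64-encoded PDF
    with open(f"section_{i}.pdf", "wb") as f:
        f.write(base64.b64decode(pdf_base64))

driver.quit()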
1 Answer
I can propose a solution based on wkhtmltopdf and Python: scrape the links to the HTML files for the different sections of the docs and pass them to pdfkit, a Python library that wraps the wkhtmltopdf utility for converting HTML to PDF.
First, download wkhtmltopdf and install it on your system (you may read this to get help with the installation process; if you are a Windows user, remember to add wkhtmltopdf to PATH).
Then check its availability from cmd/shell:
$ wkhtmltopdf --version
# wkhtmltopdf 0.12.6 (with patched qt)
Next, install these Python libraries (assuming you have Python installed):
pip install requests beautifulsoup4 pdfkit
and then run the following Python script:
$ python html2pdf.py
html2pdf.py
import re

import pdfkit
import requests
from bs4 import BeautifulSoup

# Making a GET request to one of the docs pages
r = requests.get('https://shiny.rstudio.com/py/docs/get-started.html')
# print(r.status_code)
# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
a = soup.find_all('a', class_='sidebar-link')
# get the links from the sidebar
links = [link.get('href') for link in a if link.get('href') is not None]
# the hrefs are relative (they start with '..'), so rebuild absolute URLs
site_link = 'https://shiny.rstudio.com/py'
full_links = [site_link + link[2:] for link in links]
# derive a file name from the last path component of each URL
names = [re.findall(r".+/(.+)\.html", link)[0] for link in full_links]
# convert each HTML page to a PDF
for i, link in enumerate(full_links):
    pdfkit.from_url(link, f"{names[i]}.pdf")
This converts all the HTML files (the links in the sidebar of https://shiny.rstudio.com/py/docs/) into PDF files in one go:
$ ls
get-started.pdf reactive-programming.pdf ui-navigation.pdf
html2pdf.py reactive-values.pdf ui-page-layouts.pdf
overview.pdf running-debugging.pdf ui-static.pdf
putting-it-together.pdf server.pdf user-interface.pdf
reactive-calculations.pdf ui-dynamic.pdf workflow-modules.pdf
reactive-events.pdf ui-feedback.pdf workflow-server.pdf
reactive-mutable.pdf ui-html.pdf
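If a single combined document is preferred, the per-section PDFs can then be stitched together. A minimal sketch, assuming the pypdf package is installed (pip install pypdf); pypdf is not part of the answer above, just one way to merge:

import glob

from pypdf import PdfWriter

# stitch the per-section PDFs produced above into one document;
# sorted() gives alphabetical order here, so reorder the list
# if the sidebar's section order matters to you
writer = PdfWriter()
for pdf in sorted(glob.glob("*.pdf")):
    writer.append(pdf)
writer.write("shiny-for-python-docs.pdf")
writer.close()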

– shafee
- (+1) Thanks @shafee for a very useful answer. The given script prints all sections; however, the print quality is not good. – MYaseen208 Jan 02 '23 at 01:24
- Actually, that is the problem with automated printing, and I think even when you try to print those pages manually you don't have full control over the quality. It depends on how the website author arranged things. – shafee Jan 02 '23 at 01:28
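That said, output quality can sometimes be improved by passing wkhtmltopdf flags through pdfkit's options argument. A sketch; the particular flag values below are assumptions to experiment with, not known-good settings for this site:

import pdfkit

# pdfkit forwards these keys as wkhtmltopdf command-line flags;
# flag-only options (no value) take an empty string
options = {
    'page-size': 'A4',
    'dpi': '300',
    'print-media-type': '',  # use the site's print stylesheet, if any
    'zoom': '1.2',
}
pdfkit.from_url('https://shiny.rstudio.com/py/docs/overview.html',
                'overview.pdf', options=options)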
- On second thought, assuming the HTML files are rendered from qmd files (my guess), a script could be written to grab those qmd files and render them to PDF using Quarto. But unfortunately, I am having trouble finding the GitHub repo for these qmd source files. – shafee Jan 02 '23 at 04:17
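If those qmd sources were ever tracked down, the rendering step itself would be straightforward. A hypothetical sketch, assuming the .qmd files sit in a local docs/ directory and that the quarto CLI plus a LaTeX distribution are on PATH:

import glob
import subprocess

# render each Quarto source file to PDF; quarto writes the
# output next to the source by default
for qmd in glob.glob("docs/*.qmd"):
    subprocess.run(["quarto", "render", qmd, "--to", "pdf"], check=True)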