
According to the docs, Chrome can be started in headless mode with --print-to-pdf in order to export a PDF of a web page. This works well for pages accessible with a GET request.

I'm trying to find a print-to-PDF solution that would let me export a PDF after executing multiple navigation requests from within Chrome. Example: open google.com, input a search query, click the first result link, export to PDF.

Looking at the [very limited amount of available] docs and samples, I failed to find a way to instruct Chrome to export a PDF after a page loads. I'm using the Java chrome-driver.

One possible solution that does not involve Chrome is to use a tool like wkhtmltopdf. Going down this path would force me to do the following before sending the HTML to the tool:

  • save the HTML in a local file
  • traverse the DOM, and download all file links (images, js, css, etc)

I'd rather avoid this path, as it would require a lot of tinkering [I assume] on my part to get the downloaded files' paths right so that wkhtmltopdf can read them.

Is there a way to instruct Chrome to print to PDF, but only after a page loads?

jankovd
  • Can you share your code trials? – undetected Selenium Nov 21 '17 at 07:42
  • Nothing to share as I got nowhere with my attempts. But can explain my process. It involved basically trying any [Chrome preferences](https://src.chromium.org/viewvc/chrome/trunk/src/chrome/common/pref_names.cc?view=markup) that made sense at that time to me, in order to force Chrome to print to PDF after `window.print()` is executed. Looked also at the [command line switches](https://peter.sh/experiments/chromium-command-line-switches/) but those were of no help to me also. – jankovd Nov 21 '17 at 09:58

4 Answers


This is indeed possible to do through Selenium Chromedriver, by means of the ExecuteChromeCommandWithResult method. When executing the command Page.printToPDF, a base-64-encoded PDF document is returned in the "data" item of the result dictionary.
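Since the question is about Java, here is a minimal sketch of the decoding step only. It assumes you already hold the result of a `Page.printToPDF` call as a `Map<String, Object>`; the class and method names (`PdfResultWriter`, `decodePdf`, `writePdf`) are made up for illustration and are not part of Selenium.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class PdfResultWriter {

    // Decodes the base-64 "data" entry of a Page.printToPDF result (hypothetical helper).
    static byte[] decodePdf(Map<String, Object> result) {
        Object data = result.get("data");
        if (!(data instanceof String)) {
            throw new IllegalArgumentException("result has no string \"data\" entry");
        }
        return Base64.getDecoder().decode((String) data);
    }

    // Writes the decoded PDF bytes to the given path, replacing any existing file.
    static void writePdf(Map<String, Object> result, Path target) throws IOException {
        Files.write(target, decodePdf(result));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real result: only the PDF header, base-64 encoded.
        Map<String, Object> fakeResult = new HashMap<>();
        fakeResult.put("data", Base64.getEncoder()
                .encodeToString("%PDF-1.4".getBytes(StandardCharsets.US_ASCII)));

        Path target = Files.createTempFile("export", ".pdf");
        writePdf(fakeResult, target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

With a real driver, the map passed to `writePdf` would be whatever the driver returns for the `Page.printToPDF` command.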

A C# example, which should be easy to translate into Java, is available in this answer:

https://stackoverflow.com/a/58698226/2416627

Here is another C# example, which illustrates some useful options:

using System;
using System.Collections.Generic;
using System.IO;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class Program
{
    public static void Main(string[] args)
    {
        var driverOptions = new ChromeOptions();
        // In headless mode, PDF writing is enabled by default (tested with driver major version 85)
        driverOptions.AddArgument("headless");
        using (var driver = new ChromeDriver(driverOptions))
        {
            driver.Navigate().GoToUrl("https://stackoverflow.com/questions");
            // Wait up to 10 seconds until the question list is present
            new WebDriverWait(driver, TimeSpan.FromSeconds(10)).Until(d => d.FindElements(By.CssSelector("#questions")).Count == 1);
            // Output a PDF of the first page in A4 size at 90% scale (paper size is given in inches)
            var printOptions = new Dictionary<string, object>
            {
                { "paperWidth", 210 / 25.4 },
                { "paperHeight", 297 / 25.4 },
                { "scale", 0.9 },
                { "pageRanges", "1" }
            };
            var printOutput = driver.ExecuteChromeCommandWithResult("Page.printToPDF", printOptions) as Dictionary<string, object>;
            // The "data" item holds the PDF as a base-64 string
            var pdf = Convert.FromBase64String(printOutput["data"] as string);
            File.WriteAllBytes("stackoverflow-page-1.pdf", pdf);
        }
    }
}

The options available for the Page.printToPDF call are documented here:

https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-printToPDF
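For Java users, the same option set can be built as a plain map. This is only a sketch: the parameter names are the ones used in the C# example, the paper size is converted from millimetres to inches, and `PrintOptions` and `a4FirstPage` are illustrative names, not library API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrintOptions {

    private static final double MM_PER_INCH = 25.4;

    // Builds Page.printToPDF parameters for the first page in A4 size at the given scale.
    static Map<String, Object> a4FirstPage(double scale) {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("paperWidth", 210 / MM_PER_INCH);   // A4 width: 210 mm, in inches
        params.put("paperHeight", 297 / MM_PER_INCH);  // A4 height: 297 mm, in inches
        params.put("scale", scale);
        params.put("pageRanges", "1");
        return params;
    }

    public static void main(String[] args) {
        // Same values as the C# example: A4, 90% scale, first page only.
        System.out.println(a4FirstPage(0.9));
    }
}
```

The resulting map could then be passed as the parameter argument of whichever CDP call your driver version offers.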

Otto G

As there are no answers, I will explain my workaround. Instead of trying to make Chrome print the current page, I went down another route.

For this example we will try to download the results page from Google on the query 'example':

  1. Navigate with driver.get("https://google.com"), input the query 'example', click 'Google Search'
  2. Wait for the results page to load
  3. Retrieve the page source with driver.getPageSource()
  4. Parse the source, e.g. with Jsoup, and remap all relative links to point to an endpoint defined for this purpose (explained below), for example localhost:8080. The link './style.css' would become 'http://localhost:8080/style.css'
  5. Save the HTML to a file, e.g. named 'query-example'
  6. Run chrome --headless --print-to-pdf "http://localhost:8080/search/html?id=query-example"
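The remapping in step 4 can be sketched with the JDK alone. This is not the Jsoup-based code from the workaround, just an illustration of how a single link value might be rewritten; `LinkRemapper` and `remap` are hypothetical names, and finding the links in the HTML is left to the parser.

```java
import java.net.URI;

public class LinkRemapper {

    // Resolves a relative link against the local endpoint; absolute links are returned unchanged.
    // Note: URI.create throws IllegalArgumentException for links with illegal characters such as spaces.
    static String remap(String endpoint, String link) {
        URI target = URI.create(link);
        if (target.isAbsolute()) {
            return link;
        }
        return URI.create(endpoint).resolve(target).toString();
    }

    public static void main(String[] args) {
        String endpoint = "http://localhost:8080/";
        System.out.println(remap(endpoint, "./style.css"));
        System.out.println(remap(endpoint, "/images/logo.png"));
        System.out.println(remap(endpoint, "https://cdn.example.com/app.js"));
    }
}
```

Each rewritten value would then be written back into the corresponding `href` or `src` attribute before the HTML is saved.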

Chrome will request the HTML from our controller. For every resource referenced in that HTML it will again go to our controller, since we remapped the relative links, and the controller will redirect the request to the real location of the resource on google.com. Below is an example Spring controller. Note that the example is incomplete and is meant only as guidance.

@RestController
@RequestMapping
public class InternationalOffloadRestController {
  // Serves the saved HTML file identified by "id"
  @RequestMapping(method = RequestMethod.GET, value = "/search/html")
  public String getHtml(@RequestParam("id") String id) throws IOException {
    File file = new File("location of the HTML file", id);
    try (FileInputStream input = new FileInputStream(file)) {
      return IOUtils.toString(input, HTML_ENCODING);
    }
  }

  @RequestMapping("/**") // redirect all remapped links to google.com
  public void forward(HttpServletRequest request, HttpServletResponse httpServletResponse, ...) throws URISyntaxException {
    // Same path and query string, but on the original host
    URI uri = new URI("https", null, "google.com", -1,
      request.getRequestURI(), request.getQueryString(), null);
    httpServletResponse.setHeader("Location", uri.toString());
    httpServletResponse.setStatus(HttpServletResponse.SC_MOVED_PERMANENTLY);
  }
}
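To make the redirect target concrete, here is a small standalone sketch of the URI construction used in the controller, without any servlet or Spring types. `RedirectTarget` and `toOrigin` are illustrative names, and the path and query values are made-up inputs.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class RedirectTarget {

    // Rebuilds a request's path and query string on the original host over https.
    static URI toOrigin(String host, String path, String query) throws URISyntaxException {
        // Port -1 means "no explicit port"; a null query or fragment is simply omitted.
        return new URI("https", null, host, -1, path, query, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(toOrigin("google.com", "/style.css", "v=1"));
        System.out.println(toOrigin("google.com", "/images/logo.png", null));
    }
}
```

One thing to watch for: as far as I know, the multi-argument `URI` constructors always quote `%` characters, so a path that is already percent-encoded may end up double-encoded in the `Location` header.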
Diego Montania
jankovd
  • Any update on this? It seems odd that this isn't an available command on every navigation. – N-ate Jul 18 '18 at 19:07
  • I've done it with a shell script... download the page_source, grep out the links to css and images, download those to directory, sed to point links to the local files, then wkhtmltopdf – J. Win. Aug 21 '18 at 06:43
  • @J.Win. please consider sharing your script, someone visiting might find it useful. – jankovd Aug 23 '18 at 09:36
  • What if there's a dubious style.css that's like 20GB on the server? Haha you'd promptly be defeated. – TheRealChx101 Sep 16 '20 at 07:22

Here is an example of doing this from the command line. It takes a little tinkering with the page HTML using sed:

LOGIN='myuserid'
PASSW='mypasswd'
# Double quotes, so that $LOGIN and $PASSW are expanded
AUTH="pin=$LOGIN&accessCode=$PASSW&Submit=Submit"
TIMESTAMP=$(TZ=HST date -d "today" +"%m/%d/%y %I:%M %p HST")
wget -q --save-cookies cookies.txt --keep-session-cookies \
    --post-data "$AUTH" \
    https://csea.ehawaii.gov/iwa/index.html
sed -i 's#href="/iwa/css#href="./bin#g' index.html
sed -i 's#src="/iwa/images#src="./bin#g' index.html
wkhtmltopdf -q --print-media-type \
            --header-left "$d" --header-font-size 10 \
            --header-line --header-spacing 10 \
            --footer-left "Page [page] of [toPage]" --footer-font-size 10 \
            --footer-line --footer-spacing 10 \
            --footer-right "$TIMESTAMP" \
            --margin-bottom 20 --margin-left 15 \
            --margin-top 20 --margin-right 15 \
            index.html index.pdf

Assuming valid cookies, further pages available after login could be accessed like this:

wget -q --load-cookies cookies.txt https://csea.ehawaii.gov/otherpage.html
wkhtmltopdf <all the options> otherpage.html otherpage.pdf

Also, I had previously dumped all the css and images in a local bin directory, something like this:

wget -r -A.jpg -A.gif -A.css -nd -Pbin \
    https://csea.ehawaii.gov/iwa/index.html
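If the rest of your pipeline is in Java, the two sed substitutions from this answer could be expressed as plain string replacements. This is only a sketch of that one step; `LocalAssetPaths` and `pointToLocalBin` are illustrative names, and the file names in the sample HTML are invented.

```java
public class LocalAssetPaths {

    // Rewrites stylesheet and image links so that they point into the local ./bin directory.
    static String pointToLocalBin(String html) {
        return html
                .replace("href=\"/iwa/css", "href=\"./bin")
                .replace("src=\"/iwa/images", "src=\"./bin");
    }

    public static void main(String[] args) {
        // Hypothetical fragment of the downloaded index.html
        String html = "<link href=\"/iwa/css/site.css\"><img src=\"/iwa/images/logo.gif\">";
        System.out.println(pointToLocalBin(html));
    }
}
```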
J. Win.

This can be achieved with ChromiumDriver from the Java Selenium 4.x.x releases, by sending the Page.printToPDF command and decoding the base-64 "data" entry of the result:

String command = "Page.printToPDF";
Map<String, Object> params = new HashMap<>();
params.put("landscape", false);
// driver is a ChromiumDriver, e.g. a ChromeDriver
Map<String, Object> output = driver.executeCdpCommand(command, params);
// try-with-resources closes the stream even if the write fails
try (FileOutputStream fileOutputStream = new FileOutputStream("export.pdf")) {
    byte[] byteArray = Base64.getDecoder().decode((String) output.get("data"));
    fileOutputStream.write(byteArray);
} catch (IOException e) {
    e.printStackTrace();
}

Source: Selenium_CDP

Smile