116

I am working on generating a PDF from a web page.

The application I am working on is a single-page application.

I tried many of the options and suggestions at https://github.com/GoogleChrome/puppeteer/issues/1412

but none of them work:

const browser = await puppeteer.launch({
    executablePath: 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
    ignoreHTTPSErrors: true,
    headless: true,
    devtools: false,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
});

const page = await browser.newPage();

await page.goto(fullUrl, {
    waitUntil: 'networkidle2'
});

await page.type('#username', 'scott');
await page.type('#password', 'tiger');

await page.click('#Login_Button');
await page.waitFor(2000);

await page.pdf({
    path: outputFileName,
    displayHeaderFooter: true,
    headerTemplate: '',
    footerTemplate: '',
    printBackground: true,
    format: 'A4'
});

What I want is to generate the PDF report as soon as the page has loaded completely.

I don't want to add any kind of delay, e.g. await page.waitFor(2000);

I cannot use waitForSelector because the page has charts and graphs that are rendered after calculations.

Any help will be appreciated.

i.brod
n.sharvarish
  • I have tried all suggested solutions. With Node.js Puppeteer nothing worked. I switched to a Python script to load the HTML, wait some seconds for the JS to load external elements / generate graphs, and then generate the PDF. – W.M. Feb 09 '23 at 19:02

15 Answers

126

You can use page.waitForNavigation() to wait for the new page to load completely before generating a PDF:

await page.goto(fullUrl, {
  waitUntil: 'networkidle0',
});

await page.type('#username', 'scott');
await page.type('#password', 'tiger');

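// Start waiting for the navigation before clicking; otherwise the
// navigation may finish before waitForNavigation() starts listening.
await Promise.all([
  page.waitForNavigation({
    waitUntil: 'networkidle0',
  }),
  page.click('#Login_Button'),
]);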

await page.pdf({
  path: outputFileName,
  displayHeaderFooter: true,
  headerTemplate: '',
  footerTemplate: '',
  printBackground: true,
  format: 'A4',
});

If there is a certain element that is generated dynamically that you would like included in your PDF, consider using page.waitForSelector() to ensure that the content is visible:

await page.waitForSelector('#example', {
  visible: true,
});
Grant Miller
  • 4
    Where is the documentation for the signal 'networkidle0'? – Chilly Code Aug 27 '19 at 17:31
  • 8
    'networkidle0' is documented here https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagegotourl-options – diegoubi Oct 04 '19 at 20:31
  • 1
    Should `page.waitForSelector` be called after `page.goto` or before? Could you answer a similar question I asked https://stackoverflow.com/questions/58909236/pupeteer-script-does-not-wait-for-the-selector-to-get-loaded-and-i-get-a-blank-h ? – Amanda Nov 18 '19 at 07:01
  • 3
    Why would I use networkidle0 when I could use the default load event? Is it faster to use networkidle0? – Gary Feb 06 '21 at 09:29
  • If you're clicking something that triggers navigation, there's a race condition if [`Promise.all isn't used`](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pageclickselector-options), e.g. `Promise.all([page.click(...), page.waitForNavigation(...)])` – ggorlen Mar 16 '22 at 03:23
  • @Gary See [this comment](https://github.com/puppeteer/puppeteer/issues/1666#issuecomment-354224942) by a (former) core Puppeteer developer. – ggorlen Mar 16 '22 at 03:59
  • I had the same problem. I use format: "A4". My solution was to not use the scale (<1) option. – Bergi Jun 14 '22 at 11:25
  • page.waitForNavigation() & page.waitForSelector() link are dead – Lucas Bodin Oct 29 '22 at 12:03
  • `networkidle0`: consider navigation to be finished when there are no more than 0 network connections for at least `500` ms. See https://pptr.dev/api/puppeteer.page.goto/#remarks and https://pptr.dev/api/puppeteer.puppeteerlifecycleevent/ – Nor.Z Feb 20 '23 at 08:31
109

The networkidle events do not always indicate that the page has completely loaded. A few scripts may still be modifying the content on the page. Watching for the browser to stop modifying the HTML source seems to yield better results. Here's a function you could use:

const waitTillHTMLRendered = async (page, timeout = 30000) => {
  const checkDurationMsecs = 1000; // polling interval
  const maxChecks = timeout / checkDurationMsecs;
  const minStableSizeIterations = 3; // consecutive unchanged checks required
  let lastHTMLSize = 0;
  let checkCounts = 1;
  let countStableSizeIterations = 0;

  while (checkCounts++ <= maxChecks) {
    const html = await page.content();
    const currentHTMLSize = html.length;

    // Logged for diagnostics only; it does not affect the stability check.
    const bodyHTMLSize = await page.evaluate(() => document.body.innerHTML.length);

    console.log('last: ', lastHTMLSize, ' <> curr: ', currentHTMLSize, ' body html size: ', bodyHTMLSize);

    if (lastHTMLSize !== 0 && currentHTMLSize === lastHTMLSize) {
      countStableSizeIterations++;
    } else {
      countStableSizeIterations = 0; // reset the counter
    }

    if (countStableSizeIterations >= minStableSizeIterations) {
      console.log('Page rendered fully..');
      break;
    }

    lastHTMLSize = currentHTMLSize;
    // page.waitForTimeout() is deprecated (and removed in later Puppeteer releases); a plain timer works everywhere.
    await new Promise(resolve => setTimeout(resolve, checkDurationMsecs));
  }
};

You could use this after the page load / click function call and before you process the page content. e.g.

await page.goto(url, {'timeout': 10000, 'waitUntil':'load'});
await waitTillHTMLRendered(page)
const data = await page.content()
Arel
Anand Mahajan
  • 15
    I'm not sure why this answer hasn't gotten more "love". In reality, a lot of the time we really just need to make sure JavaScript is done messing with the page before we scrape it. Network events don't accomplish this, and if you have dynamically generated content, there isn't always something you can reliably do a "waitForSelector/visible:true" on – Jason May 14 '20 at 14:35
  • Thanks @roberto - btw I just updated the answer, you could use this with the 'load' event rather than 'networkidle2' . Thought it would be little more optimal with that. I have tested this in production and can confirm it works well too! – Anand Mahajan Sep 20 '20 at 16:08
  • I tried setting `checkDurationMsecs` to 200 ms, and the bodyHTMLSize keeps changing and gives huge numbers. I am using Electron and React as well; very strange. – Ambroise Rabier Apr 29 '21 at 13:15
  • OK, I found that ridiculously hard-to-catch bug. If you manage to capture that 100k-long HTML page, you notice CSS classes like `CodeMirror` (which must be https://codemirror.net/), meaning `document.body.innerHTML` is capturing the dev console too! Just remove `mainWindow.webContents.openDevTools();` for e2e testing. I hope I don't get any more bad surprises. – Ambroise Rabier Apr 29 '21 at 13:41
  • This was the answer for rendering large HTML files out to PDF using Puppeteer and PagedJS. The pagedJS polyfill was still faffing with the content once network traffic stopped so the pdf request was kicking off without all of the content rendered. Thank you. – Mike Smith Feb 03 '23 at 10:14
53

In some cases, the best solution for me was:

await page.goto(url, { waitUntil: 'domcontentloaded' });

Some other options you could try are:

await page.goto(url, { waitUntil: 'load' });
await page.goto(url, { waitUntil: 'domcontentloaded' });
await page.goto(url, { waitUntil: 'networkidle0' });
await page.goto(url, { waitUntil: 'networkidle2' });

You can check these options in the Puppeteer documentation: https://pptr.dev/#?product=Puppeteer&version=v11.0.0&show=api-pagewaitfornavigationoptions

Eduardo Conte
  • 3
    This doesn't ensure that any scripts loaded have finished executing. Therefore HTML could still be rendering and this would proceed. – AbuZubair Nov 23 '20 at 01:12
  • 3
    For those who are confused by these options, `domcontentloaded` is the first one to fire, so you generally use it when you want to move on with your script before any external resources load. Typically, this is because you don't want data from them. `load`, `networkidle2` and `networkidle0` offer different flavors of waiting for resources in roughly increasing strictness, but none of them provide an exact guarantee that "the page is loaded" (because this varies from site to site, so it's ill-defined in general). – ggorlen Jun 07 '22 at 18:09
  • Does this work with `page.click` ? – FreelanceConsultant Feb 05 '23 at 20:53
  • domcontentloaded - worked for me also. tks! – Mr Special Aug 17 '23 at 10:42
39

I always like to wait for selectors, as many of them are a great indicator that the page has fully loaded:

await page.waitForSelector('#blue-button');
Nicolás A.
  • You are a genius. This is such an obvious solution, especially when you are waiting for specific elements, and I don't know how I didn't think of it myself. Thank you! – Arch4Arts Feb 28 '21 at 12:54
  • @Arch4Arts you should create your own clicking function that does the waiting for you as well as clicking – Nicolás A. Apr 02 '21 at 20:01
11

In the latest Puppeteer version, networkidle2 worked for me:

await page.goto(url, { waitUntil: 'networkidle2' });
attacomsian
10

Wrap the page.click and page.waitForNavigation in a Promise.all

  await Promise.all([
    page.click('#submit_button'),
    page.waitForNavigation({ waitUntil: 'networkidle0' })
  ]);
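
A small wrapper can make this pattern harder to get wrong. The sketch below is a hypothetical helper (`clickAndWaitForNavigation` is not a Puppeteer API). It assumes the click really triggers a full navigation, which may not hold in a single-page application. Both promises are created before either is awaited, so the navigation listener is already registered when the click fires.

```javascript
// Hypothetical helper, not part of Puppeteer: registers the navigation
// waiter and issues the click together so the navigation cannot be missed.
async function clickAndWaitForNavigation(page, selector, waitUntil = 'networkidle0') {
  const [response] = await Promise.all([
    page.waitForNavigation({ waitUntil }), // start listening first
    page.click(selector),                  // then trigger the navigation
  ]);
  // Main-frame response, or null for anchor / History API navigations.
  return response;
}

// Usage sketch (selector taken from the question):
//   await clickAndWaitForNavigation(page, '#Login_Button');
```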
Mark Swardstrom
  • 2
    Is `page.waitForNavigation({ waitUntil: 'networkidle0' })` the same as `page.waitForNetworkIdle()`? – milos Oct 20 '21 at 11:50
6

I encountered the same issue with networkidle when I was working on an offscreen renderer. I needed a WebGL-based engine to finish rendering and only then make a screenshot. What worked for me was a page.waitForFunction() method. In my case the usage was as follows:

await page.goto(url);
await page.waitForFunction("renderingCompleted === true")
const imageBuffer = await page.screenshot({});

In the rendering code, I was simply setting the renderingCompleted variable to true, when done. If you don't have access to the page code you can use some other existing identifier.
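
As a rough sketch of the same idea with an explicit timeout: the helper name `waitForFlag` and the page-side function `drawAllCharts` are made up for illustration, and the sketch assumes the flag is set as a property of `window` (a top-level `let` or `const` would not be visible this way).

```javascript
// Page side (assumption: you control the rendering code, and
// drawAllCharts is a hypothetical function returning a promise):
//   drawAllCharts().then(() => { window.renderingCompleted = true; });

// Puppeteer side. `waitForFlag` is a hypothetical helper name.
function waitForFlag(page, flagName, timeout = 30000) {
  return page.waitForFunction(
    (name) => window[name] === true, // runs in the browser context
    { timeout },                     // the wait rejects if this is exceeded
    flagName                         // serialized and passed to the predicate
  );
}

// Usage sketch:
//   await page.goto(url);
//   await waitForFlag(page, 'renderingCompleted', 60000);
//   await page.pdf({ path: outputFileName, format: 'A4' });
```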

Dharman
Tali Oat
5

Answers so far haven't mentioned a critical fact: it's impossible to write a one-size-fits-all waitUntilPageLoaded function that works on every page. If it were possible, Puppeteer would surely provide it.

Such a function can't rely on a timeout, because there's always some page that takes longer to load than that timeout. As you extend the timeout to reduce the failure rate, you introduce unnecessary delays when working with fast pages. Timeouts are generally a poor solution, opting out of Puppeteer's event-driven model.

Waiting for idle network requests might not always work if the responses involve long-running DOM updates that take longer than 500ms to trigger a render.

Waiting for the DOM to stop changing might miss slow network requests, long-delayed JS triggers, or ongoing DOM manipulation that might cause the listener never to settle, unless specially handled.

And, of course, there's user interaction: captchas, prompts and cookie/subscription modals that need to be clicked through and dismissed before the page is in a sensible state for a full-page screenshot (for example).

Since every page has different, arbitrary JS behavior, the typical approach is to write event-driven logic that works for a specific page. Making precise, directed assumptions is much better than cobbling together a boatload of hacks that tries to solve every edge case.

If your use case is to write a load event that works on every page, my suggestion is to use some combination of the tools described here that is most balanced to meet your needs (speed vs. accuracy, development time/code complexity vs. accuracy, etc.). Use fail-safes for everything rather than blindly assuming all pages will cooperate with your assumptions. Think hard about to what extent you really need to handle every web page. Prepare to compromise and accept some degree of failure you can live with.
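
One possible shape for such a fail-safe, as a minimal sketch: `bestEffort` is a hypothetical helper (not a Puppeteer API) that runs a single waiting strategy and reports whether it settled instead of letting a timeout abort the script. The selector and timeout values in the usage comment are arbitrary.

```javascript
// Hypothetical helper: run one waiting strategy, but never let its failure
// abort the whole script. Returns true if the wait settled, false otherwise.
async function bestEffort(label, wait) {
  try {
    await wait();
    return true;
  } catch (err) {
    console.warn(`${label} did not settle: ${err.message}`);
    return false;
  }
}

// Example combination (selector and timeouts are illustrative only):
//   await page.goto(url, { waitUntil: 'domcontentloaded' });
//   await bestEffort('network idle', () => page.waitForNetworkIdle({ timeout: 5000 }));
//   await bestEffort('chart', () => page.waitForSelector('#chart canvas', { timeout: 5000 }));
```

Whether to continue after a `false` result or to give up is a per-project decision; the helper only makes the failure explicit.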


Here's a quick rundown of the strategies you can mix and match to wait for loads to fit your needs:

page.goto() and page.waitForNavigation() default to the load event, which "is fired when the whole page has loaded, including all dependent resources such as stylesheets and images" (MDN), but this is often too pessimistic; there's no need to wait for a ton of data you don't care about. Often the data is available without waiting for all external resources, so domcontentloaded should be faster. See my post Avoiding Puppeteer Antipatterns for further discussion.

On the other hand, if there are JS-triggered network requests after load, you'll miss that data. Hence networkidle2 and networkidle0, which wait until there have been no more than 2 or 0 active network connections, respectively, for at least 500 ms. The motivation for the 2 version is that some sites keep ongoing requests open, which would cause networkidle0 to time out.

If you're waiting for a specific network response that might have a payload (or, for the general case, implementing your own network idle monitor), use page.waitForResponse(). page.waitForRequest(), page.waitForNetworkIdle() and page.on("request", ...) are also useful here.

If you're waiting for a particular selector to be visible, use page.waitForSelector(). If you're waiting for a load on a specific page, identify a selector that indicates the state you want to wait for. Generally speaking, for scripts specific to one page, this is the main tool to wait for the state you want, whether you're extracting data or clicking something. Frames and shadow roots thwart this function.

page.waitForFunction() lets you wait for an arbitrary predicate, for example, checking that the page's HTML or a specific list is a certain length. It's also useful for quickly dipping into frames and shadow roots to wait for predicates that depend on nested state. This function is also handy for detecting DOM mutations.

The most general tool is page.evaluate(), which plugs code into the browser. You can put just about any conditions you want here; most other Puppeteer functions are convenience wrappers for common cases you could implement by hand with evaluate.
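
Two of the strategies above might be sketched as follows. The helper names `responseMatches` and `waitForItemCount`, the URL fragment `/api/report`, and the `.chart` selector are all assumptions made for illustration; substitute whatever identifies the request and elements on your page.

```javascript
// Builds a predicate for page.waitForResponse(). The URL fragment is whatever
// identifies the request your page makes; '/api/report' below is made up.
const responseMatches = (urlPart) => (response) =>
  response.url().includes(urlPart) && response.ok();

// Waits until at least `min` elements match `selector`.
// Hypothetical helper built on page.waitForFunction().
const waitForItemCount = (page, selector, min, timeout = 30000) =>
  page.waitForFunction(
    (sel, n) => document.querySelectorAll(sel).length >= n, // runs in the browser
    { timeout },
    selector,
    min
  );

// Usage sketch:
//   const [response] = await Promise.all([
//     page.waitForResponse(responseMatches('/api/report')),
//     page.click('#Login_Button'),
//   ]);
//   await waitForItemCount(page, '.chart', 3);
```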

ggorlen
4

You can also use the following to ensure all elements have rendered:

await page.waitFor('*')

Reference: https://github.com/puppeteer/puppeteer/issues/1875

Phat Tran
  • 3
    `waitFor` is deprecated and will be removed in a future release. See https://github.com/puppeteer/puppeteer/issues/6214 for details and how to migrate your code. – kenberkeley Dec 16 '20 at 23:59
4

As of December 2020, the waitFor function is deprecated, as the warning in the code says:

waitFor is deprecated and will be removed in a future release. See https://github.com/puppeteer/puppeteer/issues/6214 for details and how to migrate your code.

You can use:

function sleep(millisecondsCount) {
    if (!millisecondsCount) {
        return;
    }
    // Resolves after the given number of milliseconds.
    return new Promise(resolve => setTimeout(resolve, millisecondsCount));
}

And use it:

(async () => {
    await sleep(1000);
})();
Or Assayag
  • 10
    just use page.waitForTimeout(1000) – Viacheslav Dobromyslov Dec 10 '20 at 15:41
  • 5
    The github issue states that they just deprecated the "magic" waitFor function. You can still use one of the specific waitFor*() functions. Hence your sleep() code is needless. (Not to mention that it’s overcomplicated for what it does, and it’s generally a bad idea to tackle concurrency problems with programmatic timeouts.) – lxg Dec 20 '20 at 13:20
3

Keeping in mind the caveat that there's no silver bullet to handle all page loads, one strategy is to monitor the DOM until it's been stable (i.e. has not seen a mutation) for more than n milliseconds. This is similar to the network idle solution but geared towards the DOM rather than requests and therefore covers a different subset of loading behaviors.

Generally, this code would follow a page.waitForNavigation({waitUntil: "domcontentloaded"}) or page.goto(url, {waitUntil: "domcontentloaded"}), but you could also wait for it alongside, say, waitForNetworkIdle() using Promise.all() or Promise.race().

Here's a simple example:

const puppeteer = require("puppeteer"); // ^14.3.0

const waitForDOMStable = (
  page,
  options={timeout: 30000, idleTime: 2000}
) =>
  page.evaluate(({timeout, idleTime}) =>
    new Promise((resolve, reject) => {
      setTimeout(() => {
        observer.disconnect();
        const msg = `timeout of ${timeout} ms ` +
          "exceeded waiting for DOM to stabilize";
        reject(Error(msg));
      }, timeout);
      const observer = new MutationObserver(() => {
        clearTimeout(timeoutId);
        timeoutId = setTimeout(finish, idleTime);
      });
      const config = {
        attributes: true,
        childList: true,
        subtree: true
      };
      observer.observe(document.body, config);
      const finish = () => {
        observer.disconnect();
        resolve();
      };
      let timeoutId = setTimeout(finish, idleTime);
    }),
    options
  )
;

const html = `<!DOCTYPE html><html lang="en"><head>
<title>test</title></head><body><h1></h1><script>
(async () => {
  for (let i = 0; i < 10; i++) {
    document.querySelector("h1").textContent += i + " ";
    await new Promise(r => setTimeout(r, 1000));
  }
})();
</script></body></html>`;

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  await page.setContent(html);
  await waitForDOMStable(page);
  console.log(await page.$eval("h1", el => el.textContent));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;

For pages that continually mutate the DOM more often than the idle value, the timeout will eventually trigger and reject the promise, following the typical Puppeteer fallback. You can set a more aggressive overall timeout to fit your needs or tailor the logic to ignore (or only monitor) a particular subtree.

ggorlen
1

I can't leave comments, but I made a Python version of Anand's answer for anyone who finds it useful (i.e. if they use pyppeteer).

from pyppeteer.page import Page


async def waitTillHTMLRendered(page: Page, timeout: int = 30000):
    check_duration_m_secs = 1000  # polling interval
    max_checks = timeout / check_duration_m_secs
    last_html_size = 0
    check_counts = 1
    count_stable_size_iterations = 0
    min_stable_size_iterations = 3  # consecutive unchanged checks required

    while check_counts <= max_checks:
        check_counts += 1
        html = await page.content()
        current_html_size = len(html)

        if last_html_size != 0 and current_html_size == last_html_size:
            count_stable_size_iterations += 1
        else:
            count_stable_size_iterations = 0  # reset the counter

        if count_stable_size_iterations >= min_stable_size_iterations:
            break

        last_html_size = current_html_size
        await page.waitFor(check_duration_m_secs)
0

For me, { waitUntil: 'domcontentloaded' } is always my go-to. I found that networkidle doesn't work well...

0

page.waitForNetworkIdle() worked for me: https://pptr.dev/api/puppeteer.page.waitfornetworkidle

I don't know why no one has mentioned it yet. If there's an actual reason, please share.
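
For an action that fires XHR/fetch requests without navigating (common in single-page applications), a minimal sketch might look like this. `clickAndWaitForNetworkIdle` is a hypothetical helper name, and the default values are arbitrary. One caveat: if the page starts its requests later than `idleTime` after the click, the wait can resolve too early.

```javascript
// Hypothetical helper for actions that trigger network traffic but no
// navigation, as is common in single-page applications.
async function clickAndWaitForNetworkIdle(page, selector, idleTime = 500, timeout = 30000) {
  await Promise.all([
    page.waitForNetworkIdle({ idleTime, timeout }), // start watching first
    page.click(selector),                           // then trigger the requests
  ]);
}

// Usage sketch (selector taken from the question):
//   await clickAndWaitForNetworkIdle(page, '#Login_Button');
//   await page.pdf({ path: outputFileName, format: 'A4' });
```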

0

Completely loaded can mean a lot of different things. Wait for all images to be loaded maybe?

await page.waitForFunction(() => ![...document.querySelectorAll('img')].find(i => !i.complete))
pguardiario