
I'm interested in the differences of these two blocks of code.

const $anchor = await page.$('a.buy-now');
const link = await $anchor.getProperty('href');
await $anchor.click();

await page.evaluate(() => {
    const $anchor = document.querySelector('a.buy-now');
    const text = $anchor.href;
    $anchor.click();
});

I've generally found raw DOM elements in page.evaluate() easier to work with, and the ElementHandles returned by the $ methods an abstraction too far.

However, I wondered whether the async Puppeteer methods might be more performant or more reliable. I couldn't find any guidance on this in the docs and would be interested in learning more about the pros and cons of each approach and the motivation behind adding methods like page.$$().

Thomas Dondorf
lpoulter

2 Answers


The main difference between the two snippets is how much interaction takes place between the Node.js environment and the browser environment.

The first code snippet will do the following:

  • Run document.querySelector in the browser and return the element handle (to the Node.js environment)
  • Run getProperty on the handle and return the result (to the Node.js environment)
  • Click an element inside the browser

The second code snippet simply does this:

  • Run the given function in the browser context (and return results to the Node.js environment)

Performance

Regarding performance, remember that Puppeteer communicates with the browser over a WebSocket. The second snippet will therefore run faster, as just one command is sent to the browser (in contrast to three).

This can make a big difference if the browser you are connecting to runs on a different machine (connected via puppeteer.connect). If the script and the browser are on the same machine, the difference will likely be only a few milliseconds.
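To check the round-trip cost on your own setup, a rough timing sketch along these lines might help. It assumes `page` is a Puppeteer Page that has already loaded a document containing `a.buy-now`; the helper names `viaHandles`, `viaEvaluate` and `timeIt` are made up for illustration. Both variants only read the `href` (no click) so they can be repeated safely.

```javascript
// Several protocol messages: query, read the property, unwrap the value.
async function viaHandles(page, selector) {
  const anchor = await page.$(selector);
  const hrefHandle = await anchor.getProperty("href");
  return hrefHandle.jsonValue();
}

// A single evaluate call: everything happens inside the browser.
function viaEvaluate(page, selector) {
  return page.evaluate(
    (sel) => document.querySelector(sel).href,
    selector
  );
}

// Hypothetical helper: average wall-clock milliseconds of `fn` over `runs` calls.
async function timeIt(fn, runs = 20) {
  const start = performance.now();
  for (let i = 0; i < runs; i++) {
    await fn();
  }
  return (performance.now() - start) / runs;
}
```

Calling `await timeIt(() => viaHandles(page, "a.buy-now"))` and `await timeIt(() => viaEvaluate(page, "a.buy-now"))` against a local and a remote browser should give a feel for how much the extra messages matter in each case.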

Advantage of using element handles

Using element handles has some advantages. Functions like elementHandle.click behave more "human-like" than document.querySelector('...').click(). Puppeteer will, for example, move the mouse to the element and click in its center instead of just executing the click function.
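One way to observe this difference is to look at `event.isTrusted` for each kind of click. The following is only a sketch: `clickIsTrusted` and the `window.__lastClickTrusted` global are hypothetical names, and it assumes `page` is a Puppeteer Page with a visible element matching `selector`.

```javascript
// Sketch: report whether a click on `selector` was trusted.
// `mode` is "handle" (Puppeteer mouse click) or "dom" (el.click() in the page).
async function clickIsTrusted(page, selector, mode) {
  // Install a one-time listener that records event.isTrusted on window.
  await page.$eval(selector, (el) => {
    el.addEventListener(
      "click",
      (event) => {
        event.preventDefault(); // stay on the page so the flag can be read
        window.__lastClickTrusted = event.isTrusted; // hypothetical global
      },
      { once: true }
    );
  });

  if (mode === "handle") {
    const handle = await page.$(selector);
    await handle.click(); // mouse click at the element's center
  } else {
    await page.$eval(selector, (el) => el.click()); // synthetic click
  }

  return page.evaluate(() => window.__lastClickTrusted);
}
```

On a real page I'd expect the "handle" mode to resolve to true and the "dom" mode to false, since el.click() dispatches a synthetic event rather than simulated mouse input.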

When to use what

In general, I recommend using page.evaluate whenever possible, as this API is also a lot easier to debug. When an error happens, you can reproduce it by opening DevTools in your Chrome browser and rerunning the same lines there. If you mix a lot of page.$ statements together, it can be much harder to understand what the problem is and whether it happened in the Node.js or the browser runtime.

Use element handles if you need the element for longer (for example, because you have to make some complex calculations or wait for an external event before you can extract information from it).
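As a minimal sketch of that "hold on to the element" case: the function name `readAfterExternalEvent` and the `externalEvent` promise are assumptions for illustration, and `page` is assumed to be a Puppeteer Page.

```javascript
// Sketch: keep one element around while waiting for something outside the page.
async function readAfterExternalEvent(page, selector, externalEvent) {
  const handle = await page.$(selector);
  if (!handle) {
    throw new Error(`No element matches ${selector}`);
  }
  try {
    await externalEvent; // e.g. a timer, a webhook, or a file appearing on disk
    // Runs in the browser against the same element that was found earlier.
    return await handle.evaluate((el) => el.textContent.trim());
  } finally {
    await handle.dispose(); // release the reference held in the browser
  }
}
```

Disposing the handle in `finally` keeps the browser from holding the reference longer than needed, even when the wait or the read throws.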

Thomas Dondorf
  • Wow, wow! Excellent response, my dear. Why the hell are these things not discussed in the official documentation? Otherwise, send me the extract and I'll shut my fingers from now on. Thanks for such clarification. – 1antares1 Jun 27 '19 at 15:24
  • Is it possible to somehow highlight the html-element in chrome, when I set a breakpoint in vscode and I only have the element-handle or do I have to use evaluate? – Ini Jul 10 '19 at 11:57

To elaborate on Thomas' excellent answer, I'd like to offer some optimizations and additional considerations.

In the first snippet, there's a risk that the ElementHandle goes stale (the referenced element is removed from the document) between querying and usage. In Playwright, the recommended approach using locators avoids stale handles by re-querying the element on every action (Puppeteer has introduced its own locators as of 20.6.0). Following this principle, the trusted event example should be written like this:

const selector = "a.buy-now";
const link = await page.$eval(selector, el => el.getAttribute("href"));
await page.click(selector); // re-query the element at each usage

The double query may feel wrong since programmers are accustomed to DRY code, but it is more reliable in a real-time web scraping context and generally has negligible performance impact. Performance is usually determined by the number of page.* calls, each of which crosses the process boundary, rather than by the number of queries run inside the browser, which do not.

In the second snippet, text is never returned from the callback executed in the browser. Furthermore, page.$eval is shorthand for the common-case scenario where document.querySelector() is run as the first thing in the evaluate callback. So, assuming you don't need a trusted event, I'd write this code as:

const link = await page.$eval("a.buy-now", el => {
  const {href} = el;
  el.click();
  return href;
});

As far as when to use each one, it depends on the goals of the script. In a testing context, the first is preferable since it emulates user actions more accurately, at the cost of speed. But if the trusted event version fails or the script runs in a scraping context where there's less motivation to treat the site as a user might, switch to the evaluate version.

Another consideration is that many scraping scripts start out as vanilla JS code written in the dev tools console. It's generally fine to treat Puppeteer as a wrapper on the browser console and plop your existing browser code into an evaluate or two alongside some waitForSelectors as needed. I've seen a good number of Puppeteer scripts in a state of confusion because the author tried to translate their working console code to page.* calls, thinking they had to. In these cases, you may prefer the evaluate approach, or only partially convert the browser code to Puppeteer code on an as-needed basis for specific trusted events.
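To illustrate the "wrapper around console code" idea, here is a hedged sketch. The `.product`, `h2` and `a` selectors and the `scrapeProducts` name are hypothetical, standing in for whatever already works in your console; `page` is assumed to be a Puppeteer Page.

```javascript
// Sketch: wait for the content, then run console-style code in one evaluate.
async function scrapeProducts(page) {
  await page.waitForSelector(".product"); // hypothetical selector

  return page.evaluate(() => {
    // Everything in here is plain browser code, as it would run in the console.
    return [...document.querySelectorAll(".product")].map((el) => ({
      title: el.querySelector("h2")?.textContent.trim(),
      href: el.querySelector("a")?.href,
    }));
  });
}
```

The callback has to return plain serializable data (strings, numbers, arrays, objects), since DOM nodes themselves can't be passed back to Node.js this way.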

Another bonus point for the evaluate approach is that browser code is mostly synchronous, which is generally easier to write and debug. Puppeteer's async API has a ton of footguns: it's easy to create race conditions or encounter weird errors when you forget to await something. Being in an evaluate also lets you take advantage of jQuery and other browser tools that make scraping easier, offering features like Sizzle selectors that Puppeteer doesn't support.
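The forgotten-await footgun can be shown without a browser at all. In this small sketch, `fakeEval` is a made-up stand-in for any async Puppeteer call such as page.$eval:

```javascript
// A stand-in for any async Puppeteer call such as page.$eval (no browser needed).
async function fakeEval() {
  return "https://example.com/buy";
}

async function demo() {
  const forgotten = fakeEval();     // missing await: this is a pending Promise
  const awaited = await fakeEval(); // the resolved string
  return {
    forgottenIsPromise: forgotten instanceof Promise,
    awaited,
  };
}
```

Here `forgotten` is a Promise rather than the string, so something like `forgotten.startsWith("https")` would throw a TypeError, and any action that depends on the call having finished may run too early.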

Speaking more generally, I try to avoid ElementHandles, which are returned by page.$, page.$$ and page.evaluateHandle. They're inherently racy, make the code harder to read, and are slower than direct page.$eval, page.$$eval and page.evaluates. I use them only when trusted events are necessary, or they're unavoidable for some other reason.
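As a rough comparison of the two styles for a multi-element query, consider the sketch below. The function names are hypothetical and `page` is assumed to be a Puppeteer Page.

```javascript
// Handle-based: one query, then at least one round trip per element.
async function hrefsViaHandles(page, selector) {
  const handles = await page.$$(selector);
  const hrefs = [];
  for (const handle of handles) {
    hrefs.push(await handle.evaluate((el) => el.href));
    await handle.dispose();
  }
  return hrefs;
}

// $$eval: a single call; the mapping runs inside the browser.
function hrefsViaEval(page, selector) {
  return page.$$eval(selector, (els) => els.map((el) => el.href));
}
```

Both should produce the same array on a static page; the second has fewer awaits and no window in which an element can disappear between the query and the read.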

ggorlen