To elaborate on Thomas' excellent answer, I'd like to offer some optimizations and additional considerations.
In the first snippet, there's a risk that the ElementHandle goes stale (the referenced element is removed from the document) between querying and usage. In Playwright, the recommended approach using locators avoids stale handles by re-querying them on every action (Puppeteer has introduced their own locators as of 20.6.0). Following this principle, the trusted event example should be written like:
const selector = "a.buy-now";
const link = await page.$eval(selector, el => el.getAttribute("href"));
await page.click(selector); // re-query the element at each usage
The double query may feel wrong since programmers are accustomed to DRY code, but it's more reliable in a real-time web scraping context and generally has negligible performance impact. Performance is usually determined by the number of page.* calls, which are cross-process network calls, rather than by the number of queries run inside the browser, which aren't.
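As a rough sketch of the same re-query principle using Puppeteer's own locators: the helper name readHrefThenClick is hypothetical, and this assumes a Puppeteer version recent enough to provide page.locator.

```javascript
// Hypothetical helper: reads the href, then issues a trusted click.
// Assumes `page` is a Puppeteer Page from a version that supports page.locator().
async function readHrefThenClick(page, selector) {
  // First query: runs in the browser and returns only the serializable href.
  const href = await page.$eval(selector, el => el.getAttribute("href"));

  // Second query: the locator looks the element up again when the click
  // is performed, so no ElementHandle is held between the two steps.
  await page.locator(selector).click();

  return href;
}
```

Nothing here stores an ElementHandle, so there's nothing to go stale if the page re-renders between the two steps.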
In the second snippet, text is never returned from the callback executed in the browser. Furthermore, page.$eval is shorthand for the common-case scenario where document.querySelector() is run as the first thing in the evaluate callback. So, assuming you don't need a trusted event, I'd write this code as:
const link = await page.$eval("a.buy-now", el => {
  const {href} = el;
  el.click();
  return href;
});
As for when to use each one, it depends on the goals of the script. In a testing context, the first is preferable since it emulates user actions more accurately, at the cost of speed. But if the trusted event version fails or the script runs in a scraping context where there's less motivation to treat the site as a user might, switch to the evaluate version.
Another consideration is that many scraping scripts start out as vanilla JS code written in the dev tools console. It's generally fine to treat Puppeteer as a wrapper on the browser console and plop your existing browser code into an evaluate or two alongside some waitForSelectors as needed. I've seen a good number of Puppeteer scripts left in a state of confusion because the author tried to translate their working console code to page.* calls, thinking they had to. In these cases, you may prefer the evaluate approach, or only partially converting the browser code to Puppeteer code on an as-needed basis for specific trusted events.
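For instance, here's a minimal sketch of wrapping console code in an evaluate. The helper name scrapeProductNames and the .product selector are hypothetical, standing in for whatever you worked out in the dev tools console.

```javascript
// Hypothetical helper; ".product" is an assumed selector for illustration.
async function scrapeProductNames(page) {
  // Puppeteer side: wait until the content you tested against exists.
  await page.waitForSelector(".product");

  // Browser side: the body of this callback is the same code you'd
  // paste into the dev tools console, run in a single round trip.
  return page.evaluate(() =>
    [...document.querySelectorAll(".product")].map(el =>
      el.textContent.trim()
    )
  );
}
```

The only Puppeteer-specific parts are the wait and the evaluate wrapper. The scraping logic itself stays as ordinary browser code.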
Another bonus point for the evaluate approach is that browser code is mostly synchronous, which is generally easier to write and debug. Puppeteer's async API has a ton of footguns: it's easy to create race conditions or encounter weird errors when you forget to await something. Working inside an evaluate also lets you take advantage of jQuery and other browser tools that make scraping easier, offering features like Sizzle selectors that Puppeteer doesn't support.
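Here's a hedged sketch of the jQuery idea. The helper name findBuyLinkWithJQuery, the jqueryUrl parameter and the "Buy now" link text are all hypothetical. It assumes the page either already ships jQuery or allows injecting it from a URL you supply.

```javascript
// Hypothetical helper; `jqueryUrl` is a URL you provide for a jQuery build.
async function findBuyLinkWithJQuery(page, jqueryUrl) {
  // Check whether the site already loaded jQuery.
  const hasJQuery = await page.evaluate(
    () => typeof window.jQuery === "function"
  );

  // If not, inject it with a script tag (the page must permit this).
  if (!hasJQuery) {
    await page.addScriptTag({ url: jqueryUrl });
  }

  // :contains() is a jQuery/Sizzle extension, not standard CSS.
  return page.evaluate(() =>
    window.jQuery('a:contains("Buy now")').attr("href")
  );
}
```

If no link matches, .attr("href") yields undefined, so the caller should be prepared for a missing result.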
Speaking more generally, I try to avoid ElementHandles, which are returned by page.$, page.$$ and page.evaluateHandle. They're inherently racy, make the code harder to read, and are slower than direct page.$eval, page.$$eval and page.evaluate calls. I use them only when trusted events are necessary, or when they're unavoidable for some other reason.
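To illustrate the difference, here's a small sketch that collects link hrefs both ways. The helper names getLinkHrefs and getLinkHrefsViaHandles are hypothetical, and the handle version is shown only for contrast.

```javascript
// Preferred: one page.$$eval call; the mapping runs inside the browser.
async function getLinkHrefs(page) {
  return page.$$eval("a", els => els.map(el => el.href));
}

// For contrast: ElementHandles cost one extra round trip per element,
// and any handle can go stale before its evaluate call runs.
async function getLinkHrefsViaHandles(page) {
  const handles = await page.$$("a");
  const hrefs = [];
  for (const handle of handles) {
    hrefs.push(await handle.evaluate(el => el.href));
    await handle.dispose(); // release the browser-side reference
  }
  return hrefs;
}
```

Both return the same array on a static page, but the first version has less code, fewer awaits to forget, and no window for elements to disappear mid-loop.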