The accepted answer shows how to visit pages serially, one at a time. However, you may want to visit multiple pages simultaneously when the task is embarrassingly parallel, that is, when scraping a particular page doesn't depend on data extracted from other pages.
A tool that can help achieve this is Promise.allSettled, which lets us fire off a bunch of promises at once, determine which were successful and harvest the results.
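To make the shape of the Promise.allSettled result concrete, here is a minimal sketch that runs without Puppeteer or network access. The `fakeScrape` and `scrapeAll` names are made up for illustration; `fakeScrape` is a stand-in for a real scraping task that sometimes fails.

```javascript
// Hypothetical stand-in for a scraping task: resolves for even ids,
// rejects for odd ids. No network access is involved.
const fakeScrape = async id => {
  if (id % 2 !== 0) {
    throw new Error(`failed to scrape ${id}`);
  }
  return `user-${id}`;
};

const scrapeAll = async ids => {
  // All tasks start immediately; allSettled waits for every one to
  // settle and does not reject when some of the tasks fail.
  const results = await Promise.allSettled(ids.map(fakeScrape));
  return {
    values: results
      .filter(e => e.status === "fulfilled")
      .map(e => e.value),
    errors: results
      .filter(e => e.status === "rejected")
      .map(e => e.reason.message),
  };
};

scrapeAll([1, 2, 3, 4]).then(({values, errors}) => {
  console.log(values); // [ 'user-2', 'user-4' ]
  console.log(errors); // [ 'failed to scrape 1', 'failed to scrape 3' ]
});
```

Each settled entry carries a `status` of "fulfilled" (with a `value`) or "rejected" (with a `reason`), so failures can be logged or retried rather than silently dropped.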
For a basic example, let's say we want to scrape usernames for Stack Overflow users given a series of ids.
Serial code:
const puppeteer = require("puppeteer"); // ^19.6.3

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const baseURL = "https://stackoverflow.com/users";
  const startId = 6243352;
  const qty = 5;
  const usernames = [];
  for (let i = startId; i < startId + qty; i++) {
    await page.goto(`${baseURL}/${i}`, {
      waitUntil: "domcontentloaded"
    });
    const sel = ".flex--item.mb12.fs-headline2.lh-xs";
    const el = await page.waitForSelector(sel);
    usernames.push(await el.evaluate(el => el.textContent.trim()));
  }
  console.log(usernames);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Parallel code:
const puppeteer = require("puppeteer"); // ^19.6.3

let browser;
(async () => {
  browser = await puppeteer.launch();
  const baseURL = "https://stackoverflow.com/users";
  const startId = 6243352;
  const qty = 5;
  const usernames = (await Promise.allSettled(
    // Start all qty tasks at once, each in its own page
    [...Array(qty)].map(async (_, i) => {
      const page = await browser.newPage();
      try {
        await page.goto(`${baseURL}/${i + startId}`, {
          waitUntil: "domcontentloaded"
        });
        const sel = ".flex--item.mb12.fs-headline2.lh-xs";
        const el = await page.waitForSelector(sel);
        return await el.evaluate(el => el.textContent.trim());
      } finally {
        // Close the page whether or not the scrape succeeded
        await page.close();
      }
    })
  ))
    // Keep only the scrapes that succeeded
    .filter(e => e.status === "fulfilled")
    .map(e => e.value);
  console.log(usernames);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Remember that this is a technique, not a silver bullet that guarantees a speed increase on all workloads. It will take some experimentation to find the optimal balance between the cost of creating more pages and the gains from parallelizing network requests for a particular task and system.
The example here is contrived since it's not interacting with the page dynamically, so there's not as much room for gain as in a typical Puppeteer use case that involves network requests and blocking waits per page.
Of course, beware of rate limiting and any other restrictions imposed by sites (running the code above may anger Stack Overflow's rate limiter).
For tasks where creating a page per task is prohibitively expensive, or where you'd like to set a cap on parallel request dispatches, consider using a task queue or combining the serial and parallel code shown above to send requests in chunks. This answer shows a generic pattern for this that is agnostic of Puppeteer.
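As a rough illustration of the chunking idea, the sketch below processes chunks serially and the items within each chunk in parallel. It is independent of Puppeteer; the `chunk`, `runInChunks` and `fakeTask` names are hypothetical, and `fakeTask` merely simulates a slow task while recording how many tasks run at once.

```javascript
// Split an array into consecutive slices of at most `size` items.
const chunk = (items, size) => {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
};

// Run `task` on every item with at most `size` tasks in flight:
// chunks run one after another, items within a chunk run together.
const runInChunks = async (items, size, task) => {
  const results = [];
  for (const group of chunk(items, size)) {
    const settled = await Promise.allSettled(group.map(item => task(item)));
    results.push(...settled);
  }
  return results;
};

// Demo with a hypothetical task that records peak concurrency.
let active = 0;
let peak = 0;
const fakeTask = async id => {
  active++;
  peak = Math.max(peak, active);
  await new Promise(resolve => setTimeout(resolve, 10));
  active--;
  return id * 2;
};

const demo = runInChunks([1, 2, 3, 4, 5], 2, fakeTask).then(results => {
  const values = results.map(e => e.value);
  console.log(values); // [ 2, 4, 6, 8, 10 ]
  console.log(peak); // 2
  return values;
});
```

In a scraper, `task` would presumably be a function that opens a page, scrapes it and closes it. One trade-off of fixed chunks is that each chunk waits for its slowest task before the next chunk starts, which a task queue avoids.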
These patterns can be extended to handle the case when certain pages depend on data from other pages, forming a dependency graph.
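One possible way to handle such dependencies, sketched here under the assumption that the graph has no cycles, is to memoize one promise per task and have each task wait on the promises of its dependencies. The task names and the `runGraph` helper are invented for illustration, and the `run` functions return canned data instead of scraping anything.

```javascript
// Hypothetical task graph. Each task names the tasks it depends on and
// provides a function that receives their results in the same order.
const graph = {
  profile: {deps: [], run: async () => ({id: 42})},
  posts: {
    deps: ["profile"],
    run: async ([profile]) => [`post by ${profile.id}`],
  },
  badges: {
    deps: ["profile"],
    run: async ([profile]) => [`badge for ${profile.id}`],
  },
  summary: {
    deps: ["posts", "badges"],
    run: async ([posts, badges]) => posts.length + badges.length,
  },
};

// Resolve every task in the graph, assuming it has no cycles.
const runGraph = async graph => {
  const cache = new Map(); // task name -> promise for its result
  const runTask = name => {
    if (!cache.has(name)) {
      const {deps, run} = graph[name];
      // Wait for the dependencies, then run this task on their results.
      cache.set(name, Promise.all(deps.map(runTask)).then(run));
    }
    return cache.get(name);
  };
  const names = Object.keys(graph);
  const values = await Promise.all(names.map(runTask));
  return Object.fromEntries(names.map((name, i) => [name, values[i]]));
};

runGraph(graph).then(results => console.log(results.summary)); // 2
```

Because of the cache, each task runs once even when several tasks depend on it, and independent tasks such as `posts` and `badges` run in parallel. This sketch uses Promise.all rather than Promise.allSettled, so a failed task also fails everything that depends on it.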
See also "Using async/await with a forEach loop", which explains why the original attempt in this thread using map fails to wait for each promise.
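The pitfall is easy to reproduce outside Puppeteer. In this minimal sketch (the `delay`, `withoutWaiting` and `withWaiting` names are made up for illustration), an async callback passed to map returns promises that nobody awaits, so the surrounding function carries on before any callback has finished.

```javascript
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Pitfall: map starts the callbacks and returns an array of promises,
// but nothing awaits them, so execution continues immediately.
const withoutWaiting = async ids => {
  const seen = [];
  ids.map(async id => {
    await delay(10);
    seen.push(id);
  });
  return seen.length; // the callbacks are still pending here
};

// Fix: await the promises that map returns.
const withWaiting = async ids => {
  const seen = [];
  await Promise.all(ids.map(async id => {
    await delay(10);
    seen.push(id);
  }));
  return seen.length;
};

withoutWaiting([1, 2, 3]).then(n => console.log(n)); // 0
withWaiting([1, 2, 3]).then(n => console.log(n)); // 3
```

Wrapping the mapped promises in Promise.all or Promise.allSettled runs the callbacks in parallel and waits for all of them; a for...of loop with await inside runs them one at a time instead.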