
I'm attempting to iterate through a list of URLs, but instead of Puppeteer loading each page, it only loads one. What can I do to make this work?

const puppeteer = require('puppeteer');

async function main() {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.setViewport({width: 1200, height: 720});
  await page.goto('https://s23.a2zinc.net/clients/acmedia/americancoatingsshow2022/Public/Exhibitors.aspx?Index=All#', { waitUntil: 'networkidle0' }); // wait until page load
  const hrefs = await page.$$eval('a', as => as.map(a => a.href));

  for (let i = 0; i < hrefs.length; i++) {
    const url = hrefs[i];
    if (url.includes('eBooth.aspx')) {
      console.log(url)
      const page = await browser.newPage()
      await page.goto(`${url}`);
      await page.waitForNavigation({ waitUntil: 'networkidle0' });
    }
  }
}
main();
Steve

1 Answer


The main problem is the extra await page.waitForNavigation({ waitUntil: 'networkidle0' });, which will never resolve and eventually throws a timeout error. page.goto already waits for the navigation to complete, so you're asking Puppeteer to wait for a second navigation that will never happen.

Only use page.waitForNavigation if you're doing something to trigger a navigation, not as part of a typical page.goto call. Remove this line and your code should work (more or less) as expected.
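If you want the stricter networkidle0 load condition on these navigations, pass it to goto directly rather than adding a separate wait:

const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle0' }); // goto itself waits for this condition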

Furthermore, you're opening a whole new page (browser tab) per link. That's 360 tabs by my count, liable to run most computers out of memory. Better to navigate a single page repeatedly or close pages after you're finished doing whatever you plan to do on these pages. If that's too slow, try running chunks in parallel or using a task queue.
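For example, here's a rough sketch of chunked parallelism that closes each page when it's done (the chunk size of 5 and the scraping callback are placeholders to adapt to your use case):

const processInChunks = async (browser, hrefs, concurrency = 5) => {
  for (let i = 0; i < hrefs.length; i += concurrency) {
    const chunk = hrefs.slice(i, i + concurrency);
    await Promise.all(chunk.map(async url => {
      const page = await browser.newPage();
      try {
        await page.goto(url, {waitUntil: "domcontentloaded"});
        // do your scraping on `page` here
      } finally {
        await page.close(); // free the tab so memory stays bounded
      }
    }));
  }
};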

Also, the links are available in the static HTML, so you might not need Puppeteer here, again, depending on what you're planning on doing on each page. If you can get all of the data from each page statically, you could have a massive speedup, completing 360 scrapes with fetch/cheerio in a fraction of the time it'd take Puppeteer.
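If the static route works for your use case, the link extraction might look something like this (a sketch assuming Node 18+ for global fetch; cheerio is a separate install):

const cheerio = require("cheerio"); // ^1.0.0-rc.12

const url = "https://s23.a2zinc.net/clients/acmedia/americancoatingsshow2022/Public/Exhibitors.aspx?Index=All";

fetch(url)
  .then(res => {
    if (!res.ok) throw Error(res.statusText);
    return res.text();
  })
  .then(html => {
    const $ = cheerio.load(html);
    const hrefs = $("a[href]")
      .map((i, el) => $(el).attr("href"))
      .get()
      .filter(href => href.includes("eBooth.aspx"))
      // attr("href") gives the raw attribute; resolve relative paths if needed
      .map(href => new URL(href, url).href);
    console.log(hrefs.length);
  })
  .catch(err => console.error(err));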

If you do stick with Puppeteer to bypass detection or deal with JS/interactivity, consider using waitUntil: 'domcontentloaded' rather than 'networkidle0', which is usually unnecessarily strict and slow. See my answer in the canonical thread Puppeteer wait until page is completely loaded for a deeper dive into the differences between the various load conditions in Puppeteer.

a[href] is a more precise selector than a, because some anchors may have no href attribute; those should be discarded so empty values don't pop up in your results.

Here's how I'd write this (with the aforementioned caveat that Puppeteer might not be needed at all):

const puppeteer = require("puppeteer"); // ^14.3.0

let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();
  await page.setViewport({width: 1200, height: 720});
  const url = "https://s23.a2zinc.net/clients/acmedia/americancoatingsshow2022/Public/Exhibitors.aspx?Index=All#";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const hrefs = await page.$$eval("a[href]", els =>
    els.map(a => a.href).filter(e => e.includes("eBooth.aspx"))
  );
  console.log(hrefs.length); // => 360

  for (const url of hrefs) {
    await page.goto(url);
    // page is loaded; do your thing on this page
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;
ggorlen