
I'm trying to scrape ~1500 URLs from a single Node process using the puppeteer-cluster package: https://github.com/thomasdondorf/puppeteer-cluster

I'm doing this on a DigitalOcean droplet with 2GB of RAM, but I'm getting repeated "Navigation failed because browser has disconnected" errors no matter how I tweak the settings. Tempted to remove the package and try to roll it myself to see if I can get better results. Is there something I'm missing?
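
For context, if I did rip the package out, what I have in mind is roughly this: one shared browser and a small pool of pages acting as workers, each pulling the next URL off a shared queue. This is just a sketch (the worker count and the `evaluate` body are placeholders, not my real code):

  const puppeteer = require("puppeteer");

  // Sketch only: one shared browser, a fixed pool of pages acting as
  // workers, each pulling the next URL off a shared queue.
  async function scrapeAll(urls, workerCount = 2) {
    const browser = await puppeteer.launch({
      args: ["--disable-setuid-sandbox", "--no-sandbox"]
    });
    const queue = [...urls];
    const results = [];

    const worker = async () => {
      const page = await browser.newPage();
      while (queue.length > 0) {
        const url = queue.shift();
        try {
          await page.goto(url, { waitUntil: "domcontentloaded" });
          // Placeholder for the real DOM-manipulation logic.
          results.push(await page.evaluate(() => ({ title: document.title })));
        } catch (err) {
          console.log(`Failed on ${url}: ${err.message}`);
        }
      }
      await page.close();
    };

    await Promise.all(Array.from({ length: workerCount }, worker));
    await browser.close();
    return results;
  }

You'd call it with something like `await scrapeAll(urls, 2)` from the same async context.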

Here's the gist of the puppeteer-cluster code I'm working with right now:


  try {
    cluster = await Cluster.launch({
      concurrency: Cluster.CONCURRENCY_BROWSER,
      puppeteerOptions: {
        args: ["--disable-setuid-sandbox", "--no-sandbox"]
      },
      maxConcurrency: 4
    });
  } catch (e) {
    console.log(e.message);
  }

  cluster.on("taskerror", (err, data) => {
    console.log(`Error while crawling: ${err.message}, with this data:`);
    console.log(data);
  });

  await cluster.task(async ({ page, data }) => {
    // Do stuff...
    const { sitemapObject } = data;

    await page.goto(url);

    let result = await page.evaluate(() => {
      // I do some DOM manipulation stuff here.
      let stuff = {};
      return stuff;
    });
  });

  for (let sitemapObject of sitemapObjects) {
    await cluster.queue({
      sitemapObject
    });
  }

  try {
    await cluster.idle();
    await cluster.close();
  } catch (error) {
    console.log(error.message);
  }
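
For reference, these are the kinds of settings I've been tweaking, so far without luck (the values here are illustrative). Page-level concurrency shares one Chromium across workers instead of launching a full browser per worker as CONCURRENCY_BROWSER does, which should be lighter on a 2GB droplet, and the built-in retry options re-queue a URL when a task fails:

  cluster = await Cluster.launch({
    // One shared browser with one page per worker, instead of a full
    // Chromium instance per worker as with CONCURRENCY_BROWSER.
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 2, // illustrative; 4 full browsers is a lot for 2GB
    puppeteerOptions: {
      args: ["--disable-setuid-sandbox", "--no-sandbox"]
    },
    // Re-queue failed URLs a couple of times before giving up.
    retryLimit: 2,
    retryDelay: 1000,
    timeout: 60000
  });
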
Comments:
- Author here, is the variable `url` actually defined? Are you sure you are not missing `await`s? – Thomas Dondorf Aug 01 '19 at 07:17
- Seems to be a resource problem ([issue](https://github.com/thomasdondorf/puppeteer-cluster/issues/179) in the repo). See [this answer](https://stackoverflow.com/a/57295869/5627599) for more information. – Thomas Dondorf Aug 01 '19 at 07:29

0 Answers