I'm trying to scrape ~1500 URLs from a Node process using the puppeteer-cluster package: https://github.com/thomasdondorf/puppeteer-cluster
I'm running this on a DigitalOcean droplet with 2GB of RAM, but I keep getting repeated "Navigation failed because browser has disconnected" errors no matter how I tweak the settings (examples below). I'm tempted to drop the package and roll the scraping myself (sketch at the bottom) to see if I get better results. Is there something I'm missing?
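For reference, here's the sort of tweaking I've tried: lowering maxConcurrency, switching from CONCURRENCY_BROWSER to CONCURRENCY_PAGE, and adding retries. (timeout, retryLimit, and retryDelay are documented puppeteer-cluster launch options; the values here are just examples.)

// One variant I've tried: pages sharing a single Chromium
// instance instead of one full browser per worker.
cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_PAGE,
  maxConcurrency: 2,      // fewer parallel workers
  timeout: 60 * 1000,     // per-task timeout in ms
  retryLimit: 2,          // re-queue a failed task up to twice
  retryDelay: 1000,       // ms to wait before retrying
  puppeteerOptions: {
    args: ["--disable-setuid-sandbox", "--no-sandbox"]
  }
});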
Here's the gist of the code I'm working with:
const { Cluster } = require("puppeteer-cluster");

let cluster;
try {
  cluster = await Cluster.launch({
    // One full Chromium instance per worker.
    concurrency: Cluster.CONCURRENCY_BROWSER,
    puppeteerOptions: {
      args: ["--disable-setuid-sandbox", "--no-sandbox"]
    },
    maxConcurrency: 4
  });
} catch (e) {
  console.log(e.message);
}
cluster.on("taskerror", (err, data) => {
console.log(`Error while crawling: ${err.message}, with this data:`);
console.log(data);
});
await cluster.task(async ({ page, data }) => {
  // Do stuff...
  const { sitemapObject } = data;
  // Each sitemap entry carries the URL to visit.
  await page.goto(sitemapObject.url);
  let result = await page.evaluate(() => {
    // I do some DOM manipulation stuff here.
    let stuff = {};
    return stuff;
  });
  // ...then do something with result.
});
// sitemapObjects is parsed from the sitemap earlier in the process.
for (const sitemapObject of sitemapObjects) {
  await cluster.queue({ sitemapObject });
}
try {
  await cluster.idle();
  await cluster.close();
} catch (error) {
  console.log(error.message);
}
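And for what it's worth, if I do roll it myself, I'm picturing something like the sketch below: one shared browser and a small worker pool pulling from a queue. (Untested; the concurrency value and 60-second timeout are arbitrary.)

const puppeteer = require("puppeteer");

async function scrapeAll(urls, concurrency = 2) {
  const browser = await puppeteer.launch({
    args: ["--disable-setuid-sandbox", "--no-sandbox"]
  });
  const queue = [...urls];
  const results = [];

  // Each worker owns one page and pulls URLs off the shared
  // queue until it's empty.
  const workers = Array.from({ length: concurrency }, async () => {
    const page = await browser.newPage();
    while (queue.length > 0) {
      const url = queue.shift();
      try {
        await page.goto(url, { timeout: 60 * 1000 });
        const stuff = await page.evaluate(() => {
          // Same DOM scraping as above would go here.
          return {};
        });
        results.push({ url, stuff });
      } catch (err) {
        console.log(`Failed on ${url}: ${err.message}`);
      }
    }
    await page.close();
  });

  await Promise.all(workers);
  await browser.close();
  return results;
}

The idea is to cap it at a single Chromium process no matter how many workers run, since four full browsers on a 2GB droplet seems like a lot. Would that actually behave any differently under memory pressure than what the cluster package does?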