I have the following scenario:
- My scrapes are behind a login, so there is one login page that I always need to hit first
- Then I have a list of 30 URLs that can be scraped asynchronously for all I care
- Then at the very end, when all 30 URLs have been scraped, I need to hit one last separate URL to put the results of the 30 URL scrapes into a Firebase DB and to do some other mutations (like geo lookups for addresses etc.)
Currently I have all 30 URLs in a request queue (through the Apify web interface) and I'm trying to detect when they are all finished.
But obviously they all run async, so that data is never reliable:
const queue = await Apify.openRequestQueue();
const { pendingRequestCount } = await queue.getInfo(); // getInfo() resolves to an info object, not a number
There are two reasons why I need that last URL to be separate:
- Most obviously, I need to be sure I have the results of all 30 scrapes before I send everything to the DB
- None of the 30 URLs allows me to make Ajax / Fetch calls, which I need for sending data to Firebase and doing the geo lookups of addresses
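To make the required ordering concrete, here is a minimal sketch in plain JavaScript with no Apify involved. The functions `login`, `scrapePage`, `saveResults` and `runPipeline` are hypothetical stand-ins, not part of any Apify API; the sketch only shows the order I need: login first, all pages in parallel, final step last.

```javascript
// Hypothetical stand-ins for the real steps; none of these are Apify APIs.
async function login(log) {
    log.push('login');
}

async function scrapePage(url, log) {
    // Simulate pages finishing in a different order than they were started.
    await new Promise((resolve) => setTimeout(resolve, Math.random() * 20));
    log.push(`scraped ${url}`);
    return { url, title: `Title of ${url}` };
}

async function saveResults(results, log) {
    // In the real setup this would be the separate URL that talks to Firebase.
    log.push(`saved ${results.length} results`);
    return results.length;
}

async function runPipeline(urls) {
    const log = [];
    await login(log); // 1. always first
    // 2. all pages in parallel; Promise.all resolves only once every page is done
    const results = await Promise.all(urls.map((url) => scrapePage(url, log)));
    const saved = await saveResults(results, log); // 3. runs strictly after step 2
    return { saved, log };
}
```

Calling `runPipeline(['a', 'b', 'c'])` resolves to the number of saved results plus a log in which `login` is always the first entry and the `saved ...` line is always the last, regardless of the order in which the pages finish.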
Edit: Tried this based on the answer from @Lukáš Křivka. handledRequestCount in the while loop reaches a max of 2, never 4 ... and Puppeteer just ends normally. I've put the "return" inside the while loop because otherwise requests never finish (of course).
In my current test setup I have 4 URLs to be scraped (in the Start URLs input field of Puppeteer Scraper on Apify.com) and this code:
let title = "";
const queue = await Apify.openRequestQueue();
let { handledRequestCount } = await queue.getInfo();
while (handledRequestCount < 4) {
    await new Promise((resolve) => setTimeout(resolve, 2000)); // wait for 2 secs
    handledRequestCount = await queue.getInfo().then((info) => info.handledRequestCount);
    console.log(`Currently handled here: ${handledRequestCount} --- waiting`); // this goes max to '2'
    title = await page.evaluate(() => { return $('h1').text(); });
    return { title };
}
log.info("Here I want to add another URL to the queue where I can do ajax stuff to save results from above runs to firebase db");
title = await page.evaluate(() => { return $('h1').text(); });
return { title };
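For comparison, this is roughly the behaviour I am after, sketched against an in-memory stand-in rather than the real request queue. `FakeQueue`, `crawl` and `handlePage` are hypothetical names and are not the Apify `RequestQueue` API. No page waits for the others; whichever page happens to finish last adds the final URL.

```javascript
// In-memory stand-in for a request queue; NOT the Apify RequestQueue API.
class FakeQueue {
    constructor(urls) {
        this.pending = [...urls];
        this.handledCount = 0;
    }

    addRequest(url) {
        this.pending.push(url);
    }
}

async function crawl(startUrls, finalUrl) {
    const queue = new FakeQueue(startUrls);
    const visited = [];
    const total = startUrls.length;

    // Hypothetical page handler: does its own work and never waits on other pages.
    async function handlePage(url) {
        await new Promise((resolve) => setTimeout(resolve, Math.random() * 20));
        visited.push(url);
        if (url === finalUrl) return; // the final page does not count itself
        queue.handledCount += 1;
        // Only the page that finishes last sees the full count and enqueues the final URL.
        if (queue.handledCount === total) queue.addRequest(finalUrl);
    }

    // Drain the queue in parallel batches until nothing is left.
    while (queue.pending.length > 0) {
        const batch = queue.pending.splice(0);
        await Promise.all(batch.map(handlePage));
    }
    return visited;
}
```

With four start URLs, `crawl` visits them in some arbitrary order and then visits the final URL exactly once, as the fifth and last entry. What I cannot figure out is how to get the same guarantee from the real queue's counters.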