
I have the following scenario:

  • My scrapes are behind a login, so there is one login page that I always need to hit first
  • then I have a list of 30 URLs that can be scraped asynchronously for all I care
  • then at the very end, when all 30 URLs have been scraped, I need to hit one last, separate URL to put the results of the 30-URL scrape into a Firebase DB and to do some other mutations (like geo lookups for addresses, etc.)

Currently I have all 30 URLs in a request queue (through the Apify web interface) and I'm trying to detect when they have all finished.

But since they all run asynchronously, that data is never reliable:

const queue = await Apify.openRequestQueue();
// getInfo() returns an object with counts such as pendingRequestCount and handledRequestCount
let { pendingRequestCount } = await queue.getInfo();

The reasons why I need that last URL to be separate are two-fold:

  1. The most obvious reason is that I need to be sure I have the results of all 30 scrapes before I send everything to the DB
  2. None of the 30 URLs allow me to do Ajax/Fetch calls, which I need for sending to Firebase and doing the geo lookups of addresses

Edit: I tried this based on the answer from @Lukáš Křivka. handledRequestCount in the while loop reaches a maximum of 2, never 4, and Puppeteer just ends normally. I've put the "return" inside the while loop because otherwise the requests never finish (of course).

In my current test setup I have 4 URLs to be scraped (in the Start URLs input field of Puppeteer Scraper on Apify.com) and this code:

let title = "";
const queue = await Apify.openRequestQueue();
let { handledRequestCount } = await queue.getInfo();
while (handledRequestCount < 4) {
    await new Promise((resolve) => setTimeout(resolve, 2000)); // wait for 2 secs
    handledRequestCount = await queue.getInfo().then((info) => info.handledRequestCount);
    console.log(`Currently handled here: ${handledRequestCount} --- waiting`); // this goes to a max of '2'
    title = await page.evaluate(() => { return $('h1').text(); });
    return { title };
}
log.info("Here I want to add another URL to the queue where I can do ajax stuff to save results from above runs to firebase db");
title = await page.evaluate(() => { return $('h1').text(); });
return { title };

3 Answers


I would need to see your code to answer completely correctly, but this has solutions.

Simply use Apify.PuppeteerCrawler for the 30 URLs. Then you run the crawler with await crawler.run().

After that, you can simply load the data from the default dataset via

const dataset = await Apify.openDataset();
const data = await dataset.getData().then((response) => response.items);

And do whatever you want with the data; you can even create a new Apify.PuppeteerCrawler to crawl the last URL and use the data.
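
A minimal sketch of that flow, assuming the Apify SDK v0.x API (the example URLs and the extraction inside handlePageFunction are placeholders, not something from the question):

// Minimal sketch of the flow above (Apify SDK v0.x). The example URLs and the
// extraction inside handlePageFunction are placeholders.
Apify.main(async () => {
    const requestList = await Apify.openRequestList('scrape-urls', [
        { url: 'https://example.com/page-1' },
        { url: 'https://example.com/page-2' },
        // ...the rest of the 30 URLs
    ]);

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            const title = await page.title();
            await Apify.pushData({ url: request.url, title });
        },
    });

    // run() resolves only after every request has been handled
    await crawler.run();

    // The default dataset is now complete
    const dataset = await Apify.openDataset();
    const { items } = await dataset.getData();

    // ...create a second crawler for the final URL here, or upload `items` to Firebase
});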

If you are using Web Scraper though, it is a bit more complicated. You can either:

1) Create a separate actor for the Firebase upload and trigger it via a webhook from your Web Scraper run so it can load the data from that run. If you look at the Apify Store, we already have a Firestore uploader.

2) Add logic that polls the requestQueue as you did, and proceed only when all the requests are handled. You can create some kind of loop that waits, e.g.:

const queue = await Apify.openRequestQueue();
let { handledRequestCount } = await queue.getInfo();
while (handledRequestCount < 30) {
    console.log(`Currently handled: ${handledRequestCount} --- waiting`);
    await new Promise((resolve) => setTimeout(resolve, 2000)) // wait for 2 secs
    handledRequestCount = await queue.getInfo().then((info) => info.handledRequestCount);
}
// Do your Firebase stuff
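
For the Firebase part itself, here is a rough sketch of what the upload could look like if this code runs in a Node.js context (e.g. an actor built directly on the Apify SDK rather than the in-browser Web Scraper page function). firebase-admin, the service-account file and the 'scrapes' collection name are assumptions for illustration, not something from the question:

// Rough illustration only: firebase-admin, the service-account key path and
// the 'scrapes' collection are placeholders.
const admin = require('firebase-admin');

admin.initializeApp({
    credential: admin.credential.cert(require('./service-account.json')),
});
const db = admin.firestore();

// Load everything the page handlers pushed to the default dataset
const dataset = await Apify.openDataset();
const { items } = await dataset.getData();

// Write all items to Firestore in a single batch (fine for 30 documents)
const batch = db.batch();
items.forEach((item) => batch.set(db.collection('scrapes').doc(), item));
await batch.commit();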
Lukáš Křivka
  • thanks for chiming in. I edited my question, trying out your code (probably in the wrong way) – Jason Cooper Aug 13 '19 at 10:38
  • Maybe there is something I overlooked. It would be great to see the context where you run this code. Also, each run has a request queue tab attached where you can see how many requests are pending/handled. Perhaps you have fewer or more than 4. Of course, if you add the `return` inside the while loop, it will break the solution. – Lukáš Křivka Aug 14 '19 at 11:12

In the scenario where you have one async function that's called for all 30 URLs you scrape, first make sure the function returns its result after all necessary awaits. You could then await Promise.all(arrayOfAll30Promises) and run your last piece of code.
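
A minimal illustration of that idea; scrapeOne and saveToFirebase are placeholders for whatever scrapes a single URL and whatever does the upload:

// scrapeOne() and saveToFirebase() are placeholders for illustration only.
const scrapeOne = async (url) => {
    // ...open the page and extract whatever you need...
    return { url, title: 'example' };
};

const results = await Promise.all(urls.map((url) => scrapeOne(url)));

// This line only runs once all 30 promises have resolved:
await saveToFirebase(results);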

Mouradif
  • thanks for contributing. I wonder if it would be too much to ask for you to help me figure out where this code needs to go (and what it would look like) in case I'm using the web-interface version of Apify puppeteer scraper. Everything there is wrapped in one async function that gets repeated for all crawled pages: async function pageFunction(context) { – Jason Cooper Aug 13 '19 at 15:26

Because I was not able to get consistent results with the {handledRequestCount} from getInfo() (see my edit in my original question), I went another route.

I'm basically keeping a record of which URLs have already been scraped via the key/value store.

const queue = await Apify.openRequestQueue();

const urls = [
    { done: false, label: "vietnam", url: "https://en.wikipedia.org/wiki/Vietnam" },
    { done: false, label: "cambodia", url: "https://en.wikipedia.org/wiki/Cambodia" }
];

// Loop over the array and add the URLs to the queue
for (let i = 0; i < urls.length; i++) {
    await queue.addRequest(new Apify.Request({ url: urls[i].url }));
}

// Push the array to the key/value store under the key 'URLS'
await Apify.setValue('URLS', urls);

Now, every time I've processed a URL, I set its "done" value to true. When they are all true, I push another (final) URL into the queue:

 await queue.addRequest(new Apify.Request({ url: "http://www.placekitten.com" }));
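
A rough sketch of how the marking and the final check could look inside the pageFunction (only the 'URLS' key and the final URL come from the snippets above; the rest is illustrative):

// Mark the current URL as done and, once every entry is done,
// enqueue the final "save to Firebase" URL.
// `request` here is the current request (context.request in Puppeteer Scraper).
const urls = await Apify.getValue('URLS');
const current = urls.find((u) => u.url === request.url);
if (current) current.done = true;
await Apify.setValue('URLS', urls);

if (urls.every((u) => u.done)) {
    await queue.addRequest(new Apify.Request({ url: "http://www.placekitten.com" }));
}

With several pages running concurrently, this read-modify-write on the key/value store can race, so it may be worth lowering the concurrency or tolerating an occasional duplicate of the final request.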