
I've written a scraper that iterates through every page on a website and extracts the information. There are a lot of pages; if the program ran non-stop, it would take about a week to finish. However, every two or three hours it hangs when it tries to extract the info from a page, and it never continues. This is frustrating because I keep having to restart the script. Here is the skeleton of it, run using Node.js:

var index = 0;
var finalIndex = 50000;

function scrape(){
    if(index < finalIndex){
        //hit the website using nightmare, navigate to page, extract info, store as JSON
        console.log("finished scraping page number: ", index);
        index++;
        scrape();
    }
}

scrape();

I'd like to have a function, in this file or another, that runs the scrape function, and then every 2 hours kills the function and restarts it from the last index that it tried to scrape from. I've tried thinking of formulations using setTimeout, but I'm not sure how to kill a function stack half-way through. I also don't want the restarting function to fail if the scrape function has already started hanging.

What's the best way for me to do this? Other solutions to this problem are welcome, but even from a JavaScript knowledge standpoint I'd like to know how to do this for the future.

Here is my function in a bit more detail:

function scrape() {
    console.log("initializing scrape from index: " + index);
    var nightmare = Nightmare();
    if (index < indexEnd) {

    nightmare
    .goto(hidTestURL) //connect to the main site
    .wait('input[name="propertySearchOptions:advanced"]')
    .wait(4000)
    .goto(pageURL) //navigate to the specific entry's info page
    .wait('a[id="propertyHeading_searchResults"]')
    .wait(2500)
    .evaluate(function(){
        return document.querySelector('body').innerHTML;
    })
    .then(function(html){
      return xP([html, {data: css.data}])() //scrape the data from the page
    })
    .then(cleanDetails)
    .then(writeResult)
    .then(_ => {
                nightmare.end();
                nightmare.proc.disconnect();
                nightmare.proc.kill();
                nightmare.ended = true;
                nightmare = null;
         })
    .then(function(){
          console.log("successful scrape for ", ids[index]);
          ++index;
          setTimeout(scrape(), interval); //start scraping the next entry after a specified delay (default 4 seconds)
        })
    .catch(function(e){
      if (e.message === 'EmptyProperty'){
        console.log('EmptyProperty');
          ++index;
          setTimeout (scrape, interval / 2);
      }
      else {
            return appendFileP(logFile, new Date().toString() + " unhandled error at " + street + index + ' ' + e + '\r\n', 'utf8')
                .then(function(){
                    if (numOfTries < 2){
                        console.log("Looks like some other error, I'll retry: %j", e.message);
                        ++numOfTries;                      
                        setTimeout (scrape, interval * 5);
                        return nightmare.end();
                    }
                    else {
                        console.log("Tried 3 times, moving on");
                        ++index;
                        numOfTries = 0;
                        setTimeout (scrape, interval * 5);
                        return nightmare.end();
                    }
                });
        }
    })

    }
}

There are helper functions whose code I haven't included, but their names should be obvious, and I don't think their behavior is an important part of the problem. I also want to make it clear that I'm running this using Node; it never runs in a browser.

Phylth
  • I don't think this can be done in JavaScript. (Someone teach me this one if I'm wrong :D) A browser might offer to kill the process if it takes too much time to execute or load. EDITED: But Node.js is NOT just JavaScript; you can kill processes in Node.js. – Canilho Oct 12 '16 at 19:51
  • Trying not to stray too far from your method, maybe you can try something like the approach described here: http://stackoverflow.com/questions/672732/prevent-long-running-javascript-from-locking-up-browser – Sajjan Sarkar Oct 12 '16 at 20:00
  • Find out what causes your code to hang and fix that. Until then, you might use the task scheduler that comes with your OS to kill and restart the `node` process every two hours. – Bergi Oct 12 '16 at 21:04
  • Is `scrape` asynchronous or are you really starting 50000 nightmare visits concurrently (and recursing to a stack depth of 50000)?! Please show us your actual code. – Bergi Oct 12 '16 at 21:05
  • I use promises to make sure only one scrape is performed at a time; otherwise I get blocked. – Phylth Oct 13 '16 at 18:03
  • Another thing to keep in mind is that when I restart a process, it needs to use the current index value. I'm not sure how to do that if I use a task scheduler. – Phylth Oct 13 '16 at 18:03
  • Consider using web workers. –  Oct 13 '16 at 18:21
  • @torazaburo Not in nodejs? – Bergi Oct 13 '16 at 18:22
  • 1
    You surely want `setTimeout(scrape, interval);` not `setTimeout(scrape(), interval);` – Bergi Oct 13 '16 at 18:22
  • @Phylth If you write a long-running process that might get killed and restarted in the middle, you need to save your progress somewhere (e.g. to the file system) so that you can continue where you left off – Bergi Oct 13 '16 at 18:23 (see the sketch below this comment thread)
  • I'm wondering why you're trying to hack around this instead of figuring out why it hangs. Have you monitored things like memory usage to see if it's growing steadily? – DavidS Oct 13 '16 at 18:25
  • So where exactly is that "*hangs when it tries to extract the info from the page*" in your code, and what logs do you get before it breaks? Are you saying that one of the promises simply never resolves (because it waits indefinitely for the info to appear)? Then give it a timeout. – Bergi Oct 13 '16 at 18:25
  • @Bergi Putting timeouts inside each promise? That sounds like a viable solution; I'll definitely give it a try. I don't think `scrape()` vs `scrape` has been making any difference inside the timeout function; should it? – Phylth Oct 13 '16 at 20:27
  • I believe the hanging happens when nightmare tries to access the main website URL, or perhaps when it clicks on a button to navigate somewhere else on the website. It's been hard to pinpoint, but it's at one of those two points, which is why I have double wait functions going on. There were problems near the beginning with trying to continue before nightmare had fully accessed the page. – Phylth Oct 13 '16 at 20:29
  • @DavidS No, I have not monitored the memory usage; I don't know how to do this with Node.js. Do you know a good article to read on this? I tried putting nightmare.end() statements everywhere, as I thought it was never closing and therefore eating up memory, but it didn't make a difference. – Phylth Oct 13 '16 at 20:33
  • I have a PR for a `halt` api to help accomplish this. You can try to get it merged, fork it, or use it directly. https://github.com/segmentio/nightmare/pull/788 – Nick Oct 13 '16 at 21:02
  • You can use Promise.race to make sure it never takes longer than a certain amount of time too (with `halt` to kill the instance) https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/race – Nick Oct 13 '16 at 21:16
  • @Nick Can you make that an answer? – Bergi Oct 13 '16 at 21:18
  • @Bergi Yupp! just did – Nick Oct 13 '16 at 21:27
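
To make the index survive a process restart, as suggested in the comments above, the progress can be written to disk after each page and read back at startup. A minimal sketch, assuming a hypothetical progress.json file (the file name and the saveProgress helper are illustrative, not part of the original code):

var fs = require('fs');
var progressFile = 'progress.json'; // hypothetical location for the saved index

// On startup, resume from the last saved index, or start at 0 on the first run.
var index = 0;
try {
    index = JSON.parse(fs.readFileSync(progressFile, 'utf8')).index;
} catch (e) {
    // no progress file yet (or it is unreadable); start from the beginning
}

// Call this after each successful page, right before scheduling the next one.
function saveProgress() {
    fs.writeFileSync(progressFile, JSON.stringify({ index: index }), 'utf8');
}

With something like this in place, an external scheduler (cron, the OS task scheduler, or a wrapper script) can kill and restart the node process every two hours without losing the current position.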

3 Answers


I had to tackle a similar problem before, and the solution I chose was to ensure that each page finishes within a certain amount of time, and otherwise to continue to the next page. You can wrap the nightmare code in a promise and use Promise.race to ensure it finishes within a set amount of time. Then, if it times out, use the .halt API that was introduced in v2.8.0 to prevent memory leaks and abandoned processes.

It would look something like this:

Promise.race([
  doNightmareCodeAndReturnPromise(nightmareInstance),
  new Promise((resolve, reject) => setTimeout(() => reject('timed out'), 5000))
])
.then(result => { /* save result */ })
.catch(error => {
  if (error === 'timed out') nightmareInstance.halt()
})
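
Applied to the question's recursive loop, the pattern might look like the sketch below. Here doNightmareCodeAndReturnPromise stands in for the goto/wait/evaluate chain from the question, the two-minute limit is an arbitrary choice, and retrying the same index after a timeout is an assumption about the desired behavior rather than part of this answer:

function scrapeWithTimeout() {
    if (index >= indexEnd) return; // all pages done
    var nightmare = Nightmare();
    Promise.race([
        doNightmareCodeAndReturnPromise(nightmare), // goto/wait/evaluate, ending the instance on success
        new Promise(function (resolve, reject) {
            setTimeout(function () { reject('timed out'); }, 120000);
        })
    ])
    .then(function (result) {
        // the page finished in time: store the result and move on
        ++index;
        setTimeout(scrapeWithTimeout, interval);
    })
    .catch(function (error) {
        if (error === 'timed out') nightmare.halt(); // kill the hung instance
        // retry the same index with a fresh Nightmare instance
        setTimeout(scrapeWithTimeout, interval);
    });
}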
Nick

JavaScript is single-threaded, so you cannot "kill" a running function from "outside": there simply is nothing "outside" (like another thread).

The only multitasking option you have with JS is cooperative multitasking: you design your function to do a small chunk of the job each time it gets invoked.

Here is an example of such a chunked function:

var index = 0;
var finalIndex = 50000;

var working = true; // if working == false then stop running.

function scrape(){

    if( !working )
      return;   

    if(index < finalIndex){
        // scrape code goes here ...
        console.log("finished scraping page number: ", index);
        index++;
        setTimeout(scrape); // schedule scrape for the next chunk (iteration)
                            // and return immediately
    }
}

// reset working variable in 60 seconds  
setTimeout( function() { working = false; }, 60000 );

scrape(); // start iterations

The scrape function above performs a single scrape action and, at the end, schedules itself for the next iteration.

Another timer sets the working variable to false after 60 seconds. This signals scrape to break out of the "loop" and stop.

c-smile
  • This is a clever workaround, but if the code started hanging before the timeout finished, it wouldn't restart properly. Also, I'd need to add a clause that sets working = true somewhere inside scrape; otherwise it wouldn't restart, it would just stop. – Phylth Oct 13 '16 at 18:06
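
Building on that comment: to restart rather than just stop, the same flag can be re-armed by a watchdog timer. A minimal sketch (the two-hour figure comes from the question; everything else is an assumption). Note that this only helps if scrape itself keeps returning; a nightmare call that hangs inside a promise still needs the timeout approach from the Promise.race answer above:

var RESTART_EVERY = 2 * 60 * 60 * 1000; // two hours, per the question

setInterval(function () {
    working = false;          // ask the current "loop" to stop
    setTimeout(function () {
        working = true;       // re-arm the flag...
        scrape();             // ...and restart from the current index
    }, 5000);                 // short grace period for the last chunk to finish
}, RESTART_EVERY);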

I think you cannot easily kill your function, but you can change the structure of your code a little. Maybe your code reaches Node's call-stack limit and stops because of that.

Try transforming your code into a for loop like this:

var finalIndex = 50000;
for (var index = 0; index < finalIndex; index++) {
  console.log("finished scraping page number: ", index);
  scrape();
}
Hugo David Farji
  • Would this involve multiple scrape calls working at the same time? I used a promise chain with recursion, and one reason is that the website can only be hit once at a time; otherwise the system admin automatically blocks my IP (which is a big problem, and something I don't want to risk). – Phylth Oct 13 '16 at 17:57
  • How can I check the `call stack` usage of Node? – Phylth Oct 13 '16 at 18:05
  • Oh, I didn't think that would be an issue, because your code already does that. I really don't know how to check it precisely, but the problem here is that you use a recursive function; no matter how much you increase the call stack, your function will probably exceed it. It would be better to try one of the other solutions (some with promises, probably). – Hugo David Farji Oct 14 '16 at 12:25
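
For completeness, a sequential variant of the loop idea discussed in these comments, so that only one request is in flight at a time. This is a sketch rather than the answer's original code; it assumes scrape is rewritten as scrape(i), returning a promise that resolves when page i has been scraped:

var finalIndex = 50000;

// Chain the scrapes one after another instead of firing them all at once.
var chain = Promise.resolve();
for (var index = 0; index < finalIndex; index++) {
    (function (i) {                      // capture the loop variable (pre-ES6 style)
        chain = chain.then(function () {
            console.log("scraping page number: ", i);
            return scrape(i);            // assumed to return a promise for page i
        });
    })(index);
}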