
System: Windows 8.1 64-bit with the binary from the main page, version 2.0

I have a .txt file with one URL per line. I read every line, open the page, and search for a specific url.match (domain changed for privacy reasons in the code) - if found, I print the found JSON, abort the request, and unload the page. My .txt file contains 12500 links; for testing purposes I split it into the first 10/100/500 URLs.

Problem 1: If I try 10 URLs, it prints 9 and uses 40-50% CPU afterwards

Problem 2: If I try 100 URLs, it prints 98, uses 40-50% CPU afterwards for whatever reason, then crashes after 2-3 minutes

Problem 3: The same goes for 98 links (it prints 96, uses 40-50% CPU, then crashes too) and for 500 links

TXT files: https://www.dropbox.com/s/eeiy12ku5k15226/sitemaps.7z?dl=1

Crash dumps for 98, 100 and 500 links: https://www.dropbox.com/s/ilvbg8lv1bizjti/Crash%20dumps.7z?dl=1

console.log('Hello, world!');
var fs = require('fs');
var stream = fs.open('100sitemap.txt', 'r');
var line = stream.readLine();
var webPage = require('webpage');
var i = 1;

while (!stream.atEnd() || line != "") {
    //console.log(line);
    var page = webPage.create();
    page.settings.loadImages = false;
    page.open(line, function() {});
    //console.log("opened " + line);
    page.onResourceRequested = function(requestData, request) {
        //console.log("BEFORE: " +requestData.url);
        var match = requestData.url.match(/example.com\/ac/g)
        //console.log("Match: " + match);
        //console.log("Line: " + line);
        //console.log("Match: " + match);
        if (match != null) {
            var targetString = decodeURI(JSON.stringify(requestData.url));
            var klammerauf = targetString.indexOf("{");
            var jsonobjekt = targetString.substr(klammerauf, (targetString.indexOf("}") - klammerauf) + 1);
            targetJSON = (decodeURIComponent(jsonobjekt));
            console.log(i);
            i++;
            console.log(targetJSON);
            console.log("");
            request.abort();
            page.close();
        }
    };
    var line = stream.readLine();
}

//console.log("File closed");
//stream.close();

1 Answer


Concurrent Requests

You really shouldn't be loading pages in a loop, because a loop is a synchronous construct whereas page.open() is asynchronous. If you do, memory consumption skyrockets, because all URLs are opened at essentially the same time. That already becomes a problem with 20 or more URLs in the list.

Function-level scope

The other problem is that JavaScript has function-level scope. That means that even though you define the page variable inside the while block, it is effectively global. Because it is global, it clashes with the asynchronous nature of PhantomJS: the page referenced inside the page.onResourceRequested callback is very likely not the same page that opened the URL which triggered the callback. See more on that here. A common solution would be to use an IIFE to bind the page variable to a single iteration, but you really need to rethink your whole approach.
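
For illustration, a minimal sketch of that IIFE pattern applied to the loop from the question (it only fixes the scoping problem; all pages would still be opened at once):

while (!stream.atEnd() || line != "") {
    (function(line) {
        // page and line are now bound to this iteration only
        var page = webPage.create();
        page.settings.loadImages = false;
        page.onResourceRequested = function(requestData, request) {
            // requestData.url now reliably belongs to the page opened below
            /* ... match, print and abort as in the question ... */
        };
        page.open(line, function() {});
    })(line);
    line = stream.readLine();
}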

Memory-leak

You also have a memory leak: when the URL in the page.onResourceRequested event doesn't match, you neither abort the request nor clean up the page instance. You probably want to do that for all URLs, not just the ones that match your specific regex.
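
As a rough sketch (assuming the same page, line and regex as in the question), the cleanup could look like this: abort every further request once a match has been found, and always close the page when loading finishes, whether or not anything matched:

var found = false;

page.onResourceRequested = function(requestData, request) {
    if (found) {
        // the JSON was already captured, don't load anything else
        request.abort();
        return;
    }
    if (requestData.url.match(/example.com\/ac/g)) {
        found = true;
        /* ... extract and print the JSON as before ... */
        request.abort();
    }
};

page.open(line, function(status) {
    // always release the page, even if no resource matched
    page.close();
});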

Easy fix

A quick solution would be to define a function that does one iteration and calls the next iteration when the current one has finished. You can also re-use one page instance for all requests.

var page = webPage.create();

function runOnce(){
    // stop when the whole file has been processed
    if (stream.atEnd()) {
        phantom.exit();
        return;
    }
    var url = stream.readLine();
    if (url === "") {
        phantom.exit();
        return;
    }

    page.open(url, function() {});

    page.onResourceRequested = function(requestData, request) {
        /**...**/

        request.abort();

        // move on to the next URL once this request has been handled
        runOnce();
    };
}

runOnce();
  • "A fast solution would be to define a function that does one iteration and call the next iteration when the current one finished" - that is sadly not possible, it needs to be asynchronous because of the high number of URLs I need to check. – Vega Sep 09 '15 at 11:30
  • It is asynchronous. If you want to make it faster, then you would need to implement a pool of pages. Simply starting a tab for every URL you've got is not feasible. – Artjom B. Sep 09 '15 at 11:48
  • How could I implement such page pool? – Vega Sep 09 '15 at 12:29
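
A sketch of one possible page pool (not from the original thread; the pool size and helper names are made up for illustration): keep a fixed number of page instances, and have each one pull the next URL from the stream as soon as it has finished its current one:

var webPage = require('webpage');
var fs = require('fs');
var stream = fs.open('100sitemap.txt', 'r');

var POOL_SIZE = 4; // number of concurrent tabs; tune to available memory
var activeWorkers = 0;

function nextUrl() {
    // returns null once the file is exhausted
    if (stream.atEnd()) {
        return null;
    }
    var url = stream.readLine();
    return url === "" ? null : url;
}

function startWorker() {
    var page = webPage.create();
    page.settings.loadImages = false;
    activeWorkers++;

    function processNext() {
        var url = nextUrl();
        if (url === null) {
            // this worker is done; exit when the last one finishes
            page.close();
            activeWorkers--;
            if (activeWorkers === 0) {
                phantom.exit();
            }
            return;
        }
        page.onResourceRequested = function(requestData, request) {
            /* ... match and print the JSON as in the question ... */
        };
        page.open(url, function() {
            processNext();
        });
    }

    processNext();
}

for (var w = 0; w < POOL_SIZE; w++) {
    startWorker();
}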