
I'm planning a webservice for my own internal use that takes one argument, a URL, and returns HTML representing the resolved DOM from that URL. By resolved I mean that the webservice will first fetch the page at that URL, then use PhantomJS to 'render' the page, and then return the resulting source after all DHTML, AJAX calls, etc. have executed. However, launching PhantomJS on a per-request basis (which I'm doing now) is far too sluggish. I would rather have a pool of PhantomJS instances with one always available to serve the latest call to my webservice.

Has any work been done on this kind of thing before? I'd rather base this webservice on the work of others than write a pool manager / http proxy server for myself from scratch.

More Context: I've listed the 2 similar projects that I've seen so far below and why I've avoided each one, resulting in this question about managing a pool of PhantomJS instances instead.

jsdom - From what I've seen it has great functionality for executing scripts on a page, but it doesn't attempt to replicate browser behaviour, so if I were to use it as a general-purpose "DOM resolver" there'd end up being a lot of extra coding to handle all kinds of edge cases, event calling, etc. The first example I saw was having to manually call the onload() function of the body tag for a test app I set up using Node. It seemed like the beginning of a deep rabbit hole.

Selenium - It just has so many more moving parts, so setting up a pool to manage long-lived browser instances will be more complicated than using PhantomJS. I don't need any of its macro recording / scripting benefits. I just want a webservice that is as performant at getting a webpage and resolving its DOM as if I were browsing to that URL with a browser (or even faster, if I can make it ignore images etc.).

Trindaz

6 Answers


I set up a PhantomJS cloud service, and it pretty much does what you are asking. It took me about five weeks of work to implement.

The biggest problem you'll run into is the known issue of memory leaks in PhantomJS. The way I worked around this is to recycle each instance every 50 calls.
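A minimal sketch of that recycling scheme (the RecyclingPool name and the injectable createInstance factory are my own illustration, not the service's actual code):

```javascript
// Sketch: recycle a PhantomJS child process after a fixed number of calls
// to work around the memory leaks. createInstance is whatever function
// spawns a phantomjs child process; it is injected so the recycling logic
// stays independent of how the binary is launched.
function RecyclingPool(createInstance, maxCalls) {
  this.createInstance = createInstance;
  this.maxCalls = maxCalls;          // e.g. 50, as described above
  this.instance = createInstance();
  this.calls = 0;
}

// Hand out the current instance, replacing it first if it is worn out.
RecyclingPool.prototype.acquire = function () {
  if (this.calls >= this.maxCalls) {
    if (this.instance.kill) this.instance.kill(); // terminate the leaky child
    this.instance = this.createInstance();        // and start a fresh one
    this.calls = 0;
  }
  this.calls++;
  return this.instance;
};
```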

The second biggest problem you'll run into is that per-page processing is very CPU- and memory-intensive, so you'll only be able to run about 4 instances per CPU.

The third biggest problem you'll run into is that PhantomJS is pretty wacky with page-finish events and redirects. You'll be informed that your page is finished rendering before it actually is. There are a number of ways to deal with this, but nothing 'standard', unfortunately.
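One common workaround (an assumption on my part, not necessarily what the service does) is to treat the page as finished only once the network has gone quiet. The counter below is runtime-agnostic; in a PhantomJS script you would wire it to page.onResourceRequested and page.onResourceReceived:

```javascript
// Track in-flight resource requests; fire onIdle when the count drops to 0.
// A real PhantomJS script should also add a short settle timeout, since
// redirects can briefly bring the count back above zero after it hits 0.
function NetworkIdleTracker(onIdle) {
  this.pending = 0;
  this.onIdle = onIdle;
}

NetworkIdleTracker.prototype.requestStarted = function () {
  this.pending++;
};

NetworkIdleTracker.prototype.requestFinished = function () {
  this.pending--;
  if (this.pending === 0) this.onIdle(); // network is quiet (for now)
};
```

In the PhantomJS environment, requestStarted would be called from page.onResourceRequested, and requestFinished from page.onResourceReceived (stage 'end') and page.onResourceError.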

The fourth biggest problem you'll have to deal with is interop between Node.js and PhantomJS. Thankfully, there are a lot of npm packages to choose from that deal with this issue.

So I know I'm biased (as I wrote the solution I'm going to suggest), but I suggest you check out PhantomJsCloud.com, which is free for light usage.

Jan 2015 update: Another (fifth?) big problem I ran into is how to send the request/response to and from the manager/load-balancer. Originally I was using PhantomJS's built-in HTTP server, but I kept running into its limitations, especially regarding maximum response size. I ended up writing the requests/responses to the local file system as the lines of communication. Total time spent on implementing the service and its issues is perhaps 20 man-weeks (roughly 1000 hours) of work. And FYI, I am doing a complete rewrite for the next version... (in progress)

JasonS
  • Great answer, Jason. It would be really nice if you could go ahead and tell us more about the implementation details. How do you manage all the instances, for example? Also, how do you launch the Phantom instances from Node itself? Any module recommendation to do so? Or do you spawn the processes? – Nobita Jun 15 '14 at 18:52
  • I do all the management from a nodejs 'router' app on the server. It launches multiple phantomjs.exe instances via the normal nodejs spawn process commands; nothing special in that regard, actually. I tried all the various phantomjs wrappers found on NPM, but frankly they mostly suck. I ended up just using phantomjs's built-in HTTP server to communicate to/from the nodejs router app. – JasonS Jun 19 '14 at 14:41
  • What about creating several webpage objects within one PhantomJS instance? Is there anything wrong with that? – Xsmael Jun 26 '16 at 21:12

The async JavaScript library works in Node and has a queue function that is quite handy for this kind of thing:

queue(worker, concurrency)

Creates a queue object with the specified concurrency. Tasks added to the queue will be processed in parallel (up to the concurrency limit). If all workers are in progress, the task is queued until one is available. Once a worker has completed a task, the task's callback is called.

Some pseudocode:

function getSourceViaPhantomJs(url, callback) {
  var resultingHtml = someMagicPhantomJsStuff(url);
  callback(null, resultingHtml);
}

var q = async.queue(function (task, callback) {
  // delegate to a function that should call callback when it's done
  // with (err, resultingHtml) as parameters
  getSourceViaPhantomJs(task.url, callback);
}, 5); // up to 5 PhantomJS calls at a time

app.get('/some/url', function(req, res) {
  q.push({url: params['url_to_scrape']}, function (err, results) {
    res.end(results);
  });
});

Check out the entire documentation for queue at the project's readme.

Michelle Tilley
  • Do you know how the queuing works in detail? I'm thinking it's calling multiple XHR requests in queue right? I'm looking for a solution which actually keeps the phantomjs processes running as a daemon, rather than spinning one up each time a task comes in. – CMCDragonkai Oct 01 '13 at 03:37
  • @CMCDragonkai The question mentions that "a pool of PhantomJS instances with one always available to serve the latest call to my webservice," which implies constantly running PhantomJS daemons, but this answer would work with either case. All the `async.queue` function does is make sure no more than a certain number of calls to the function are outstanding at any given time; what you do inside that function is up to you. – Michelle Tilley Oct 01 '13 at 03:41
  • You, my friend, almost 4 years later, have saved me quite the headache. – michaelgmcd Feb 19 '16 at 22:43

For my master's thesis, I developed the library phantomjs-pool, which does exactly this. It allows you to provide jobs which are then mapped to PhantomJS workers. The library handles job distribution, communication, error handling, logging, restarting, and more. It was successfully used to crawl more than one million pages.

Example:

The following code executes a Google search for the numbers 0 to 9 and saves a screenshot of each page as googleX.png. Four websites are crawled in parallel (due to the creation of four workers). The script is started via node master.js.

master.js (runs in the Node.js environment)

var Pool = require('phantomjs-pool').Pool;

var pool = new Pool({ // create a pool
    numWorkers : 4,   // with 4 workers
    jobCallback : jobCallback,
    workerFile : __dirname + '/worker.js', // location of the worker file
    phantomjsBinary : __dirname + '/path/to/phantomjs_binary' // either provide the location of the binary or install phantomjs or phantomjs2 (via npm)
});
pool.start();

function jobCallback(job, worker, index) { // called to create a single job
    if (index < 10) { // index is counted up for each job automatically
        job(index, function(err) { // create the job with index as data
            console.log('DONE: ' + index); // log that the job was done
        });
    } else {
        job(null); // no more jobs
    }
}

worker.js (runs in the PhantomJS environment)

var webpage = require('webpage');

module.exports = function(data, done, worker) { // data provided by the master
    var page = webpage.create();

    // search for the given data (which contains the index number) and save a screenshot
    page.open('https://www.google.com/search?q=' + data, function() {
        page.render('google' + data + '.png');
        done(); // signal that the job was executed
    });

};
Thomas Dondorf
  • This is a great library. I'm wondering, is there a way to detect when there are no more processes to be spawned? As in, waiting, via async or a promise, after `pool.start()` to do something once a series of processes has completed? – afithings Sep 07 '16 at 15:31
  • Thank you. Currently there is no way to do this as simple as with async. However, you can use the callback for each individual job (which fires when one job is done) and increase a counter that way. So you are still able to detect when all jobs are finished. – Thomas Dondorf Sep 15 '16 at 09:19
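Sketched out, the counter approach from the comment above could look like this (TOTAL_JOBS and the log line are illustrative; the jobCallback shape follows the example in the answer):

```javascript
// Count completed jobs inside each job's callback and react when the last
// one finishes. The counter is the only addition over the answer's example.
var TOTAL_JOBS = 10;
var finished = 0;

function jobCallback(job, worker, index) {
  if (index < TOTAL_JOBS) {
    job(index, function (err) {       // fires when this one job is done
      finished++;
      if (finished === TOTAL_JOBS) {
        console.log('all jobs done'); // e.g. trigger pool shutdown here
      }
    });
  } else {
    job(null);                        // no more jobs
  }
}
```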

As an alternative to @JasonS's great answer, you can try PhearJS, which I built. PhearJS is a supervisor for PhantomJS instances, written in Node.js, that provides an API via HTTP. It is available open source on GitHub.

TTT

If you are using Node.js, why not use selenium-webdriver?

  1. Run some PhantomJS instances in WebDriver mode: phantomjs --webdriver=port_number
  2. For each PhantomJS instance, create a PhantomInstance:

    var webdriver = require('selenium-webdriver'); // needed for the Builder below

    function PhantomInstance(port) {
        this.port = port;
    }
    
    PhantomInstance.prototype.getDriver = function() {
        var self = this;
        var driver = new webdriver.Builder()
            .forBrowser('phantomjs')
            .usingServer('http://localhost:' + self.port)
            .build();
        return driver;
    }
    
    

    and put all of them into one array: [phantomInstance1, phantomInstance2]

  3. Create a dispatcher.js that gets a free phantomInstance from the array and builds a driver from it:

    var driver = phantomInstance.getDriver();
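A hedged sketch of such a dispatcher (the busy/free bookkeeping is my own illustration; PhantomInstance is the constructor from step 2):

```javascript
// Hand out free PhantomInstance objects from a pool and take them back
// when a request is done with them.
function Dispatcher(instances) {
  this.free = instances;   // instances waiting for work
  this.busy = [];          // instances currently handed out
}

Dispatcher.prototype.acquire = function () {
  var instance = this.free.shift();
  if (!instance) return null;   // caller should queue or retry later
  this.busy.push(instance);
  return instance;              // caller then does instance.getDriver()
};

Dispatcher.prototype.release = function (instance) {
  var i = this.busy.indexOf(instance);
  if (i !== -1) this.busy.splice(i, 1);
  this.free.push(instance);
};
```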
    
Artjom B.
Shawn Liu
  • This is not a good way. Trust me... in my program I used selenium-webdriver but finally I gave it up! – J.Lyu Jun 02 '17 at 07:01

If you are using Node.js, you can use https://github.com/sgentle/phantomjs-node, which will allow you to connect an arbitrary number of PhantomJS processes to your main Node.js process and hence use async.js and lots of other Node goodies.

  • This is not true. If you create more than one instance of PhantomJS and run them at the same time, you get 'Error: listen EADDRINUSE'. I'm currently looking for a way to put the phantom instances on different ports, or whatever else is causing the EADDRINUSE. – RachelD Sep 12 '13 at 18:41
  • It is of course your responsibility to start the phantom instances so that they listen on different ports. – thisisnotadisplayname Mar 19 '15 at 09:51