0

Recently, I have been experimenting with the phantomjs-node library. The thing I wanted to achieve was basically to create a dynamic web-page template, employ the phantomjs-node library to "run" it, and finally extract some data from the rendered page.

In the simplest setting, the first attempt to approach this looked like this (in the example below, the template is just static, nevertheless it might contain in principle some further logic utilizing external libraries, etc.):

var phantom = require('phantom');
var co = require('co');
var sleep = require('system-sleep');
var winston = require('winston');

const logger = new winston.Logger({
    level: 'debug',
    transports: [new winston.transports.Console({
        json: false, timestamp: () => (new Date()).toLocaleString()
    })]
});

co(function*() {
    logger.info('start');
    var instance = yield phantom.create();   
    try {
        const html = `
                <!DOCTYPE html>
                <html>
                    <head>
                        <title>Page title</title>
                    </head>
                    <body>
                        <div id='results'>Page data</div>
                    </body>
                </html>
            `;

        var page = yield instance.createPage();    

        yield page.on('onLoadFinished', function(){
            logger.info('onLoadFinished');

            page.evaluate(function(){
                return document.getElementById('results').textContent;    
            }).then(function(val){
                logger.info(`RESULT = ${val}`);    
            }).catch(function(val){
                logger.error(val.message);    
            });
        });

        yield page.setContent(html, 'http://localhost');

    }catch (e){
        logger.error(e.message);       
    }finally{
        instance.exit();
    }
    logger.info('done');
});

However, this fails with the output:

12/18/2017, 2:44:32 PM - info: start
12/18/2017, 2:44:33 PM - info: done
12/18/2017, 2:44:33 PM - info: onLoadFinished
12/18/2017, 2:44:33 PM - error: Phantom process stopped with exit code 0

most likely because when the then-callback of the promise returned by page.evaluate is finally invoked, the main phantom process has already exited.

In order to "fix" this, I resorted to the following improvised strategy (omitting the rest of the example below):

    var page = yield instance.createPage();

    var resolver;
    var P = new Promise(function(resolve, reject){ resolver = resolve; });

    yield page.on('onLoadFinished', function(){
        logger.info('onLoadFinished');

        resolver(page.evaluate(function(){
            return document.getElementById('results').textContent;
        }));
    });

    yield page.setContent(html, 'http://localhost');

    const val = yield P;
    logger.info(`RESULT = ${val}`);

This essentially creates a new promise which is "externally" resolved with the promise returned from page.evaluate. The yield P statement at the end of the co block then blocks until the required result is ready, thus the output is as expected:

12/18/2017, 2:53:47 PM - info: start
12/18/2017, 2:53:48 PM - info: onLoadFinished
12/18/2017, 2:53:48 PM - info: RESULT = .....
12/18/2017, 2:53:48 PM - info: done

Although this seems to work, it feels quite "hacky" (for example exceptions thrown in the callback before the invocation of resolver won't be detected in the main try/catch block), so I was wondering what would be a cleaner approach in order to "transfer" control from the onLoadFinished callback back into the realm managed by co?

ewcz
  • 12,819
  • 1
  • 25
  • 47

1 Answers1

2
  • Don't use co + generator functions any more. async/await is here.
  • Yes, you should transform all event callbacks that fire (at most) once into promises.
  • No, don't ever create promises like that and "resolve them externally". Just put the stuff that resolves them inside the promise constructor.

(async function() {
    logger.info('start');
    var instance = await phantom.create();   
    try {
        const html = `…`;
        const page = await instance.createPage();    

        await new Promise((resolve, reject) => {
            page.on('loadFinished', resolve);
            page.on('resourceError', reject); // or something like that?
            page.setContent(html, 'http://localhost'); // this doesn't appear to return a promise
        })
        logger.info('onLoadFinished');

        try { // this extra inner try looks superfluous
            const val = await page.evaluate(function(){
                return document.getElementById('results').textContent;    
            });
            logger.info(`RESULT = ${val}`);
        } catch(e) {
            logger.error(e.message);    
        }
    } catch(e) {
        logger.error(e.message);       
    } finally {
        instance.exit();
    }
    logger.info('done');
}());
Bergi
  • 630,263
  • 148
  • 957
  • 1,375
  • thank you for the detailed comments! Would it be necessary to nest another `async` function inside the `Promise` constructor? As far as I understand, the `page.on` call is asynchronous (or in principle returns another promise) so if it indeed were, it could in principle happen that `page.setContent` executes and finishes before the `loadFinished` callback is set? – ewcz Dec 18 '17 at 20:40
  • @ewcz No. [Never pass an `async function` to the `new Promise` constructor](https://stackoverflow.com/a/43083793/1048572)! `page.on` does not seem to return a promise, that's why we are wrapping it (to promisify it). Also the *callback* is asynchronous, which means that [it always executes after `setContent` call](https://stackoverflow.com/a/31099819/1048572) (also because it's caused by it, so it can't happen before). – Bergi Dec 18 '17 at 21:14
  • I understand that the callback itself is initiated with the content being set, I meant rather the operation of assigning/registering the callback via `page.on` - as far as I understand the [source](https://github.com/amir20/phantomjs-node/blob/master/src/page.js) it is basically calling the [execute](https://github.com/amir20/phantomjs-node/blob/master/src/phantom.js) function which does return a promise (if I print to the console the return value of `page.on`, I get `Promise { }`) – ewcz Dec 18 '17 at 22:12
  • so my concern was that `page.on('loadFinished', resolve);` does register the callback (which resolves the promise once called) but this registration might happen after the content has been set. Unfortunately, it does seem to be the case, if I run your code, it just prints the `start` message and the promise does not seem to be ever resolved... – ewcz Dec 18 '17 at 22:17
  • or in other words that `page.on` constructs a promise which merely promises to register the callback (at some point) - that's why I used the `yield page.on` in order to wait until the callback is set so that the content can be set safely afterwards... – ewcz Dec 18 '17 at 22:46
  • Omg, that is sooo convoluted… Notice I changed `onLoadFinished` to `loadFinished`, maybe that's the reason? Registering callbacks asynchronously is horrible, but still the asynchronous invocations should happen in order, so I don't think `setContent` would do anything before the callback is registered. – Bergi Dec 18 '17 at 23:21
  • thanks for bearing with me! :) sorry, I didn't notice the change, `onLoadFinished` works - I was afraid that `page.setContent` is synchronous while the `page.on`s are not, but it seems to be on the same footing. However, the fact that the library assigns callbacks via promises seems to be rather peculiar... – ewcz Dec 19 '17 at 09:43