Use PhantomJS to extract html and text

Question

I'm trying to extract all the text content of a page with PhantomJS (because it doesn't work with Simpledomparser).

I tried to modify this simple example from the manual:

var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementById('myagent').textContent;
        });
        console.log(ua);
    }
    phantom.exit();
});

I tried changing

return document.getElementById('myagent').textContent;

to

return document.textContent;

This doesn't work.

What's the right way to do this simple thing?

score 4 · Answer 1 · answered Aug 29 '13 at 23:17

This version of your script should return the entire contents of the page:

var page = require('webpage').create();
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].outerHTML;
        });
        console.log(ua);
    }
    phantom.exit();
});

score 2 · Answer 2 · answered Jan 06 '15 at 10:06

There are multiple ways to retrieve the page content as a string:

page.content gives the complete source including the markup (<html>) and doctype (<!DOCTYPE html>),
document.documentElement.outerHTML (via page.evaluate) gives the complete source including the markup (<html>), but without doctype,
document.documentElement.textContent (via page.evaluate) gives the cumulative text content of the complete document including inline CSS & JavaScript, but without markup,
document.documentElement.innerText (via page.evaluate) gives the cumulative text content of the complete document excluding inline CSS & JavaScript and without markup.

document.documentElement can be replaced with any element or query of your choice.
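The four options above can be compared side by side. This is only a minimal sketch, assuming the same test URL as in the question; `extractAll` is a hypothetical helper name, not part of PhantomJS, and the `typeof phantom` guard is there only so the helper can be loaded outside PhantomJS without errors.

```javascript
// Hypothetical helper (not a PhantomJS API): collects the four string
// forms described above from an already opened PhantomJS page object.
function extractAll(page) {
    return {
        // Full source, including the doctype
        source: page.content,
        // Markup without the doctype
        html: page.evaluate(function () {
            return document.documentElement.outerHTML;
        }),
        // All text, including inline CSS and JavaScript
        text: page.evaluate(function () {
            return document.documentElement.textContent;
        }),
        // Text without inline CSS and JavaScript
        renderedText: page.evaluate(function () {
            return document.documentElement.innerText;
        })
    };
}

// Run the demo only under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.httpuseragent.org', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var result = extractAll(page);
            console.log('source length:       ' + result.source.length);
            console.log('outerHTML length:    ' + result.html.length);
            console.log('textContent length:  ' + result.text.length);
            console.log('innerText length:    ' + result.renderedText.length);
        }
        phantom.exit();
    });
}
```

Printing the lengths rather than the strings themselves makes it easy to see how much each variant includes; replace the `console.log` calls with the string you actually need.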

score 1 · Answer 3 · answered Aug 27 '13 at 06:20

To extract the text content of the page, you can try `return document.body.textContent;`, but I'm not sure the result will be usable.

Cybermaxs

Hi, I tried it but it returns NULL – Jay Romuald Aug 27 '13 at 07:58

score 0 · Answer 4 · edited May 23 '17 at 12:25

Having encountered this question while trying to solve a similar problem, I ended up adapting a solution from this question like so:

var fs = require('fs');
var file_h = fs.open('header.html', 'r');
var header = "";

// Append every line, including the first, until the end of the file
while (!file_h.atEnd()) {
    header += file_h.readLine();
}
console.log(header);

file_h.close();
phantom.exit();

This gave me a string containing the HTML file that was read in, which was sufficient for my purposes and will hopefully help others who come across this.

The question seemed ambiguous (was it the entire content of the file required, or just the "text" aka Strings?) so this is one possible solution.

You don't need to use the streaming API for simply reading a file. Just use `var header = fs.read('header.html')`. — Artjom B., Jan 06 '15 at 10:42
