I'm trying to scrape content from a page that renders some of its elements after load, using a script like this:

var URL = '/';
var IMG = 'http://img.site.com/';
for(var k=0;k<thumb.length;k++)
{
    html = html +
        '<div class="thumbnail"><a href="' + URLthumb + tumb_id[k] + '">' +
        '<img class="th" src="' + IMG + thumb[k] + '" /></a></div>';
}
document.write(html);

I'm currently loading this page with:

var system = require('system');
var page = require('webpage').create();
page.open('http://example.com/search?q=some+query+goes+here', function() {
    var title = page.evaluate(function() {
        return document.documentElement.outerHTML;
    });
    system.stdout.writeLine(title);
    phantom.exit();
});

document.documentElement.outerHTML returns the page pre-render no matter how long I wait. What is the best object to get the page content post render?

What am I doing wrong?

Wesley
  • Did you try throwing in a `setTimeout` to wait for it to render? – tells Sep 23 '15 at 16:32
  • Not yet. I'd like the page to load as fast as possible, and having it always wait 1000 ms seems like a brute force workaround. – Wesley Sep 23 '15 at 16:45
  • If you know the variables and script being used to generate the DOM elements, you could just generate them yourself. Otherwise, wouldn't you need to wait for the JS to run before scraping the site? – tells Sep 23 '15 at 16:55
  • I assume so, I'm just wondering if there's an event I can use instead of just checking back every {n} milliseconds. – Wesley Sep 23 '15 at 16:56
  • Waiting doesn't seem to help. Is `document.documentElement.outerHTML` the wrong choice for output? Is it like `View Source` and will not reflect the post processing view of the page? @tells – Wesley Sep 23 '15 at 17:12
  • possible duplicate of [phantomjs not waiting for "full" page load](http://stackoverflow.com/questions/11340038/phantomjs-not-waiting-for-full-page-load) – Artjom B. Sep 23 '15 at 17:24
  • It looks like you need to pass in a selector in your `evaluate` callback. However, you might be right, since the docs state: "The execution is sandboxed, the web page has no access to the phantom object and it can't probe its own settings." – tells Sep 23 '15 at 17:25
  • @tells `document.documentElement.outerHTML` is definitely part of the problem. If I run a `console.log`, the content is the same as View Source. I need to change my selector. – Wesley Sep 23 '15 at 17:29
  • There is no event after page load that you can listen for. There are a lot of ways to do this, but they partly depend on your page. The easiest would be to wait for a condition to become true, such as the appearance of an element matching a specific selector. – Artjom B. Sep 23 '15 at 17:30
  • @ArtjomB. @tells Looks like my selector is the main problem, and I have updated my question to reflect that. I use this method on many sites, so specific selectors may be worthless. Should I use `getElementsByTagName('html')` or some such method to get the right content? – Wesley Sep 23 '15 at 17:31
  • @Wesley No, that won't make a difference. You need to wait for the "full" page load. Also, `document.documentElement.outerHTML` is not a selector. It's direct DOM access. Selectors are for example CSS selectors as in `document.querySelector(selector)` – Artjom B. Sep 23 '15 at 18:55
  • @ArtjomB. Since I'm a terrible listener, I gave it a shot and it worked. `outerHTML` shows the page as it first loaded, but `getElementsByTagName('html')` did the trick. – Wesley Sep 23 '15 at 19:37
  • Shouldn't make a difference. Also, I see no reason to use `page.evaluate()` to go into the page context. You can just use `page.content`. – Artjom B. Sep 23 '15 at 19:39
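
The two suggestions in the comments above (wait for a condition such as a selector appearing, then read `page.content`) could be combined roughly as follows. This is only a sketch under assumptions: `waitFor` and `contentWhenPresent` are hypothetical helper names, not part of PhantomJS; the `div.thumbnail` selector is taken from the markup the page script in the question writes; the 5000 ms timeout and 50 ms polling interval are arbitrary. Only `page.evaluate`, `page.content`, `page.open`, and `phantom.exit` are PhantomJS calls.

```javascript
// Hypothetical helper: poll `test` every `intervalMs` until it returns true
// or `timeoutMs` has elapsed, then call `done(ok)`.
function waitFor(test, done, timeoutMs, intervalMs) {
    var start = Date.now();
    (function poll() {
        if (test()) {
            done(true);
        } else if (Date.now() - start >= timeoutMs) {
            done(false);
        } else {
            setTimeout(poll, intervalMs);
        }
    })();
}

// Hypothetical helper: call `done(html)` once `selector` matches something in
// the page, or `done(null)` on timeout. `page` only needs `evaluate` and
// `content`, which a PhantomJS webpage object provides.
function contentWhenPresent(page, selector, done, timeoutMs) {
    waitFor(function () {
        // Runs inside the page context; the selector is passed in as an argument.
        return page.evaluate(function (sel) {
            return document.querySelector(sel) !== null;
        }, selector);
    }, function (ok) {
        done(ok ? page.content : null);
    }, timeoutMs, 50); // 50 ms polling interval is an assumption
}

// This part only runs under PhantomJS; the URL is the one from the question.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://example.com/search?q=some+query+goes+here', function () {
        contentWhenPresent(page, 'div.thumbnail', function (html) {
            require('system').stdout.writeLine(html === null ? 'timed out' : html);
            phantom.exit();
        }, 5000);
    });
}
```

Polling returns as soon as the element shows up, so it avoids the fixed 1000 ms delay discussed earlier in the comments. For a script that is reused across many sites, the selector would have to be supplied per site, or replaced by some other condition.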
