2

There are sites whose DOM and contents are generated dynamically when the page loads (AngularJS-based sites are notorious for this).

What approach do you use? I tried both PhantomJS and jsdom, but it seems I am unable to get the page to execute its JavaScript before I scrape it.

Here's a simple jsdom example (not AngularJS-based, but still dynamically generated):

var env = require('jsdom').env;

exports.scrape = function(link) {
  var config = {
    url: link,
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'
    },
    done: jsdomDone // called once jsdom has loaded the page
  };

  env(config);
};

function jsdomDone(err, window) {
  if (err) {
    console.error(err);
  } else {
    var $ = require('jquery')(window);

    console.log($('.profilePic').attr('src'));
  }
}

exports.scrape('https://www.facebook.com/elcompanies');
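For what it's worth, old-style jsdom.env fetches external scripts by default but does not execute them; execution has to be switched on through the features option. A minimal sketch, assuming the pre-4.x jsdom API:

var env = require('jsdom').env;

env({
  url: 'https://www.facebook.com/elcompanies',
  features: {
    FetchExternalResources: ['script'],   // download external <script> files
    ProcessExternalResources: ['script']  // actually execute them
  },
  done: function(err, window) {
    if (err) return console.error(err);
    // selector from the question; may still be missing if the page
    // fills in its DOM via late AJAX calls jsdom never waits for
    console.log(window.document.querySelector('.profilePic'));
  }
});

Even with scripts enabled, done fires before any late AJAX-driven rendering, so this may still come up empty on a page like this one.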

I tried PhantomJS with moderate success.

var page = require('webpage').create();
var fs = require('fs');

page.onLoadFinished = function() {
  console.log("page load finished");
  // give dynamically injected content an extra 10 seconds to render
  window.setTimeout(function() {
    page.render('export.png');
    fs.write('1.html', page.content, 'w');
    phantom.exit();
  }, 10000);
};

page.open("https://www.facebook.com/elcompanies");

Here I wait for the onLoadFinished event and even add a 10-second timer. The interesting thing is that while my export.png image capture shows a fully rendered page, my 1.html doesn't contain the .profilePic element in its rightful place. It seems to be sitting inside some JavaScript code, surrounded by a block of the form "require("TimeSlice").guard(function() {bigPipe.onPageletArrive({...".
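To illustrate, a quick sanity check on the saved file (a sketch; it just distinguishes the selector appearing as real markup from it appearing inside the script payload):

var fs = require('fs');
var html = fs.readFileSync('1.html', 'utf8');

// does .profilePic appear as an actual rendered <img> tag?
console.log('as markup:', /<img[^>]*profilePic/.test(html));
// or only inside Facebook's BigPipe script payload?
console.log('in script payload:', html.indexOf('bigPipe.onPageletArrive') !== -1);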

If you can provide me a working example that scrapes the image off this page, that'd be helpful.

kane
  • Not sure why my previous comment got deleted. Is there a reason why my question is being downvoted? If I'm violating SO's terms or asking something I shouldn't be, I'd like to know – kane Jan 04 '16 at 18:31
  • You're not violating any terms, and the question is fine. Just note that by attaching a bounty to the question you're attracting more eyes to it, which usually results in more votes. In this case you just need to research when exactly the JavaScript is done executing on your target page, then figure out whether PhantomJS or jsdom will allow you to wait that long before scraping. jsdom, for example, has three events it can listen to, but I don't think any of them will work in your situation (you're already using the one that gets triggered last). – Kevin B Jan 06 '16 at 15:52
  • Does this answer your question? [How can I scrape pages with dynamic content using node.js?](https://stackoverflow.com/questions/28739098/how-can-i-scrape-pages-with-dynamic-content-using-node-js) – ggorlen Jul 02 '21 at 17:12

4 Answers

5

I've done some Facebook scraping using Nightmare.js.
Here is code I wrote to get content from some posts of a Facebook page.

var Nightmare = require('nightmare');

module.exports = function checkFacebook(callback) {
  var nightmare = Nightmare();
  Promise.resolve(nightmare
    .viewport(1000, 1000)
    .goto('https://www.facebook.com/login/')
    .wait(2000)
    // values from Node scope must be passed in as evaluate() arguments;
    // facebookEmail / facebookPwd are defined elsewhere (e.g. config)
    .evaluate(function(email, pwd) {
      document.querySelector('input[id="email"]').value = email
      document.querySelector('input[id="pass"]').value = pwd
      return true
    }, facebookEmail, facebookPwd)
    .click('#loginbutton input')
    .wait(1000)
    .goto('https://www.facebook.com/groups/bierconomia')
    .evaluate(function(){
      var posts = document.getElementsByClassName('_1dwg')
      var postsContent = []
      for(var i = 0; i < posts.length; i++){
        var pTag = posts[i].getElementsByTagName('p')
        var link = posts[i].querySelector('a[rel="nofollow"]')
        var img = posts[i].getElementsByClassName('_46-i img')[0]
        postsContent.push({
          content: pTag[0] ? pTag[0].innerText : '',
          productLink: link ? link.href : '',
          photo: img ? img.src : ''
        })
      }
      return postsContent
    }))
    .then(function(results){
      log(results) // log and extractLinkFromFb are helpers defined elsewhere
      var leanLinks = results.map(function(result){
        return {
          post: {
            content: result.content,
            productLink: extractLinkFromFb(result.productLink),
            photo: result.photo
          }
        }
      })
      callback(null, leanLinks)
    })
}

The thing I find useful with Nightmare is that the wait function can either wait a fixed number of milliseconds or wait until a specific selector renders.
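For example, a minimal sketch against the page from the question (the .profilePic selector is taken from there; it's an untested assumption that the element is reachable without logging in):

var Nightmare = require('nightmare');
var nightmare = Nightmare();

nightmare
  .goto('https://www.facebook.com/elcompanies')
  .wait('.profilePic') // blocks until the selector exists in the DOM
  .evaluate(function() {
    return document.querySelector('.profilePic').src;
  })
  .end()
  .then(function(src) {
    console.log(src);
  })
  .catch(function(err) {
    console.error(err);
  });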

Christian Saiki
  • I haven't tried nightmare, but it looks promising. I'll give it a shot – kane Jan 12 '16 at 23:30
  • Just a note, Nightmare isn't headless. It depends on Electron to run, therefore it can be kind of heavy in a production environment. – Max Baldwin Mar 16 '17 at 19:38
  • Yep I've given up using nightmare js. Now I'm using node horseman -> https://github.com/johntitus/node-horseman It was pretty easy to port the code to horseman – Christian Saiki Mar 16 '17 at 21:28
1

This happens because pages built on AJAX calls load their data asynchronously, so you can't rely on onLoad events (the data may still not be available when they fire).

In my personal opinion, the most reliable approach is to trace which REST services the page calls and make direct requests to them. Sometimes you will need values found in the HTML, or values returned by other calls.

I know this may sound complicated, and in fact it is: you need to debug the page and learn what is being called. But this approach will work reliably.

By the way, the Chrome developer tools will help with this task: just observe which calls are made in the Network tab. You can even inspect what was sent and received in each AJAX call.
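A minimal sketch of the idea; the endpoint, query parameter, and field name below are hypothetical placeholders for whatever actually shows up in the Network tab:

var https = require('https');

// hypothetical endpoint discovered via DevTools; replace with the real one
https.get('https://example.com/api/profile?id=elcompanies', function(res) {
  var body = '';
  res.on('data', function(chunk) { body += chunk; });
  res.on('end', function() {
    var data = JSON.parse(body);      // most such services return JSON
    console.log(data.profilePicUrl);  // hypothetical field name
  });
}).on('error', console.error);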

David Rissato Cruz
  • This might work for specific sites, but I need a general approach that renders the dynamically generated page before I scrape it – kane Jan 12 '16 at 23:24
  • So you need to use phantom if you want to execute js. I'll comment on your question because I've seen a problem there – David Rissato Cruz Jan 12 '16 at 23:30
0

If it is a one-time thing, that is, if I just want to scrape a single page once, I just use the browser and artoo.js.
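A sketch of that workflow, run in the DevTools console after injecting the artoo.js bookmarklet (http://medialab.github.io/artoo/); my assumption here is artoo's {attr: ...} retriever shorthand, and .profilePic is the selector from the question:

// paste into the browser console once artoo is loaded
var srcs = artoo.scrape('.profilePic', {attr: 'src'});
console.log(srcs);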

0

I have never tried to write a page to disk using PhantomJS, but I have two observations:

1) You are using fs.write to write things to disk, but that is an asynchronous call. This means you either need to change it to fs.writeFileSync or wait for a callback before closing phantom.

2) I hope you aren't expecting to write the HTML to a file, open it in a browser, and get it rendered the way the saved PNG looks, because it doesn't work that way. Some objects can be stored directly in DOM properties, and there are certainly values stored in JavaScript variables; those things will never be persisted.

David Rissato Cruz
  • Re (1) The fs.write is not an issue. The html file is being written. Re (2) I was hoping to see the same DOM I see when I inspect the page. When I open the saved html, it renders correctly on my browser, but when I open the html in a notepad, it doesn't show the same DOM – kane Jan 13 '16 at 03:29