-1

I'm scraping a social network with cheerio and Meteor. I can log in, search for some information, and scrape the page for the info I want. I'm making requests and passing the HTML to cheerio, as in "Scraping with Meteor.js".

The problem is that there is a section of the page that only appears when I load the page in a web browser:

In browser:

<div A>
    <div B>
        <ul (...)>
            <li (...)>...</li>
            ...
            <li (...)>...</li>
        </ul>
    </div> <!-- end B -->
    <script id="NAME_1" type="fs/embed+m"></script>
    <script type="text/javascript">fs.dupeXHR("NAME_1","NAME_2",{"renderControl":"custom","templateId":"NAME_1"});</script>
</div> <!-- end A -->

In console.log(cheerio.load(html)):

<div A>
    <script id="NAME_1" type="fs/embed+m"></script>
    <script type="text/javascript">fs.dupeXHR("NAME_1","NAME_2",{"renderControl":"custom","templateId":"NAME_1"});</script>
</div> <!-- end A -->

I suppose cheerio loads the HTML without executing the scripts. Am I right? If so, is there some way to make cheerio execute the scripts so I can scrape the page after the content is inserted?

I'm making HTTP requests with the following options to simulate a browser request, so I don't think the problem is the request itself (headless browsers don't make it any better).

Options = function (cookie) {
  this.headers = {
    "Accept": "*/*",
    "Connection": "keep-alive",
    "User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.132 Safari/537.36"
  };
  this.params = {};
  if (cookie) {
    this.headers.Cookie = cookie.get();
  }
};
rnmss

3 Answers

5

You need to consider a few things when scraping.

Modern sites often use newer frameworks like Angular or EmberJS, and their HTML is rendered by JavaScript in the browser. (Right-click in the browser window and choose "View Page Source": you see bare HTML without the rendered content.)

The same is true of Meteor apps.

So for these types of sites you need to use a headless browser like PhantomJS or ZombieJS to fetch the rendered HTML and then scrape that.

Hope this helps

ajduke
  • Thanks for trying, but since I'm making requests simulating a web browser I think that headless browsers don't help at all. – rnmss Nov 17 '14 at 18:29
  • @rnmss in your code example you are (probably?) loading only 1 resource "simulating a web browser", while headless browsers like `PhantomJS` are able to simulate a complete browser workflow, including loading and executing scripts and waiting for their completion. This is not something that `cheerio` does. So I think that there is a difference. You should be able to see it by comparing the network traffic when a browser opens the page with the traffic when your code opens the page. See also http://stackoverflow.com/questions/11340038/phantomjs-not-waiting-for-full-page-load – xmojmr Nov 17 '14 at 20:05
0

Well, I did some reverse engineering and found that the section that doesn't load can be retrieved by making a request to another page, using the same headers and other options. Although Meteor.js uses Node.js behind the scenes, maybe the answers are right and this cannot be done the way I thought it could. Who knows (:
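
One way to spot such sections from the server side is to look for the loader calls that remain in the statically fetched HTML. This is only a minimal sketch, assuming the calls look like the `fs.dupeXHR(...)` line in the question. The function name `findDeferredEmbeds` and the field names `first`, `second` and `config` are made up for illustration. What the arguments mean to the site is not known from this page, and the actual URL to request still has to be read from the browser's network traffic.

```javascript
// Sketch: list the deferred-embed calls left in statically fetched HTML.
// Assumption: the site inserts content through calls shaped like
// fs.dupeXHR("<string>", "<string>", {...JSON...}), as in the question.
function findDeferredEmbeds(html) {
  var pattern = /fs\.dupeXHR\(\s*"([^"]*)"\s*,\s*"([^"]*)"\s*,\s*(\{[^)]*\})\s*\)/g;
  var embeds = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    var config = null;
    try {
      config = JSON.parse(match[3]);
    } catch (e) {
      // Leave config as null when the third argument is not strict JSON.
    }
    // "first" and "second" are hypothetical names for the two string arguments.
    embeds.push({ first: match[1], second: match[2], config: config });
  }
  return embeds;
}
```

Each entry returned marks a section that the browser fills in later, so it probably needs its own request, made with the same headers and cookie as the first one.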

rnmss
0

You are correct that your method only gets the HTML without executing the JavaScript. To achieve what you want, consider using packages such as CasperJS or PhantomJS. Here is an example: a Meteor method that spawns PhantomJS, followed by the PhantomJS driver script that it runs:

// Server-side Meteor code: run a PhantomJS script as a child process.
var phantomjs = Npm.require('phantomjs');
var spawn = Npm.require('child_process').spawn;
Meteor.methods({
  runTest: function (options) {
    // Declared with var so it does not leak into the global scope.
    var command = spawn(phantomjs.path, ['assets/app/phantomDriver.js']);
    // Forward whatever the PhantomJS script prints to the server log.
    command.stdout.on('data', function (data) {
      console.log('stdout: ' + data);
    });
    command.stderr.on('data', function (data) {
      console.log('stderr: ' + data);
    });
    command.on('exit', function (code) {
      console.log('child process exited with code ' + code);
    });
  }
});


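// phantomDriver.js: executed by PhantomJS, not by Node or Meteor.
var page = require('webpage').create();
page.open('http://github.com/', function() {
    console.log('Page Loaded');
    // Save a screenshot of the page as rendered by the headless browser.
    page.render('github.png');
    phantom.exit();
});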
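
The driver above saves a screenshot, but for scraping you want the HTML after the scripts have run. One option is a driver that writes `page.content` to stdout instead of calling `page.render`, with the server capturing that output and handing it to cheerio. Below is a rough sketch of the capturing side only. It assumes a Node version that provides `child_process.execFileSync`, and `captureStdout` is a hypothetical helper name. In Meteor you would load the module with `Npm.require` rather than `require`.

```javascript
var childProcess = require('child_process');

// Run a command synchronously and return everything it wrote to stdout.
// Assumes a Node version that provides child_process.execFileSync.
// Throws if the command exits with a non-zero code or times out.
function captureStdout(command, args, timeoutMs) {
  return childProcess.execFileSync(command, args, {
    encoding: 'utf8',              // return a string instead of a Buffer
    timeout: timeoutMs || 30000,   // give up if the page never finishes loading
    maxBuffer: 10 * 1024 * 1024    // rendered pages can exceed the default buffer
  });
}
```

With the phantomjs package from the snippet above, the call would look something like `captureStdout(phantomjs.path, ['assets/app/phantomDriver.js'])`, and the returned string can then go to `cheerio.load`. The driver must not print anything else (such as the 'Page Loaded' message), or that text ends up mixed into the HTML.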

References:

http://www.meteorpedia.com/read/PhantomJS

https://atmospherejs.com/gadicohen/phantomjs

FullStack