I'm trying to get data from a live score site. I am using node.js with express.js, request.js and cheerio.js to get the HTML from a web page. It works for some parts of the HTML, but not the live parts.

I'm trying to scrape data from the web site http://www.flashresultats.com. When I use the Chrome Developer Tools I'm able to see the HTML content, but when I use my JavaScript code, the result is empty.

Here is the Chrome capture of what I am trying to extract:

[Screenshot: HTML scrape]

And here is the code I am using:

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app = express();

// Declare url with var so it is not an implicit global
var url = 'http://www.flashresultats.fr';

request(url, function(error, response, html){
    if(!error){
        var $ = cheerio.load(html);
        var myvar = $('#g_1_UJzOgxfc').html();
        console.log(myvar);
    }
    else {
        console.log('Error');
    }
})
Maximillian Laumeister
Mouette
  • Have you tried `console.log(html)` to see whether the whole HTML is empty, or whether it just doesn't have an element with the `g_1_UJzOgxfc` id? – Buzinas Nov 03 '15 at 13:06
  • On first impression it seems that the data is loaded asynchronously on the original site, so that explains why it's not in the source of the page. You would have to find out the source of the asynchronously loaded data, and then directly load / scrape from there. – Wouter Nov 03 '15 at 13:08
  • `console.log(html)` displays the HTML of the page, but the scores don't appear in it. Based on the screenshot, the ID `g_1_UJzOgxfc` exists in the HTML. – Mouette Nov 03 '15 at 13:15
  • You have to use a scraper that can handle asynchronous dynamic loading of content. Or look at the Ajax calls they are making to get the content and reverse engineer them. – epascarello Nov 03 '15 at 13:19
  • http://stackoverflow.com/questions/28739098/how-can-i-scrape-pages-with-dynamic-content-using-node-js – epascarello Nov 03 '15 at 13:20

1 Answer

If you view the source of your site (view-source:http://www.flashresultats.fr/), press Ctrl+F, and search for the g_1_UJzOgxfc node, you will not find it. It is almost certainly generated by JavaScript after the initial document has loaded. That is why you don't get it by sending a simple request.
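You can do the same check from code before reaching for a heavier tool. Here is a rough sketch, assuming you only want to know whether an id shows up in the raw response body. The helper name `hasElementId` and both sample HTML strings are made up for illustration, and the regular expression is a quick test, not a real HTML parser:

```javascript
// Return true if the raw HTML string contains an attribute id="<id>"
// (double-quoted, single-quoted, or unquoted). Hypothetical helper.
function hasElementId(html, id) {
  // Escape regex metacharacters so the id is matched literally
  var escaped = id.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var pattern = new RegExp(
    '\\s[iI][dD]\\s*=\\s*(["\']?)' + escaped + '\\1(?=[\\s>/]|$)'
  );
  return pattern.test(html);
}

// Made-up samples: what a plain request might return vs. what the browser shows
var staticHtml = '<div id="fs"><div class="loading"></div></div>';
var renderedHtml = '<div id="fs"><table><tr id="g_1_UJzOgxfc"><td>1 - 0</td></tr></table></div>';

console.log(hasElementId(staticHtml, 'g_1_UJzOgxfc'));   // false
console.log(hasElementId(renderedHtml, 'g_1_UJzOgxfc')); // true
```

If this returns false for the `html` passed to your `request` callback, the node is being added later by a script and cheerio alone will never see it.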

So, to get elements that are created dynamically, you need to run the JavaScript embedded in the body received from your request. You can use the PhantomJS bridge module to do that:

var phantom = require('phantom');

phantom.create(function (ph) {
  ph.createPage(function (page) {
    page.open("http://www.flashresultats.fr", function (status) {
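      // DOM nodes cannot be passed back from the page, so return the element's HTML as a string
      page.evaluate(function () {
        var el = document.getElementById('g_1_UJzOgxfc');
        return el ? el.innerHTML : null;
      }, function (result) {
        console.log('g_1_UJzOgxfc element is: ' + result);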
        ph.exit();
      });
    });
  });
});
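One caveat: since the scores are loaded asynchronously, the `page.open` callback may fire before the node exists, in which case the lookup comes back empty and has to be retried. Here is a minimal polling sketch for that, assuming a callback-style check. `waitFor` and its option names are my own, not part of PhantomJS or the bridge module:

```javascript
// Call `check(done)` repeatedly until it reports a truthy value via `done`,
// or until `timeout` milliseconds have elapsed. Hypothetical helper.
function waitFor(check, onReady, onTimeout, options) {
  var interval = (options && options.interval) || 250;   // ms between attempts
  var timeout = (options && options.timeout) || 10000;   // ms before giving up
  var start = Date.now();

  (function poll() {
    check(function (value) {
      if (value) { return onReady(value); }
      if (Date.now() - start >= timeout) { return onTimeout(); }
      setTimeout(poll, interval);
    });
  })();
}
```

Inside the `page.open` callback, `check` would wrap the `page.evaluate` call and hand its result to `done`, `onReady` would log the HTML, and both `onReady` and `onTimeout` would call `ph.exit()`.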
Alexandr Lazarev