
I'm using Node with the request and cheerio modules to fetch data from an HTML page. This has worked fine so far, but one page loads additional data through AJAX to fill different containers. These containers are empty (and the selections undefined) when the initial request completes. What is the best way to handle this?

var request = require('request');
var cheerio = require('cheerio');

request(url, function (error, response, html) {
    if (!error && response.statusCode == 200) {

        var $ = cheerio.load(html);

        // undefined here, because the element is injected by AJAX after the page loads
        var forum_url = $('.this.url.is.loaded.separatly.with.ajax').eq(1).attr('href');
    }
});
Dilemmat_Dag
  • possible duplicate of [Incremental and non-incremental urls in node js with cheerio and request](http://stackoverflow.com/questions/25102561/incremental-and-non-incremental-urls-in-node-js-with-cheerio-and-request) – xmojmr Aug 15 '15 at 13:50

1 Answer


Cheerio isn't really designed with AJAX in mind: it parses static HTML and does not execute JavaScript. If you are able to extract the URLs the page requests via AJAX, you can fetch those separately, but you will likely have to maintain multiple separate `$` objects, as it's unlikely they can be merged easily.
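For example, here is a minimal sketch of that approach. All URLs and selectors below are hypothetical placeholders; in practice you would discover the AJAX endpoint by watching the network tab in your browser's dev tools:

var request = require('request');
var cheerio = require('cheerio');

// First request: the static page (placeholder URL)
request('http://example.com/forum', function (error, response, html) {
    if (error || response.statusCode != 200) return;

    var $page = cheerio.load(html); // $ object for the static HTML

    // Second request: the endpoint the page itself calls via AJAX
    // (placeholder URL, found by inspecting network traffic)
    request('http://example.com/forum/ajax?id=1', function (err, res, fragment) {
        if (err || res.statusCode != 200) return;

        var $fragment = cheerio.load(fragment); // separate $ object for the AJAX fragment
        var forum_url = $fragment('a').first().attr('href');
        console.log(forum_url);
    });
});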

Usually, in cases where you need to execute JavaScript found on a scraped page, we would turn to PhantomJS. Phantom is a headless browser that you control with JavaScript; it's pretty cool.

You can check out some Phantom.js web scraping code here: http://code4node.com/snippet/web-scraping-with-node-and-phantomjs
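Since that link may go stale, here is a minimal sketch of the idea (the URL, selector, and delay are placeholders, not from the original answer): a Phantom script loads the page, gives the page's own AJAX calls time to finish, then reads the resulting DOM inside page.evaluate.

// run with: phantomjs scrape.js
var page = require('webpage').create();

page.open('http://example.com/forum', function (status) {
    if (status !== 'success') {
        console.log('Failed to load page');
        phantom.exit(1);
    }

    // Crude but common: wait a fixed delay for AJAX content to arrive.
    // A more robust version would poll until the selector appears.
    setTimeout(function () {
        var href = page.evaluate(function () {
            // Runs in the page context, so AJAX-inserted elements are visible here
            var el = document.querySelector('.loaded-by-ajax a'); // hypothetical selector
            return el ? el.href : null;
        });
        console.log(href);
        phantom.exit();
    }, 3000);
});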

Let Me Tink About It
TylerWaite17
  • So there's no additional param or way to hold and wait for the page to load before calling cheerio.load? Or is it possible to use the DOMNodeInserted event? Or is there another similar Node module for this? There has to be a workaround; Phantom is not an option for me in this case. Interested in how others solved similar problems. – Dilemmat_Dag Aug 15 '15 at 13:06
  • Solved my problem by inspecting the DOM to see if there were other ways to iterate through the data. I found that every AJAX call used the same URL with different query ids, so I stored the ids in a first loop and then iterated through them with async's eachSeries (see the sketch after these comments). – Dilemmat_Dag Aug 16 '15 at 13:47
  • Your link is not working anymore. `http://code4node.com/snippet/web-scraping-with-node-and-phantomjs` – Let Me Tink About It May 27 '20 at 17:39
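For reference, a minimal sketch of the approach described in that comment, with hypothetical URLs and a hypothetical data-id attribute standing in for wherever the ids actually appear; async.eachSeries fetches one id at a time:

var request = require('request');
var cheerio = require('cheerio');
var async = require('async');

// First pass: collect the ids the page's AJAX calls use (placeholder selector)
request('http://example.com/forum', function (error, response, html) {
    if (error || response.statusCode != 200) return;

    var $ = cheerio.load(html);
    var ids = [];
    $('[data-id]').each(function () {
        ids.push($(this).attr('data-id'));
    });

    // Second pass: hit the AJAX endpoint once per id, in series
    async.eachSeries(ids, function (id, done) {
        request('http://example.com/forum/ajax?id=' + id, function (err, res, fragment) {
            if (err) return done(err);
            var $fragment = cheerio.load(fragment);
            // ...extract what you need from $fragment here...
            done();
        });
    }, function (err) {
        if (err) console.error(err);
    });
});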