How to most efficiently parse a web page using Node.js

Question

I need to parse a simple web page and extract data from its HTML, such as "src" and "data-attr" attribute values. How can I do this most efficiently using Node.js? If it helps, I'm using Node.js 0.8.x.

P.S. This is the site I'm parsing. I want to get a list of current tracks and make my own HTML5 app for listening on mobile devices.

score 59 · Accepted Answer · answered Sep 14 '12 at 18:30

I have done this a lot. You'll want to use PhantomJS if the website that you're scraping is heavily using JavaScript. Note that PhantomJS is not Node.js. It's a completely different JavaScript runtime. You can integrate through phantomjs-node or node-phantom, but they are both kinda hacky. YMMV with those. Avoid anything to do with jsdom. It'll cause you headaches - this includes Zombie.js.

What you should use is Cheerio in conjunction with Request. This will be sufficient for most web pages.

I wrote a blog post on using Cheerio with Request: Quick and Dirty Screen Scraping with Node.js. But, again, if the site is JavaScript-intensive, use PhantomJS in conjunction with CasperJS.

Hope this helps.

Snippet using Request and Cheerio:

var request = require('request')
  , cheerio = require('cheerio');

var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;

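request(url, function(err, resp, body){
  if (err) return console.error(err); // bail out on network errors
  var $ = cheerio.load(body); // declare locally instead of leaking a global
  var links = $('.sb_tlst h3 a'); //use your CSS selector here
  links.each(function(i, link){
    console.log($(link).text() + ':\n  ' + $(link).attr('href'));
  });
});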

I have a question for you, Richardson. I really hope PhantomJS can do what I have in mind: is it possible to interact with a site on a different domain, for example to log in and post a thread (on a forum, say)? I'd like to see something like this (C# sample): http://stackoverflow.com/questions/14000185/how-to-interact-with-a-website-without-a-browser — Ito, Jun 08 '14 at 13:11
@UladzimirHavenchyk yes, these are still my preferred methods. — JP Richardson, Jul 08 '15 at 11:06
What about using cheerio with phantomjs+casperjs? Then you get a faster jquery (because all you need is to scrape, not mutate the dom) and browser-side javascript! Or would it be better to just embed jquery all the time? — CMCDragonkai, Feb 23 '17 at 02:14

score 4 · Answer 2 · answered Sep 13 '12 at 10:41

You could try PhantomJS. Here's the documentation for using it for screen scraping.

jabclab


Is it fast? I think loading a full WebKit engine is too heavy. – NiLL Sep 13 '12 at 13:28
I'm afraid I haven't used it myself, sorry. – jabclab Sep 13 '12 at 14:51
PhantomJS is slow, relatively speaking that is. – JP Richardson Sep 14 '12 at 18:34

score 3 · Answer 3 · answered Mar 27 '15 at 20:09

I agree with @JP Richardson that Cheerio is best for scraping non-JS-heavy sites. For JS-heavy sites, use Casper. It provides great abstractions over Phantom and a promises-style API. They go over how to scrape in their docs: http://docs.casperjs.org/en/latest/quickstart.html.

score 0 · Answer 4 · answered Jun 15 '14 at 03:23

If you want to go for Phantom, use node-phantom. I have a GitHub repository that uses them together to generate PDF files from HTML, if you want to have a look. But I wouldn't go for Phantom, because it does more than you usually need and Cheerio is faster.
