29

I need to parse a simple web page and get data from html, such as "src", "data-attr", etc. How can I do this most efficiently using Node.js? If it helps, I'm using Node.js 0.8.x.

P.S. This is the site I'm parsing. I want to get a list of the current tracks and build my own HTML5 app for listening on mobile devices.

jacksondc
NiLL

4 Answers

59

I have done this a lot. You'll want to use PhantomJS if the website that you're scraping is heavily using JavaScript. Note that PhantomJS is not Node.js. It's a completely different JavaScript runtime. You can integrate through phantomjs-node or node-phantom, but they are both kinda hacky. YMMV with those. Avoid anything to do with jsdom. It'll cause you headaches - this includes Zombie.js.

What you should use is Cheerio in conjunction with Request. This will be sufficient for most web pages.

I wrote a blog post on using Cheerio with Request: Quick and Dirty Screen Scraping with Node.js. But, again, if the site is JavaScript-intensive, use PhantomJS in conjunction with CasperJS.

Hope this helps.

Snippet using Request and Cheerio:

var request = require('request');
var cheerio = require('cheerio');

var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;

request(url, function (err, resp, body) {
  if (err) {
    // the request itself failed, so there is no body to parse
    return console.error(err);
  }
  // declare with var so $ and links do not leak into the global scope
  var $ = cheerio.load(body);
  var links = $('.sb_tlst h3 a'); // use your CSS selector here
  links.each(function (i, link) {
    console.log($(link).text() + ':\n  ' + $(link).attr('href'));
  });
});
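
Since the question asks specifically about attributes such as "src" and "data-attr", here is a minimal sketch of collecting them once the page has been loaded. It is not part of the original snippet: collectAttributes is a hypothetical helper name, and the audio selector and data-title attribute in the usage comment are invented placeholders for whatever markup the target page actually uses. It relies only on Cheerio's .each() and .attr() methods.

```javascript
// Hypothetical helper: returns one plain object per element matching
// `selector`, holding the requested attribute values.
// `$` is the function returned by cheerio.load(html).
function collectAttributes($, selector, attrNames) {
  var rows = [];
  $(selector).each(function (i, el) {
    var row = {};
    attrNames.forEach(function (name) {
      // .attr() returns undefined when the attribute is absent
      row[name] = $(el).attr(name);
    });
    rows.push(row);
  });
  return rows;
}

// Usage sketch (needs `npm install cheerio`; the markup is invented):
//   var cheerio = require('cheerio');
//   var $ = cheerio.load('<audio src="/a.mp3" data-title="A"></audio>');
//   collectAttributes($, 'audio', ['src', 'data-title']);
```

Passing $ in as an argument keeps the helper independent of how the HTML was fetched, so the same function works inside the request callback above.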
JP Richardson
  • I have a question for you, Richardson. I really hope PhantomJS can do what I have in mind: is it possible to interact with a site on a different domain, for example to log in and post a thread (on a forum, say)? I'd like to see something like this C# sample: http://stackoverflow.com/questions/14000185/how-to-interact-with-a-website-without-a-browser – Ito Jun 08 '14 at 13:11
  • @jp-richardson is this answer still valid? – uladzimir Jul 08 '15 at 10:26
  • @UladzimirHavenchyk yes, these are still my preferred methods. – JP Richardson Jul 08 '15 at 11:06
  • What about using Cheerio with PhantomJS + CasperJS? Then you get a faster jQuery (because all you need is to scrape, not mutate the DOM) and browser-side JavaScript! Or would it be better to just embed jQuery all the time? – CMCDragonkai Feb 23 '17 at 02:14
4

You could try PhantomJS. Here's the documentation for using it for screen scraping.

jabclab
3

I agree with @JP Richardson that Cheerio is best for scraping non-JS-heavy sites. For JS-heavy sites, use Casper. It provides great abstractions over Phantom and a promises-style API. They go over how to scrape in their docs: http://docs.casperjs.org/en/latest/quickstart.html.

Max Heiber
0

If you want to go with Phantom, use node-phantom. I have a GitHub repository that uses them together to generate PDF files from HTML, if you want to have a look. But I wouldn't go with Phantom, because it does more than you usually need and Cheerio is faster.

Mustafa